INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
The attached patch implements INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE. This is similar to INSERT...ON DUPLICATE KEY IGNORE (which is
also proposed as part of this new revision of the patch), but
additionally acquires an exclusive row lock on the existing row whose
presence prevents insertion of a given proposed tuple.
This feature offers something that I believe could be reasonably
described as upsert. Consider:
postgres=# create table foo(a int4 primary key, b text);
CREATE TABLE
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 0
Here 0 rows are affected by the update, because all the work was done
by the insert. If I do it again, 2 rows are affected by the
update:
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 2
Obviously, rejects were now projected into the wCTE, and the
underlying rows were locked. The idea is that we can update the rows,
confident that each rejection-causing row will be updated in a
race-free fashion. I personally prefer this to something like MySQL's
INSERT...ON DUPLICATE KEY UPDATE, because it's more flexible. For
example, we could have deleted the locked rows instead, if that
happened to make sense. Making this kind of usage idiomatic feels to
me like the Postgres way to do upsert. Others may differ here. I will
however concede that it'll be unfortunate to not have some MySQL
compatibility, for the benefit of people porting widely used web
frameworks.
I'm not really sure if I should have done something brighter here than
lock the first duplicate found, or if it's okay that that's all I do.
That's another discussion entirely. Though previously Andres and I did
cover the question of prioritizing unique indexes, so that the most
sensible duplicate for the particular situation was returned,
according to some criteria.
As previously covered, I felt that including a row locking component
was essential to reasoning about our requirements for what I've termed
"speculative insertion" -- the basic implementation of value locking
that is needed to make all this work. As I said in that earlier
thread, there are many opinions about this, and it isn't obvious which
one is right. Any approach needs to have its interactions with row
locking considered right up front. Those who consider this a new
patch with new functionality, or even a premature expansion on what
I've already posted should carefully consider that. Do we really want
to assume that these two things are orthogonal? I think that they're
probably not, but even if that turns out not to be the case, it's an
unnecessary risk to take.
Row locking
==========
Row locking is implemented with calls to a new function above
ExecInsert. We don't bother with the usual EvalPlanQual looping
pattern for now, preferring to just re-check from scratch if there is
a concurrent update from another session (see comments in
ExecLockHeapTupleForUpdateSpec() for details). We might do better
here. I haven't considered the row locking functionality in too much
detail since the last revision, preferring to focus on value locking.
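To sketch the control flow (a simplification for illustration only; the
helper names below are placeholders, not the patch's actual functions):

    bool speculative_insert_found_conflict(void);  /* placeholder: true if a duplicate was found */
    bool lock_conflicting_row(void);               /* placeholder: true if that row was locked */

    static void
    upsert_row_locking_sketch(void)
    {
        for (;;)
        {
            /* First phase: take value locks and look for a conflicting tuple. */
            if (!speculative_insert_found_conflict())
                return;         /* no duplicate - ordinary insertion completed */

            /* Second phase: lock the pre-existing (rejecting) row. */
            if (lock_conflicting_row())
                return;         /* row locked; it can now be projected as a
                                 * reject for the wCTE to UPDATE or DELETE */

            /*
             * The duplicate was concurrently updated or deleted by another
             * session.  Rather than the usual EvalPlanQual looping, just
             * restart the speculative insertion from scratch.
             */
        }
    }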
Buffer locking/value locking
======================
Andres raised concerns about the previous patch's use of exclusive
buffer locks for extended periods (i.e. during a single heap tuple
insertion). These locks served as extended value locks. With this
revision, we don't hold exclusive buffer locks for the duration of
heap insertion - we hold shared buffer locks instead. I believe that
Andres' principal concern was the impact on concurrent index scans by
readers, so I think that all of this will go some way towards
alleviating his concerns generally.
This necessitated inventing entirely new LWLock semantics around
"weakening" (from exclusive to shared) and "strengthening" (from
shared to exclusive) of locks already held. Of course, as you'd
expect, there are some tricky race hazards surrounding these new
functions that clients need to be mindful of. These have been
documented within lwlock.c.
I looked for a precedent for these semantics, and found a few. Perhaps
the most prominent was Boost, a highly regarded, peer-reviewed C++
library. Boost implements exactly these semantics for some of its
thread synchronization/mutex primitives:
They have a concept of upgradable ownership, which is just like shared
ownership, except, I gather, that the owner reserves the exclusive
right to upgrade to an exclusive lock (for them it's not quite an
exclusive lock; it's an upgradeable/downgradable exclusive lock). My
solution is to push that responsibility onto the client - I admonish
something along the lines of "don't let more than one shared locker do
this at a time per LWLock". I am of course mindful of this caveat in
my modifications to the btree code, where I "weaken" and then later
"strengthen" an exclusive lock - the trick here is that before I
weaken I get a regular exclusive lock, and I only actually weaken
after that when going ahead with insertion.
I suspect that this may not be the only place where this trick is helpful.
This intended usage is described in the relevant comments added to lwlock.c.
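To make that pattern concrete, here is a rough sketch of how the btree
insertion path is meant to use it (not the patch's actual code; I'm
assuming for illustration that LWLockWeaken() and LWLockStrengthen() take
the lock id just as LWLockAcquire() and LWLockRelease() do):

    #include "postgres.h"
    #include "storage/lwlock.h"

    /* Added by the patch; exact signatures assumed here for illustration. */
    extern void LWLockWeaken(LWLockId lockid);
    extern void LWLockStrengthen(LWLockId lockid);

    static void
    value_lock_sketch(LWLockId leaf_content_lock)
    {
        /* Take a regular exclusive lock first, as btree insertion does today. */
        LWLockAcquire(leaf_content_lock, LW_EXCLUSIVE);

        /* ... find the insertion point, run the uniqueness check ... */

        /*
         * Weaken to shared only once we're going ahead with insertion: the
         * value lock is retained, but concurrent index scans of the leaf
         * page are no longer blocked during heap tuple insertion.
         */
        LWLockWeaken(leaf_content_lock);

        /* ... heap tuple insertion happens here, under shared locks only ... */

        /*
         * Strengthen back to exclusive before physically inserting the index
         * tuple.  Per the caveat above, no more than one shared locker may
         * ever attempt this per LWLock at a time.
         */
        LWLockStrengthen(leaf_content_lock);

        /* ... insert the index tuple, then release as usual ... */
        LWLockRelease(leaf_content_lock);
    }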
Testing
======
This time around, in order to build confidence in the new LWLock
infrastructure for buffer locking, on debug builds we re-verify that
the value proposed for insertion on the locked page is in fact not on
that page as expected during the second phase, and that our previous
insertion point calculation is still considered correct. This is kind
of like the way we re-verify the wait-queue-is-in-lsn-order
invariant in syncrep.c on debug builds. It's really a fancier
assertion - it doesn't just test the state of scalar variables.
This was invaluable during development of the new LWLock infrastructure.
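Conceptually, the check amounts to something like this (helper names
invented here purely for illustration; the real check lives in the btree
code):

    #ifdef USE_ASSERT_CHECKING
        /*
         * During the second phase, re-verify the first phase's conclusions:
         * the value we propose to insert must still be absent from the
         * locked leaf page, and the insertion point computed earlier must
         * still be valid.  Similar in spirit to syncrep.c asserting
         * SyncRepQueueIsOrderedByLSN() on debug builds.
         */
        Assert(!leaf_page_contains_value(page, itup));            /* placeholder */
        Assert(insertion_point_still_valid(page, itup, offnum));  /* placeholder */
    #endif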
Just as before, but this time with just shared buffer locks held
during heap tuple insertion, the patch has resisted considerable
brute-force efforts to break it (e.g. using pgbench to get many
sessions speculatively inserting values into a table; many different
INSERT... ON DUPLICATE KEY LOCK FOR UPDATE statements interspersed
with UPDATE, DELETE and SELECT statements; checking whether spurious
duplicate tuple insertions, deadlocks or assertion failures occur).
As always, isolation tests are included.
Bugs
====
I fixed the bug that Andres reported in relation to multiple unique
indexes' interaction with waits for another transaction's end during
speculative insertion.
I did not get around to fixing the broken ecpg regression tests, as
reported by Peter Eisentraut. I was a little puzzled by the problem
there. I'll return to it in a while, or perhaps someone else can
propose a solution.
Thoughts?
--
Peter Geoghegan
On Sun, Sep 8, 2013 at 10:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
This necessitated inventing entirely new LWLock semantics around
"weakening" (from exclusive to shared) and "strengthening" (from
shared to exclusive) of locks already held. Of course, as you'd
expect, there are some tricky race hazards surrounding these new
functions that clients need to be mindful of. These have been
documented within lwlock.c.
I've since found that I can fairly reliably get this to deadlock at
high client counts (say, 95, which will do it on my 4 core laptop with
a little patience). To get this to happen, I used pgbench with a
single INSERT...ON DUPLICATE KEY IGNORE transaction script. The more
varied workload that I mostly used to test this most recent revision
(v2), with a transaction consisting of a mixture of different
statements (UPDATEs, DELETEs, INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE), did not show the problem.
What I've been doing to recreate this is pgbench runs in an infinite
loop from a bash script, with a new table created for each iteration.
Each iteration has 95 clients "speculatively insert" a total of 1500
possible tuples for 15 seconds. After this period, the table has
exactly 1500 tuples, with primary key values 1 - 1500. Usually, after
about 5 - 20 minutes, deadlock occurs.
This was never a problem with the exclusive lock coding (v1),
unsurprisingly - after all, as far as buffer locks are concerned, it
did much the same thing as the existing code.
I've made some adjustments to LWLockWeaken, LWLockStrengthen and
LWLockRelease that made the deadlocks go away. Or at least, no
deadlocks or other problems manifested themselves using the same test
case for over two hours. Attached revision includes these changes, as
well as a few minor comment tweaks here and there.
I am working on an analysis of the broader deadlock hazards - the
implications of simultaneously holding multiple shared buffer locks
(that is, one for every unique index btree leaf page participating in
value locking) for the duration of each heap tuple insertion (each
heap_insert() call). I'm particularly looking for unexpected ways in
which this locking could interact with other parts of the code that
also acquire buffer locks, for example vacuumlazy.c. I'll also try and
estimate how much of a maintainability burden unexpected locking
interactions with these other subsystems might be.
In case it isn't obvious, the deadlocking issue addressed by this
revision is not inherent to my design or anything like that - the bugs
fixed by this revision are entirely confined to lwlock.c.
--
Peter Geoghegan
Hi Peter,
Nice to see the next version, won't have time to look at it in any
detail in the next few days tho.
On 2013-09-10 22:25:34 -0700, Peter Geoghegan wrote:
I am working on an analysis of the broader deadlock hazards - the
implications of simultaneously holding multiple shared buffer locks
(that is, one for every unique index btree leaf page participating in
value locking) for the duration of each heap tuple insertion (each
heap_insert() call). I'm particularly looking for unexpected ways in
which this locking could interact with other parts of the code that
also acquire buffer locks, for example vacuumlazy.c. I'll also try and
estimate how much of a maintainability burden unexpected locking
interactions with these other subsystems might be.
I think for this approach to be workable you also need to explain how we
can deal with stuff like toast insertion that may need to write hundreds
of megabytes all the while leaving an entire value-range of the unique
key share locked.
I still think that even doing a plain heap insertion takes longer than
is acceptable for holding even a share lock over a btree page, but as
long as stuff like toast insertions can happen while doing so, that's
peanuts by comparison.
The easiest answer is doing the toasting before doing the index locking,
but that will result in bloat, the avoidance of which seems to be the
primary advantage of your approach.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 11, 2013 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Nice to see the next version, won't have time to look in any details in
the next few days tho.
Thanks Andres!
I think for this approach to be workable you also need to explain how we
can deal with stuff like toast insertion that may need to write hundreds
of megabytes all the while leaving an entire value-range of the unique
key share locked.
Right. That is a question that needs to be addressed in a future revision.
I still think that even doing a plain heap insertion takes longer than
is acceptable for holding even a share lock over a btree page
Well, there is really only one way of judging something like that, and
that's to do a benchmark. I still haven't taken the time to "pick the
low hanging fruit" here that I'd mentioned - there are some fairly
obvious ways to shorten the window in which value locks are held.
Furthermore, I'm sort of at a loss as to what a fair benchmark would
look like - what is actually representative here? Also, what's the
baseline? It's not as if someone has an alternative, competing patch.
We can only hypothesize what additional costs those other approaches
introduce, unless someone has a suggestion as to how they can be
simulated without writing the full patch, which is something I'd
entertain.
As I've already pointed out, all page splits occur with the same
buffer exclusive lock held. Only, in our case, we're weakening that
lock to a shared lock. So I don't think that the heap insertion is
going to be that big of a deal, particularly in the average case.
Having said that, it's a question that surely must be closely examined
before proceeding much further. And yes, the worst case could be
pretty bad, and that surely matters too.
The easiest answer is doing the toasting before doing the index locking,
but that will result in bloat, the avoidance of which seems to be the
primary advantage of your approach.
I would say that the primary advantage of my approach is that it's
much simpler than any other approach that has been considered by
others in the past. The approach is easier to reason about because
it's really just an extension of how btrees already do value locking.
Granted, I haven't adequately demonstrated that things really are so
rosy, but I think I'll be able to. The key point is that with trivial
exception, all other parts of the code, like VACUUM, don't consider
themselves to directly have license to acquire locks on btree buffers
- they go through the AM interface instead. What do they know about
what makes sense for a particular AM? The surface area actually turns
out to be fairly manageable.
With the promise tuple approach, it's more the maintainability
overhead of new *classes* of bloat that I'm concerned about than the
bloat itself, and all the edge cases that are likely to be introduced.
But yes, the overhead of doing all that extra writing (including
WAL-logging twice), and the fact that it all has to happen with an
exclusive lock on the leaf page buffer is also a concern of mine. With
v3 of my patch, we still only have to do all the preliminary work like
finding the right page and verifying that there are no duplicates
once. So with recent revisions, the amount of time spent exclusive
locking with my proposed approach is now approximately half the time
of alternative proposals (assuming no page split is necessary). In the
worst case, the number of values locked on the leaf page is quite
localized and manageable, as a natural consequence of the fact that
it's a btree leaf page. I haven't run any numbers, but for an int4
btree (which really is the worst case here), 200 or so read-locked
values would be quite close to as bad as things got. Plus, if there
isn't a second phase of locking, which is on average a strong
possibility, those locks would be hardly held at all - contrast that
with having to do lots of exclusive locking for all that clean-up.
I might experiment with weakening the exclusive lock even earlier in
my next revision, and/or strengthening later. Off hand, I can't see a
reason for not weakening after we find the first leaf page that the
key might be on (granted, I haven't thought about it that much) -
_bt_check_unique() does not have license to alter the buffer already
proposed for insertion. Come to think of it, all of this new buffer
lock weakening/strengthening stuff might independently justify itself
as an optimization to regular btree index tuple insertion. That's a
whole other patch, though -- it's a big ambition to have as a sort of
incidental adjunct to what is already a big, complex patch.
In practice the vast majority of insertions don't involve TOASTing.
That's not an excuse for allowing the worst case to be really bad in
terms of its impact on query response time, but it may well justify
having whatever ameliorating measures we take result in bloat. It's at
least the kind of bloat we're more or less used to dealing with, and
have already invested a lot in controlling. Plus bloat-wise it can't
be any worse than just inserting the tuple and having the transaction
abort on a duplicate, since that already happens after toasting has
done its work with regular insertion.
--
Peter Geoghegan
On Wed, Sep 11, 2013 at 8:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
In practice the vast majority of insertions don't involve TOASTing.
That's not an excuse for allowing the worst case to be really bad in
terms of its impact on query response time, but it may well justify
having whatever ameliorating measures we take result in bloat. It's at
least the kind of bloat we're more or less used to dealing with, and
have already invested a lot in controlling. Plus bloat-wise it can't
be any worse than just inserting the tuple and having the transaction
abort on a duplicate, since that already happens after toasting has
done its work with regular insertion.
Andres is being very polite here, but the reality is that this
approach has zero chance of being accepted. You can't hold buffer
locks for a long period of time across complex operations. Full stop.
It's a violation of the rules that are clearly documented in
src/backend/storage/buffer/README, which have been in place for a very
long time, and this patch is nowhere near important enough to warrant
a revision of those rules. We are not going to risk breaking every
bit of code anywhere in the backend or in third-party code that takes
a buffer lock. You are never going to convince me, or Tom, that the
benefit of doing that is worth the risk; in fact, I have a hard time
believing that you'll find ANY committer who thinks this approach is
worth considering.
Even if you get the code to run without apparent deadlocks, that
doesn't mean there aren't any; it just means that you haven't found
them all yet. And even if you managed to squash every such hazard
that exists today, so what? Fundamentally, locking protocols that
don't include deadlock detection don't scale. You can use such locks
in limited contexts where proofs of correctness are straightforward,
but trying to stretch them beyond that point results not only in bugs,
but also in bad performance and unmaintainable code. With a much more
complex locking regimen, even if your code is absolutely bug-free,
you've put a burden on the next guy who wants to change anything; how
will he avoid breaking things? Our buffer locking regimen suffers
from painful complexity and serious maintenance difficulties as is.
Moreover, we've already got performance and scalability problems that
are attributable to every backend in the system piling up waiting on a
single lwlock, or a group of simultaneously-held lwlocks.
Dramatically broadening the scope of where lwlocks are used and for
how long they're held is going to make that a whole lot worse. What's
worse, the problems will be subtle, restricted to the people using
this feature, and very difficult to measure on production systems, and
I have no confidence they'd ever get fixed.
A further problem is that a backend which holds even one lwlock can't
be interrupted. We've had this argument before and it seems that you
don't think that non-interruptibility is a problem, but it is project
policy to allow for timely interrupts in all parts of the backend and
we're not going to change that policy for this patch. Heavyweight
locks are heavy weight precisely because they provide services - like
deadlock detection, satisfactory interrupt handling, and, also
importantly, FIFO queuing behavior - that are *important* for locks
that are held over an extended period of time. We're not going to go
put those services into the lightweight lock mechanism because then it
would no longer be light weight, and we're not going to ignore the
importance of them, either.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Andres is being very polite here, but the reality is that this
approach has zero chance of being accepted.
I quite like Andres, but I have yet to see him behave as you describe
in a situation where someone proposed what was fundamentally a bad
idea. Maybe you should let him speak for himself?
You can't hold buffer
locks for a long period of time across complex operations. Full stop.
It's a violation of the rules that are clearly documented in
src/backend/storage/buffer/README, which have been in place for a very
long time, and this patch is nowhere near important enough to warrant
a revision of those rules.
The importance of this patch is a value judgement. Our users have been
screaming for this for over ten years, so to my mind it has a fairly
high importance. Also, every other database system of every stripe
worth mentioning has something approximately equivalent to this,
including ones with much less functionality generally. The fact that
we don't is a really unfortunate omission.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, when I've managed the
amount of work that a heap_insert() can do better. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
We are not going to risk breaking every
bit of code anywhere in the backend or in third-party code that takes
a buffer lock. You are never going to convince me, or Tom, that the
benefit of doing that is worth the risk; in fact, I have a hard time
believing that you'll find ANY committer who thinks this approach is
worth considering.
I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.
Even if you get the code to run without apparent deadlocks, that
doesn't mean there aren't any;
Of course it doesn't. Who said otherwise?
Our buffer locking regimen suffers
from painful complexity and serious maintenance difficulties as is.
That's true to a point, but it has more to do with things like how
VACUUM interacts with hio.c. Things like this:
    /*
     * Release the file-extension lock; it's now OK for someone else to extend
     * the relation some more. Note that we cannot release this lock before
     * we have buffer lock on the new page, or we risk a race condition
     * against vacuumlazy.c --- see comments therein.
     */
    if (needLock)
        UnlockRelationForExtension(relation, ExclusiveLock);
The btree code is different, though: It implements a well-defined
interface, with much clearer separation of concerns. As I've said
already, with trivial exception (think contrib), no external code
considers itself to have license to obtain locks of any sort on btree
buffers. No external code of ours - without exception - does anything
with multiple locks, or exclusive locks on btree buffers. I'll remind
you that I'm only holding shared locks when control is outside of the
btree code.
Even within the btree code, the number of access method functions that
could conflict with what I do here (that acquire exclusive locks) is
very small when you exclude things that only exclusive lock the
meta-page (there are also very few of those). So the surface area is
quite small.
I'm not denying that there is a cost, or that I haven't expanded
things in a direction I'd prefer not to. I just think that it may well
be worth it, particularly when you consider the alternatives - this
may well be the least worst thing. I mean, if we do the promise tuple
thing, and there are multiple unique indexes, what happens when an
inserter needs to block pending the outcome of another transaction?
They had better go clean up the promise tuples from the other unique
indexes that they're trying to insert into, because they cannot afford
to hold value locks for a long time, no matter how they're
implemented. That could take much longer than just releasing a shared
buffer lock, since for each unique index the promise tuple must be
re-found from scratch. There are huge issues with additional
complexity and bloat. Oh, and now your lightweight locks aren't so
lightweight any more.
If the value locks were made interruptible through some method, such
as the promise tuples approach, does that really make deadlocking
acceptable? So at least your system didn't seize up. But on the other
hand, the user randomly had a deadlock error through no fault of their
own. The former may be worse, but the latter is also inexcusable. In
general, the best solution is just to not have deadlock hazards. I
wouldn't be surprised if reasoning about deadlocking was harder with
that alternative approach to value locking, not easier.
Moreover, we've already got performance and scalability problems that
are attributable to every backend in the system piling up waiting on a
single lwlock, or a group of simultaneously-held lwlocks.
Dramatically broadening the scope of where lwlocks are used and for
how long they're held is going to make that a whole lot worse.
You can hardly compare a buffer's LWLock with a system one that
protects critical shared memory structures. We're talking about a
shared lock on a single btree leaf page per unique index involved in
upserting.
A further problem is that a backend which holds even one lwlock can't
be interrupted. We've had this argument before and it seems that you
don't think that non-interruptibility is a problem, but it is project
policy to allow for timely interrupts in all parts of the backend and
we're not going to change that policy for this patch.
I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering trade-off.
Of course I value interruptibility.
If you're concerned about non-interruptibility, consider XLogFlush().
That does rather a lot of work with WALWriteLock exclusive locked. On
a busy system, some backend is very frequently going to experience a
non-interruptible wait for the duration of however long it takes to
write and flush perhaps a whole segment. All other flushing backends
are stuck in non-interruptible waits waiting for that backend to
finish. I think that the group commit stuff might have regressed
worst-case interruptibility for flushers by quite a bit; should we
have never committed that, or do you agree with my view that it's
worth it?
In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked. And only those inserting into a narrow
range of values around the btree leaf page. Much of the work that even
those writers need to do will be unimpeded anyway; they'll just block
on attempting to acquire an exclusive lock on the first btree leaf
page that the value they're inserting could be on. And the additional
non-interruptible wait of those inserters won't be terribly much more
than the wait of the backend where heap tuple insertion takes a long
time anyway - that guy already has to do close to 100% of that work
with a non-interruptible wait today (once we eliminate
heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is
right at the end of heap_insert, and the buffer is pinned and locked
very close to the start.
--
Peter Geoghegan
* Peter Geoghegan (pg@heroku.com) wrote:
I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.
It was my first concern regarding this patch.
Thanks,
Stephen
On Fri, Sep 13, 2013 at 12:14 PM, Stephen Frost <sfrost@snowman.net> wrote:
It was my first concern regarding this patch.
It was my first concern too.
--
Peter Geoghegan
On 2013-09-13 11:59:54 -0700, Peter Geoghegan wrote:
On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Andres is being very polite here, but the reality is that this
approach has zero chance of being accepted.
I quite like Andres, but I have yet to see him behave as you describe
in a situation where someone proposed what was fundamentally a bad
idea. Maybe you should let him speak for himself?
Unfortunately I have to agree with Robert here, I think it's a complete
nogo to do what you propose so far and I've several times now presented
arguments why I think so.
The reasons I wasn't saying "this will never get accepted" are twofold:
a) I don't want to stifle alternative ideas to the "promises" idea,
just because I think it's the way to go. That might stop a better idea
from being articulated. b) I am not actually in the position to say it's
not going to be accepted.
*I* think that unless you make some fundamental and very, very clever
modifications to your algorithm that end up *not holding a lock over
other operations at all*, it's not going to get committed. And I'll chip
in with my -1.
And clever modification doesn't mean slightly restructuring heapam.c's
operations.
The importance of this patch is a value judgement. Our users have been
screaming for this for over ten years, so to my mind it has a fairly
high importance. Also, every other database system of every stripe
worth mentioning has something approximately equivalent to this,
including ones with much less functionality generally. The fact that
we don't is a really unfortunate omission.
I aggree it's quite important but that doesn't mean we have to do stuff
that we think are unacceptable, especially as there *are* other ways to
do it.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, when I've managed the
amount of work that a heap_insert() can do better. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
I personally think - and have stated so before - that doing a
heap_insert() while holding the btree lock is unacceptable.
The btree code is different, though: It implements a well-defined
interface, with much clearer separation of concerns.
Which you're completely violating by linking the btree buffer locking
with the heap locking. It's not about the btree code alone.
At this point I am a bit confused why you are asking for review.
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a long
time, no matter how they're implemented.
Why? We're using normal transaction visibility rules here. We don't stop
*other* values on the same index getting updated or similar.
And anyway. It doesn't matter which problem the "promises" idea
has. We're discussing your proposal here.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The reasons I wasn't saying "this will never get accepted" are twofold:
a) I don't want to stifle alternative ideas to the "promises" idea,
just because I think it's the way to go. That might stop a better idea
from being articulated. b) I am not actually in the position to say it's
not going to be accepted.
Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it. After all, it was the first thing that
I considered, and I'm on the record talking about it in the 2012 dev
meeting. I didn't take that approach for many good reasons.
The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, when I've managed the
amount of work that a heap_insert() can do better. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
I personally think - and have stated so before - that doing a
heap_insert() while holding the btree lock is unacceptable.
Presumably your reason is essentially that we exclusive lock a heap
buffer (exactly one heap buffer) while holding shared locks on btree
index buffers. Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.
Now, I'll grant you that there is one appreciable difference, which is
that multiple unique indexes may be involved. But limiting ourselves
to the primary key or something like that remains an option. And I'm
not sure that it's really any worse anyway.
The btree code is different, though: It implements a well-defined
interface, with much clearer separation of concerns.
Which you're completely violating by linking the btree buffer locking
with the heap locking. It's not about the btree code alone.
You're right that it isn't about just the btree code.
In order for a deadlock to occur, there must be a mutual dependency.
What code could feel entitled to hold buffer locks on btree buffers
and heap buffers at the same time except the btree code itself? It
already does so. But no one else does the same thing. If anyone did
anything with a heap buffer lock held that could result in a call into
one of the btree access method functions (I'm not contemplating the
possibility of this other code locking the btree buffer *directly*),
I'm quite sure that that would be rejected outright today, because
that causes deadlocks. Certainly, vacuumlazy.c doesn't do it, for
example. Why would anyone ever want to do that anyway? I cannot think
of any reason. I suppose that that does still leave "transitive
dependencies", but now you're stretching things. After all, you're not
supposed to hold buffer locks for long! The dependency would have to
transit through, say, one of the system LWLocks used for WAL Logging.
Seems pretty close to impossible that it'd be an issue - index stuff
is only WAL-logged as index tuples are inserted (that is, as the locks
are finally released). Everyone automatically does that kind of thing
in a consistent order of locking, unlocking in the opposite order
anyway.
But what of the btree code deadlocking with itself? There are only a
few functions (2 or 3) where that's possible even in principle. I
think that they're going to be not too hard to analyze. For example,
with insertion, the trick is to always lock in a consistent order and
unlock/insert in the opposite order. The heap shared lock(s) needed in
the btree code cannot deadlock with another upserter because once the
other upserter has that exclusive heap buffer lock, it's *inevitable*
that it will release all of its shared buffer locks. Granted, I could
stand to think about this case more, but you get the idea - it *is*
possible to clamp down on the code that needs to care about this stuff
to a large degree. It's subtle, but btrees are generally considered
pretty complicated, and the btree code already cares about some odd
cases like these (it takes special precautions for catalog indexes,
for example).
The really weird thing about my patch is that the btree code trusts
the executor to call the heapam code to do the right thing in the
right way - it now knows more than I'd prefer. Would you be happier if
the btree code took more direct responsibility for the heap tuple
insertion instead? Before you say "that's ridiculous", consider the
big modularity violation that has always existed. It may be no more
ridiculous than that. And that existing state of affairs may be no
less ridiculous than living with what I've already done.
At this point I am a bit confused why you are asking for review.
I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussing
on this down through the years, but nothing ever came from it.
Why is this an intractable problem for over a decade for us alone? Why
isn't this a problem for other database systems? I'm not implying that
it's because they do this. It's something that I am earnestly
interested in, though. A number of people have asked me that, and I
don't have a good answer for them.
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a long
time, no matter how they're implemented.
Why? We're using normal transaction visibility rules here. We don't stop
*other* values on the same index getting updated or similar.
Because you're locking a value in some other, earlier unique index,
all the while waiting *indefinitely* on some other value in a second
or subsequent one. That isn't acceptable. A bunch of backends would
back up just because one backend had this contention on the second
unique index value that the others didn't actually have themselves. My
design allows those other backends to immediately go through and
finish.
Value locks have these kinds of hazards no matter how you implement
them. Deadlocks, and unreasonable stalling as described here is always
unacceptable - whether or not the problems are detected at runtime is
ultimately of marginal interest. Either way, it's a bug.
I think that the details of how this approach compare to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?". Am I not allowed to propose a
design to drive discussion? That's how the most important features get
implemented around here.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote:
we exclusive lock a heap buffer (exactly one heap buffer) while
holding shared locks on btree index buffers. Is that really so
different to holding an exclusive lock on a btree buffer while
holding a shared lock on a heap buffer? Because that's what
_bt_check_unique() does today.
Is it possible to get a deadlock doing only one of those two
things? Is it possible to avoid a deadlock doing both of them?
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 13, 2013 at 2:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.
You seem to be under the impression that I'm mentioning Tom's name, or
Andres's, because I need to win some kind of an argument. I don't.
We're not going to accept a patch that uses lwlocks in the way that
you are proposing.
I mean, if we do the promise tuple
thing, and there are multiple unique indexes, what happens when an
inserter needs to block pending the outcome of another transaction?
They had better go clean up the promise tuples from the other unique
indexes that they're trying to insert into, because they cannot afford
to hold value locks for a long time, no matter how they're
implemented.
As Andres already pointed out, this is not correct. Just to add to
what he said, we already have long-lasting value locks in the form of
SIREAD locks. SIREAD can exist at different levels of granularity, but
one of those levels is index-page-level granularity, where they have
the function of guarding against concurrent insertions of values that
would fall within that page, which just so happens to be the same
thing you want to do here. The difference between those locks and
what you're proposing here is that they are implemented differently.
That is why those were acceptable and this is not.
That could take much longer than just releasing a shared
buffer lock, since for each unique index the promise tuple must be
re-found from scratch. There are huge issues with additional
complexity and bloat. Oh, and now your lightweight locks aren't so
lightweight any more.
Yep, totally agreed. If you simply lock the buffer, or take some
other action which freezes out all concurrent modifications to the
page, then re-finding the lock is much simpler. On the other hand,
it's much simpler precisely because you've reduced concurrency to the
degree necessary to make it simple. And reducing concurrency is bad.
Similarly, complexity and bloat are not great things taken in
isolation, but many of our existing locking schemes are already very
complex. Tuple locks result in a complex jig that involves locking
the tuple via the heavyweight lock manager, performing a WAL-logged
modification to the page, and then releasing the lock in the
heavyweight lock manager. As here, that is way more expensive than
simply grabbing and holding a share-lock on the page. But we get a
number of important benefits out of it. The backend remains
interruptible while the tuple is locked, the protocol for granting
locks is FIFO to prevent starvation, we don't suppress page eviction
while the lock is held, we can simultaneously lock arbitrarily large
numbers of tuples, and deadlocks are detected and handled cleanly. If
those requirements were negotiable, we would surely have negotiated
them away already, because the performance benefits would be immense.
If the value locks were made interruptible through some method, such
as the promise tuples approach, does that really make deadlocking
acceptable?
Yes. It's not possible to prevent all deadlocks. It IS possible to
make sure that they are properly detected and that precisely one of
the transactions involved is rolled back to resolve the deadlock.
You can hardly compare a buffer's LWLock with a system one that
protects critical shared memory structures. We're talking about a
shared lock on a single btree leaf page per unique index involved in
upserting.
Actually, I can and I am. Buffers ARE critical shared memory structures.
A further problem is that a backend which holds even one lwlock can't
be interrupted. We've had this argument before and it seems that you
don't think that non-interruptibility is a problem, but it is project
policy to allow for timely interrupts in all parts of the backend and
we're not going to change that policy for this patch.
I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering trade-off.
Of course I value interruptibility.
I don't see what's inflammatory about that statement. The point is
that this isn't the first time you've proposed a change which would
harm interruptibility and it isn't the first time I've objected on
precisely that basis. Interruptibility is not a nice-to-have that we
can trade away from time to time; it's essential and non-negotiable.
If you're concerned about non-interruptibility, consider XLogFlush().
That does rather a lot of work with WALWriteLock exclusive locked. On
a busy system, some backend is very frequently going to experience a
non-interruptible wait for the duration of however long it takes to
write and flush perhaps a whole segment. All other flushing backends
are stuck in non-interruptible waits waiting for that backend to
finish. I think that the group commit stuff might have regressed
worst-case interruptibility for flushers by quite a bit; should we
have never committed that, or do you agree with my view that it's
worth it?
It wouldn't take a lot to convince me that it wasn't worth it, because
I was never all that excited about that patch to begin with. I think
it mostly helps in extremely artificial situations that are not likely
to occur on real systems anyway. But, yeah, WALWriteLock is a
problem, no doubt about it. We should try to make the number of such
problems go down, not up, even if it means passing up new features
that we'd really like to have.
In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked. And only those inserting into a narrow
range of values around the btree leaf page. Much of the work that even
those writers need to do will be unimpeded anyway; they'll just block
on attempting to acquire an exclusive lock on the first btree leaf
page that the value they're inserting could be on.
Sure, but you're talking about broadening the problem from the guy
performing the insert to everybody who might be trying to do an insert
that hits one of the same unique-index pages. Instead of holding one
buffer lock, the guy performing the insert is now holding as many
buffer locks as there are indexes. That's a non-trivial issue.
For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
you'll error out. In fact, if you get the number of indexes exactly
right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
panic the whole system.
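(For reference, the guard in LWLockAcquire() looks approximately like
this -- reproduced from memory, not an exact quote -- and because
visibilitymap_clear() runs inside a critical section, hitting that ERROR
there escalates to a PANIC:)

    if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
        elog(ERROR, "too many LWLocks taken");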
Oh, and if different backends load the index list in different orders,
because say the system catalog gets vacuumed between their respective
relcache loads, then they may try to lock the indexes in different
orders and cause an undetected deadlock.
And, drifting a bit further off-topic, even to get as far as you have,
you've added overhead to every lwlock acquisition and release, even
for people who never use this functionality. I'm pretty skeptical
about anything that involves adding additional frammishes to the
lwlock mechanism. There are a few new primitives I'd like, too, but
every one we add slows things down for everybody.
And the additional
non-interruptible wait of those inserters won't be terribly much more
than the wait of the backend where heap tuple insertion takes a long
time anyway - that guy already has to do close to 100% of that work
with a non-interruptible wait today (once we eliminate
heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is
right at the end of heap_insert, and the buffer is pinned and locked
very close to the start.
That's true but somewhat misleading. Textually most of the function
holds the buffer lock, but heap_prepare_insert(),
CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and
XLogWrite() are the parts that do substantial amounts of computation,
and only the last of those happens while holding the buffer lock. And
that last is really fundamental, because we can't let any other
backend see the modified buffer until we've xlog'd the changes. The
problems you're proposing to create do not fall into the same
category.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I haven't read the patch and the btree code is an area I really don't know,
so take this for what it's worth....
It seems to me that the nature of the problem is that there will
unavoidably be a nexus between the two parts of the code here. We can try
to isolate it as much as possible but we're going to need a bit of a
compromise.
I'm imagining a function that takes two target heap buffers and a btree
key. It would descend the btree and, holding the leaf page lock, do a
try_lock on the heap pages. If it fails to get the locks then it releases
whatever it got and returns, so that the heap update can find new pages
and try again.
This still leaves the potential problem with page splits and I assume it
would still be tricky to call it without unsatisfactorily mixing executor
and btree code. But that's as far as I got.
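Very roughly, the shape I have in mind is something like this (entirely
hypothetical -- the name, signature and descend_to_leaf() helper are made
up just to sketch the idea):

    static Buffer descend_to_leaf(Relation index, Datum key);   /* placeholder */

    static bool
    lock_key_and_heap_pages(Relation index, Datum key,
                            Buffer heapbuf1, Buffer heapbuf2)
    {
        /* Descend to the leaf page the key belongs on, keeping it locked. */
        Buffer      leafbuf = descend_to_leaf(index, key);

        /* With the leaf lock held, try (without waiting) for the heap pages. */
        if (!ConditionalLockBuffer(heapbuf1))
        {
            _bt_relbuf(index, leafbuf);
            return false;       /* caller finds new heap pages and retries */
        }
        if (!ConditionalLockBuffer(heapbuf2))
        {
            LockBuffer(heapbuf1, BUFFER_LOCK_UNLOCK);
            _bt_relbuf(index, leafbuf);
            return false;
        }
        return true;            /* leaf page and both heap pages now locked */
    }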
--
greg
On 2013-09-14 09:57:43 +0100, Greg Stark wrote:
It seems to me that the nature of the problem is that there will
unavoidably be a nexus between the two parts of the code here. We can try
to isolate it as much as possible but we're going to need a bit of a
compromise.
I think Robert's point and mine is that there are several ways to
approach this without doing that.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-13 14:41:46 -0700, Peter Geoghegan wrote:
On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The reasons I wasn't saying "this will never get accepted" are twofold:
a) I don't want to stifle alternative ideas to the "promises" idea,
just because I think it's the way to go. That might stop a better idea
from being articulated. b) I am not actually in the position to say it's
not going to be accepted.
Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it.
Even if you disagree, I still think that doesn't matter in the very
least. You say:
I think that the details of how this approach compare to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.
Well, the two other people involved in the discussion so far have gone
on the record saying that the presented approach is not acceptable to
them. And you haven't started reacting to that.
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?".
But the way you're discussing it so far is exactly reducing it that way.
If you want the discussion to be about *how* we can implement it so that
the various concerns are addressed: fsck*ing great. I am with you there.
In the end, even though I have my usual strong opinions which is the
best way, I don't care which algorithm gets pursued further. At least,
if, and only if, it has a fighting chance of getting committed. Which
this doesn't.
After all, it was the first thing that
I considered, and I'm on the record talking about it in the 2012 dev
meeting. I didn't take that approach for many good reasons.
Well, I wasn't there when you said that ;)
The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.
I think you're massively, massively, massively overstating the dangers
of bloat here. It's a known problem that's *NOT* made worse by any of
the other proposals if you compare it with the loop/lock/catch
implementation of upsert that we have today as the only option. And we
*DO* have infrastructure to deal with bloat, even if it could use some
improvement. We *don't* have infrastructure to deal with deadlocks on
lwlocks. And we're not going to get that infrastructure, because it
would even further remove the "lw" part of lwlocks.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, when I've managed the
amount of work that a heap_insert() can do better. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
I personally think - and have stated so before - that doing a
heap_insert() while holding the btree lock is unacceptable.
Presumably your reason is essentially that we exclusive lock a heap
buffer (exactly one heap buffer) while holding shared locks on btree
index buffers.
It's that it interleaves an already complex but local locking scheme
that required several years to become correct with another that is just
the same. That's an utterly horrid idea.
Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.
Yes, it is different. But, in my opinion, _bt_check_unique() doing so
is a bug that needs fixing. Not something that we want to extend.
(Note that _bt_check_unique() already needs to deal with the fact that
it reads an unlocked page, because it moves right in some cases)
And, as you say:
Now, I'll grant you that there is one appreciable difference, which is
that multiple unique indexes may be involved. But limiting ourselves
to the primary key or something like that remains an option. And I'm
not sure that it's really any worse anyway.
I don't think that's an acceptable limitation. If it were something we
could lift in a release or two, maybe, but that's not what you're
talking about.
At this point I am a bit confused why you are asking for review.
I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussing
on this down through the years, but nothing ever came from it.
At the moment ISTM you're not conceding on *ANY* points. That's not very
often the way to find consensus.
Why is this an intractable problem for over a decade for us alone? Why
isn't this a problem for other database systems? I'm not implying that
it's because they do this. It's something that I am earnestly
interested in, though. A number of people have asked me that, and I
don't have a good answer for them.
Afaik all those go the route of bloat, don't they? Also, at least in the
past, mysql had a long list of caveats around it...
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a long
time, no matter how they're implemented.
Why? We're using normal transaction visibility rules here. We don't stop
*other* values on the same index getting updated or similar.
Because you're locking a value in some other, earlier unique index,
all the while waiting *indefinitely* on some other value in a second
or subsequent one. That isn't acceptable. A bunch of backends would
back up just because one backend had this contention on the second
unique index value that the others didn't actually have themselves. My
design allows those other backends to immediately go through and
finish.
That argument doesn't make sense to me. You're inserting a unique
value. It completely makes sense that you can only insert one of
them. If it's unclear whether you can insert, you're going to have to
wait. That's why they are UNIQUE after all. You're describing a complete
non-advantage here. It's also how unique indexes already work.
Also note that waits on xids are properly supervised by deadlock detection.
Even if it had an advantage, not blocking *for the single unique key alone*
opens you to issues of livelocks where several backends retry because of
each other indefinitely.
Value locks have these kinds of hazards no matter how you implement
them. Deadlocks, and unreasonable stalling as described here is always
unacceptable - whether or not the problems are detected at runtime is
ultimately of marginal interest. Either way, it's a bug.
Whether postgres locks down in a way that can only be resolved by kill -9
or whether it aborts a transaction are, like, a couple of orders of
magnitude of difference.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Sep 14, 2013 at 12:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I mean, if we do the promise tuple
thing, and there are multiple unique indexes, what happens when an
inserter needs to block pending the outcome of another transaction?
They had better go clean up the promise tuples from the other unique
indexes that they're trying to insert into, because they cannot afford
to hold value locks for a long time, no matter how they're
implemented.
As Andres already pointed out, this is not correct.
While it's true that not doing this wouldn't be incorrect, it certainly
would be useful
for preventing deadlocks and unnecessary contention. In a world where
people expect either an insert or an update, we ought to try and
reduce contention across multiple unique indexes. I can understand why
that doesn't matter today, though - if you're going to insert
duplicates indifferent to whether or not there will be conflicts,
that's a kind of abuse, and not worth optimizing for - it seems probable that
most transactions will commit. However, it seems much less probable
that most upserters will insert. People may well wish to upsert all
the time where an insert is hardly ever necessary, which is one reason
why I have doubts about other proposals.
Note that today there is no guarantee that the original waiter for a
duplicate-inserting xact to complete will be the first one to get a
second chance, so I think it's hard to question this on correctness
grounds. Even if they are released in FIFO order, there is no reason
to assume that the first waiter will win the race with a second. Most
obviously, the second waiter may not even ever get the chance to block
on the same xid at all (so it's not really a waiter at all) and still
be able to insert, if the blocking-xact aborts after the second
"waiter" starts its descent but before it checks uniqueness. All this,
even though the second "waiter" arrived maybe minutes after the first.
What I'm talking about here is really unlikely to result in lock
starvation, because the original waiter typically gets to observe the
other waiter go through, and that's reason enough to give up entirely.
Now, it's kind of weird that the original waiter will still end up
blocking on the xid that caused it to wait in the first instance. So
there should be more thought put into that, like remembering the xid
and only waiting on it on a retry, or some similar scheme. Maybe you
could contrive a scenario where this causes lock starvation, but I
suspect you could do the same thing for the present btree insertion
code.
Just to add to
what he said, we already have long-lasting value locks in the form of
SIREAD locks. SIREAD can exist at different levels of granularity, but
one of those levels is index-page-level granularity, where they have
the function of guarding against concurrent insertions of values that
would fall within that page, which just so happens to be the same
thing you want to do here. The difference between those locks and
what you're proposing here is that they are implemented differently.
That is why those were acceptable and this is not.
As the implementer of this patch, I'm obligated to put some checks in
unique index insertion that everyone has to care about. There is no
way around that. Complexity issues aside, I think that an argument
could be made for this approach *reducing* the impact on concurrency
relative to other approaches, if there aren't too many unique indexes
to deal with, which is the case the vast majority of the time. I mean,
those other approaches necessitate doing so much more with *exclusive*
locks held. Like inserting, maybe doing a page split, WAL-logging, all
with the lock, and then either updating in place or killing the
promise tuple, and WAL-logging that, with an exclusive lock held the
second time around. Plus searching for everything twice. I think that
frequently killing all of those broken-promise tuples could have
deleterious effects on concurrency and/or index bloat of the kind only
remedied by reindex. Do you update the freespace map too? More
exclusive locks! Or if you leave it up to VACUUM (and just set the xid
to InvalidXid, which is still extra work), autovacuum has to care
about a new *class* of bloat - index-only bloat. Plus lots of dead
duplicates are bad for performance in btrees generally.
As here, that is way more expensive than
simply grabbing and holding a share-lock on the page. But we get a
number of important benefits out of it. The backend remains
interruptible while the tuple is locked, the protocol for granting
locks is FIFO to prevent starvation, we don't suppress page eviction
while the lock is held, we can simultaneously lock arbitrarily large
numbers of tuples, and deadlocks are detected and handled cleanly. If
those requirements were negotiable, we would surely have negotiated
them away already, because the performance benefits would be immense.
False equivalence. We only need to lock as many unique index *values*
(not tuples) as are proposed for insertion per slot (which can be
reasonably bound), and only for an instant. Clearly it would be
totally unacceptable if tuple-level locks made backends
uninterruptible indefinitely. Of course, this is nothing like that.
If the value locks were made interruptible through some method, such
as the promise tuples approach, does that really make deadlocking
acceptable?
Yes. It's not possible to prevent all deadlocks. It IS possible to
make sure that they are properly detected and that precisely one of
the transactions involved is rolled back to resolve the deadlock.
You seem to have misunderstood me here, or perhaps I was unclear. I'm
referring to deadlocks that cannot really be predicted or analyzed by
the user at all - see my comments below on insertion order.
"I don't think non-interruptibility is a problem"? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering trade-off.
Of course I value interruptibility.
I don't see what's inflammatory about that statement.
The fact that you simply stated, in an unqualified way, that I don't
think non-interruptibility is a problem, obviously.
Interruptibility is not a nice-to-have that we
can trade away from time to time; it's essential and non-negotiable.
I seem to recall you saying something about the Linux kernel and their
attitude to interruptibility. Yes, interruptibility is not just a
nice-to-have; it is essential. However, without dismissing your
other concerns, I have yet to hear a convincing argument as to why
anything I've done here is going to make any difference to
interruptibility that would be appreciable to any human. So far it's
been a slippery slope type argument that can be equally well used to
argue against some facet of almost any substantial patch ever
proposed. I just don't think that regressing interruptibility
marginally is *necessarily* sufficient justification for rejecting an
approach outright. FYI, *that's* how I value interruptibility
generally.
In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked. And only those inserting into a narrow
range of values around the btree leaf page. Much of the work that even
those writers need to do will be unimpeded anyway; they'll just block
on attempting to acquire an exclusive lock on the first btree leaf
page that the value they're inserting could be on.
Sure, but you're talking about broadening the problem from the guy
performing the insert to everybody who might be trying to do an insert
that hits one of the same unique-index pages.
In general, that isn't that much worse than just blocking the value
directly. The number of possible values that could also be blocked is
quite low. The chances of it actually mattering that those additional
values are locked in the still small window in which the buffer locks
are held is in general fairly low, particularly on larger tables
where there is naturally a large number of possible distinct values. I
will however concede that the impact on inserters that want to insert
a non-locked value that belongs on the locked page or its child might
be worse, but it's already a problem that inserted index tuples can
all end up on the same page, if not to the same extent.
Instead of holding one
buffer lock, the guy performing the insert is now holding as many
buffer locks as there are indexes. That's a non-trivial issue.
Actually, as many buffer locks as there are *unique* indexes. It might
be a non-trivial issue, but this whole problem is decidedly
non-trivial, as I'm sure we can all agree.
For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
you'll error out. In fact, if you get the number of indexes exactly
right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
panic the whole system.
Oh, come on. We can obviously engineer a solution to that problem. I
don't think I've ever seen a table with close to 100 *unique* indexes.
4 or 5 is a very high number. If we just raised an error if someone
tried to do this with more than 10 unique indexes, I would guess
that we'd get exactly zero complaints about it.
Oh, and if different backends load the index list in different orders,
because say the system catalog gets vacuumed between their respective
relcache loads, then they may try to lock the indexes in different
orders and cause an undetected deadlock.
Undetected deadlock is really not much worse than detected deadlock
here. Either way, it's a bug. And it's something that any kind of
implementation will need to account for. It's not okay to
*unpredictably* deadlock, in a way that the user has no control over.
Today, someone can do an analysis of their application and eliminate
deadlocks if they need to. That might not be terribly practical much
of the time, but it can be done. It certainly is practical to do it in
a localized way. I wouldn't like to compromise that.
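To be concrete about the distinction, this is the kind of deadlock I
mean when I say a user can analyze and eliminate it - a minimal sketch
using a hypothetical accounts table, nothing to do with the patch:

-- Session 1:
BEGIN;
UPDATE accounts SET bal = bal - 10 WHERE id = 1;
-- Session 2:
BEGIN;
UPDATE accounts SET bal = bal - 10 WHERE id = 2;
UPDATE accounts SET bal = bal + 10 WHERE id = 1;  -- blocks on session 1
-- Session 1:
UPDATE accounts SET bal = bal + 10 WHERE id = 2;  -- deadlock detected,
                                                  -- one xact is aborted

The remedy is entirely in the application's hands: make every
transaction take its row locks in a consistent order, say by id.
Deadlocks that depend on the order in which a backend's relcache
happens to return the list of indexes offer no such recourse.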
So yes, you're right that I need to control for this sort of thing
better than in the extant patch, and in fact this was discussed fairly
early on. But it's an inherent problem.
And, drifting a bit further off-topic, even to get as far as you have,
you've added overhead to every lwlock acquisition and release, even
for people who never use this functionality.
If you look at the code, you'll see that I've made very modest
modifications to LWLockRelease only. I would be extremely surprised if
the overhead was not only in the noise, but was completely impossible
to detect through any conventional benchmark. These are the same kind
of very modest changes made for LWLockAcquireOrWait(), and you said
nothing about that at the time. Despite the fact that you now appear
to think that that whole effort was largely a waste of time.
That's true but somewhat misleading. Textually most of the function
holds the buffer lock, but heap_prepare_insert(),
CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and
XLogWrite() are the parts that do substantial amounts of computation,
and only the last of those happens while holding the buffer lock.
I've already written modifications so that I don't have to do
heap_prepare_insert() with the locks held. There is no reason to call
CheckForSerializableConflictIn() with the additional locks held
either. After all, "For a heap insert, we only need to check for
table-level SSI locks". As for RelationGetBufferForTuple(), yes, the
majority of the time it will have to do very little without acquiring
an exclusive lock, because it's going to get that from the last place
a heap tuple was inserted from.
--
Peter Geoghegan
On Sat, Sep 14, 2013 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it.
Even if you disagree, I still think that doesn't matter in the very
least.
It matters if you care about getting this feature.
You say:
I think that the details of how this approach compares to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.
Well, the two other people involved in the discussion so far have gone
on the record saying that the presented approach is not acceptable to
them. And you haven't started reacting to that.
Uh, yes I have. I'm not really sure what you could mean by that. What
am I refusing to address?
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?".But the way you're discussing it so far is exactly reducing it that way.
The fact that I was motivated to do things this way serves to
illustrate the problems generally.
If you want the discussion to be about *how* we can implement it so that
the various concerns are addressed: fsck*ing great. I am with you there.
Isn't that what we were doing? There has been plenty of commentary on
alternative approaches.
In the end, even though I have my usual strong opinions about which is
the best way, I don't care which algorithm gets pursued further. At least,
if, and only if, it has a fighting chance of getting committed. Which
this doesn't.
I don't think that any design that has been described to date doesn't
have serious problems. Causing excessive bloat, particularly in
indexes, is a serious problem also.
The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.
I think you're massively, massively, massively overstating the dangers
of bloat here. It's a known problem that's *NOT* getting worse by any of
the other proposals if you compare it with the loop/lock/catch
implementation of upsert that we have today as the only option.
Why would I compare it with that? That's terrible, and very few of our
users actually know about it anyway. Also, will an UPDATE followed by
an INSERT really bloat all that much anyway?
And we
*DO* have infrastructure to deal with bloat, even if it could use some
improvement. We *don't* have infrastructure to deal with deadlocks on
lwlocks. And we're not going to get that infrastructure, because it
would even further remove the "lw" part of lwlocks.
Everything I said so far is predicated on LWLocks not deadlocking
here, so I'm not really sure why you'd say that. If I can't find a way
to prevent deadlock, then clearly the approach is doomed.
It's that it interleaves an already complex but local locking scheme
that required several years to become correct with another that is just
the same. That's an utterly horrid idea.
You're missing my point, which is that it may be possible, with
relatively modest effort, to analyze things to ensure that deadlock is
impossible - regardless of the complexities of the two systems -
because they're reasonably well encapsulated. See below, under "I'll
say it again".
Now, I can certainly understand why you wouldn't be willing to accept
that at face value. The idea isn't absurd, though. You could think of
the heap_insert() call as being under the control of the btree code
(just as, say, heap_hot_search() is), even though the code isn't at
all structured that way, and that's awkward. I'm actually slightly
tempted to structure it that way.
Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.
Yes, it is different. But, in my opinion, _bt_check_unique() doing so
is a bug that needs fixing. Not something that we want to extend.
Well, I think you know that that's never going to happen. There are
all kinds of reasons why it works that way that cannot be disavowed.
My definition of a bug includes a user being affected.
At this point I am a bit confused why you are asking for review.
I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussion
on this down through the years, but nothing ever came from it.
At the moment ISTM you're not conceding on *ANY* points. That's not very
often the way to find consensus.
Really? I've conceded plenty of points. Just now I conceded a point to
Robert about insertion being blocked for inserters that want to insert
a value that isn't already locked/existing, and he didn't even raise
that in the first place. Most prominently, I've conceded that it is
entirely questionable that I hold the buffer locks for longer - before
you even responded to my original patch! I've said it many many times
many many ways. It should be heavily scrutinized. But you both seem to
be making general points along those lines, without reference to what
I've actually done. Those general points could almost to the same
extent apply to _bt_check_unique() today, which is why I have a hard
time accepting them at face value. To say that what that function does
is "a bug" is just not credible, because it's been around in
essentially the same form since at least a time when you and I were in
primary school. I'll remind you that you haven't been able to
demonstrate deadlock in a way that invalidates my approach. While of
course that's not how this is supposed to work, I've been too busy
defending myself here to get down to the business of carefully
analysing the relatively modest interactions between btree and heap
that could conceivably introduce a deadlock. Yes, the burden to prove
this can't deadlock is mine, but I thought I'd provide you with the
opportunity to prove that it can.
I'll say it again: For a deadlock, there needs to be a mutual
dependency. Provided the locking phase doesn't acquire any locks other
than buffer locks, and, during the interaction with the heap, btree
inserters (or the locking phase) cannot acquire heap locks in a way
that conflicts with other upserters, we will be fine. It doesn't
necessarily matter how complex each system individually is, because
the two meet in such a limited area (well, two areas now, I suppose),
and they only meet in one direction - there is no reciprocation where
the heap code locks or otherwise interacts with index buffers. When
the heap insertion is performed, all index value locks are already
acquired. The locking phase cannot block itself because of the
ordering of locking, but also because the locks on the heap that it
takes are only shared locks.
Now, this analysis is somewhat complex, and underdeveloped. But as
Robert said, there are plenty of things about locking in Postgres that
are complex and subtle. He also said that it doesn't matter if I can
prove that it won't deadlock, but I'd like a second opinion on that,
since my proof might actually be, if not simple, short, and therefore
may not represent an ongoing burden in the way Robert seemed to think
it would.
That argument doesn't make sense to me. You're inserting a unique
value. It completely makes sense that you can only insert one of
them.
Even if it had an advantage, not blocking *for the single unique key alone*
opens you to issues of livelocks where several backends retry because of
each other indefinitely.
See my remarks to Robert.
Whether postgres locks down in a way that can only be resolved by kill -9
or whether it aborts a transaction are, like, a couple of orders of
magnitude of difference.
Not really. I can see the advantage of having the deadlock be
detectable from a defensive-coding standpoint. But index locking
ordering inconsistencies, and the deadlocks they may cause are not
acceptable generally.
--
Peter Geoghegan
On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
It seems to me that the nature of the problem is that there will unavoidably
be a nexus between the two parts of the code here. We can try to isolate it
as much as possible but we're going to need a bit of a compromise.
Exactly. That's why all the proposals with the exception of this one
have to date involved unacceptable bloating - that's how they try and
span the nexus.
I'll find it very difficult to accept any implementation that is going
to bloat things even worse than our upsert looping example. The only
advantage of such an implementation over the upsert example is that
it'll avoid burning through subxacts. The main reason I don't want to
take that approach is that I know it won't be accepted, because it's a
disaster. That's why the people that proposed this in various forms
down through the years haven't gone and implemented it themselves. I
do not accept that all of this is like the general situation with row
locks. I do not think that the big costs of having many dead
duplicates in a unique index can be overlooked (or perhaps the cost of
cleaning them up eagerly, which is something I'd also expect to work
very badly). That's something that's going to reverberate all over the
place. Imagine a simple, innocent looking pattern that resulted in
there being unique indexes that became hugely bloated. It's not hard.
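For anyone who hasn't seen it, the "upsert looping example" I keep
referring to is the retry-in-a-subtransaction pattern from the docs,
roughly this shape (sketched for a hypothetical tab(key, val) table):

CREATE FUNCTION merge_tab(k int, v text) RETURNS void AS
$$
BEGIN
    LOOP
        -- first try to update an existing key
        UPDATE tab SET val = v WHERE key = k;
        IF found THEN
            RETURN;
        END IF;
        -- not there, so try to insert it; if another session gets in
        -- first we get a unique_violation and go around again
        BEGIN
            INSERT INTO tab (key, val) VALUES (k, v);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- do nothing; loop and retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Every trip through the EXCEPTION block burns a subtransaction, and
every INSERT that loses the race leaves behind a dead heap tuple plus
dead index entries. That's the baseline level of bloat I'm measuring
everything else against.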
What I will concede (what I have conceded, actually) is that it would
be better if the locks were more granular. Now, I'm not so much
concerned about concurrent inserters inserting values that just so
happen to be values that were locked. It's more the case that I'm
worried about inserters blocking on other values that are incidentally
locked despite not already existing, that would go on the locked page
or maybe a later page. In particular, I'm concerned about the impact
on SERIAL primary key columns. Not exactly an uncommon case (though
one I'd already thought to optimize by locking last).
What I think might actually work acceptably is if we were to create an
SLRU that kept track of value-locks per buffer. The challenge there
would be to have regular unique index inserters care about them, while
having little to no impact on their regular performance. This might be
possible by having them check the buffer for external value locks in
the SLRU immediately after exclusive locking the buffer - usually that
only has to happen once per index tuple insertion (assuming no
duplicates necessitate retry). If they find their value in the SLRU,
they do something like unlock and block on the other xact and restart.
Now, obviously a lot of the details would have to be worked out, but
it seems possible.
In order for any of this to really be possible, there'd have to be
some concession made to my position, as Greg mentions here. In other
words, I'd need buy-in for the general idea of holding locks in shared
memory from indexes across heap tuple insertion (subject to a sound
deadlock analysis, of course). Some modest compromises may need to be
made around interruptibility. I'd also probably need agreement that
it's okay that value locks can not last more than an instant (they
cannot be held indefinitely pending the end of a transaction). This
isn't something that I imagine to be too controversial, because it's
true today for a single unique index. As I've already outlined, anyone
waiting on another transaction with a would-be duplicate to commit has
very few guarantees about the order that it'll get its second shot
relative to the order it initially queued up behind the successful but
not-yet-committed inserter.
--
Peter Geoghegan
On 15 Sep 2013 10:19, "Peter Geoghegan" <pg@heroku.com> wrote:
On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
It seems to me that the nature of the problem is that there will unavoidably
be a nexus between the two parts of the code here. We can try to isolate it
as much as possible but we're going to need a bit of a compromise.
In order for any of this to really be possible, there'd have to be
some concession made to my position, as Greg mentions here. In other
words, I'd need buy-in for the general idea of holding locks in shared
memory from indexes across heap tuple insertion (subject to a sound
deadlock analysis, of course).
Actually that wasn't what I meant by that.
What I meant is that there is going to be some code coupling between the
executor and btree code. That's purely a question of code structure, and
will be true regardless of the algorithm you settle on.
What I was suggesting was an api for a function that would encapsulate that
coupling. The executor would call this function which would promise to
obtain all the locks needed for both operations or give up. Effectively it
would be a special btree operation which would have special knowledge of
the executor only in that it knows that being able to get a lock on two
heap buffers is something the executor needs sometimes.
I'm not sure this fits well with your syntax since it assumes the update
will happen at the same time as the index lookup but as I said I haven't
read your patch, maybe it's not incompatible. I'm writing all this on my
phone so it's mostly just pie in the sky brainstorming. I'm sorry if it's
entirely irrelevant.
Peter Geoghegan <pg@heroku.com> wrote:
There is no reason to call CheckForSerializableConflictIn() with
the additional locks held either. After all, "For a heap insert,
we only need to check for table-level SSI locks".
You're only talking about not covering that call with a *new*
LWLock, right? We put some effort into making sure that such calls
were only inside of LWLocks which were needed for correctness.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-09-15 02:19:41 -0700, Peter Geoghegan wrote:
On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
It seems to me that the nature of the problem is that there will unavoidably
be a nexus between the two parts of the code here. We can try to isolate it
as much as possible but we're going to need a bit of a compromise.
Exactly. That's why all the proposals with the exception of this one
have to date involved unacceptable bloating - that's how they try and
span the nexus.
I'll find it very difficult to accept any implementation that is going
to bloat things even worse than our upsert looping example.
How would any even halfway sensible example cause *more* bloat than the
upsert looping thing?
I'll concede that bloat is something to be aware of, but just because
it's *an* issue, it's not *the* only issue.
All the solutions I can think of/have heard of that have the chance
of producing additional bloat also have a good chance of cleaning up the
additional bloat.
In the "promises" approach you simply can mark the promise index tuples
as LP_DEAD in the IGNORE case if you've found a conflicting tuple. In
the OR UPDATE case you can immediately reuse them. There's no heap
bloat. The logic for dead items already exists in nbtree, so that's not
too much complication. The case where that doesn't work is when postgres
dies in between or we're signalled to abort. But that produces bloat for
normal DML anyway. Any vacuum or insert can check whether the promise
xid has committed and remove the promise otherwise.
In the proposals that involve just inserting the heap tuple and then
handling the uniqueness violation when inserting the index tuples, you
can immediately mark the index tuples as dead and mark it as prunable.
The only advantage of such an implementation over the upsert example is that
it'll avoid burning through subxacts. The main reason I don't want to
take that approach is that I know it won't be accepted, because it's a
disaster. That's why the people that proposed this in various forms
down through the years haven't gone and implemented it themselves. I
do not accept that all of this is like the general situation with row
locks.
The primary advantage will be that it's actually usable by users without
massive overhead in writing dozens of functions.
I don't think the bloat issue had much to do with the feature not
getting implemented so far. It's that nobody was willing to do the work
and endure the discussions around it. And I definitely applaud you for
finally tackling the issue despite that.
I do not think that the big costs of having many dead
duplicates in a unique index can be overlooked
Why would there be so many duplicate index tuples? The primary user of
this is going to be UPSERT. In case there's a conflicting tuple, there
is going to be a new tuple version. Which will need a new index entry
quite often. If there's no conflict, we will insert anyway.
So, there's the case of UPSERTs that could be done as HOT updates
because there's enough space on the page and none of the indexes
actually have changed. As explained above, we can simply mark the index
tuple as dead in that case (don't even need an exclusive lock for that,
if done right).
(or perhaps the cost of
cleaning them up eagerly, which is something I'd also expect to work
very badly).
Why? Remember the page you did the insert to, do a _bt_moveright() to
catch any intervening splits. Mark the item as dead. Done. The next insert will
repack the page if necessary (cf. _bt_findinsertloc).
What I will concede (what I have conceded, actually) is that it would
be better if the locks were more granular. Now, I'm not so much
concerned about concurrent inserters inserting values that just so
happen to be values that were locked. It's more the case that I'm
worried about inserters blocking on other values that are incidentally
locked despite not already existing, that would go on the locked page
or maybe a later page. In particular, I'm concerned about the impact
on SERIAL primary key columns. Not exactly an uncommon case (though
one I'd already thought to optimize by locking last).
Yes, I think that's the primary issue from a scalability and performance
POV. Locking entire ranges of values, potentially even on inner pages
(because you otherwise would have to split) isn't going to work.
What I think might actually work acceptably is if we were to create an
SLRU that kept track of value-locks per buffer. The challenge there
would be to have regular unique index inserters care about them, while
having little to no impact on their regular performance. This might be
possible by having them check the buffer for external value locks in
the SLRU immediately after exclusive locking the buffer - usually that
only has to happen once per index tuple insertion (assuming no
duplicates necessitate retry). If they find their value in the SLRU,
they do something like unlock and block on the other xact and restart.
Now, obviously a lot of the details would have to be worked out, but
it seems possible.
If you can make that work, without locking heap and btree pages at the
same time, yes, I think that's a possible way forward. One way to offset
the cost of SLRU in the common case where there is no contention would
be to have a page level flag that triggers that lookup. There should be
space in btpo_flags.
In order for any of this to really be possible, there'd have to be
some concession made to my position, as Greg mentions here. In other
words, I'd need buy-in for the general idea of holding locks in shared
memory from indexes across heap tuple insertion (subject to a sound
deadlock analysis, of course).
I don't have a fundamental problem with holding locks during the
insert. I have a problem with holding page level lightweight locks on
the btree and the heap at the same time.
Some modest compromises may need to be made around interruptibility.
Why? As far as I understand that proposal, I don't see why that would be needed?
I'd also probably need agreement that
it's okay that value locks can not last more than an instant (they
cannot be held indefinitely pending the end of a transaction). This
isn't something that I imagine to be too controversial, because it's
true today for a single unique index. As I've already outlined, anyone
waiting on another transaction with a would-be duplicate to commit has
very few guarantees about the order that it'll get its second shot
relative to the order it initially queued up behind the successful but
not-yet-committed inserter.
I foresee problems here.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
Note that today there is no guarantee that the original waiter for a
duplicate-inserting xact to complete will be the first one to get a
second chance, so I think it's hard to question this on correctness
grounds. Even if they are released in FIFO order, there is no reason
to assume that the first waiter will win the race with a second. Most
obviously, the second waiter may not even ever get the chance to block
on the same xid at all (so it's not really a waiter at all) and still
be able to insert, if the blocking-xact aborts after the second
"waiter" starts its descent but before it checks uniqueness. All this,
even though the second "waiter" arrived maybe minutes after the first.
ProcLockWakeup() only wakes as many waiters from the head of the queue
as can all be granted the lock without any conflicts. So I don't
think there is a race condition in that path.
So far it's
been a slippery slope type argument that can be equally well used to
argue against some facet of almost any substantial patch ever
proposed.
I don't completely agree with that characterization, but you do have a
point. Obviously, if the differences in the area of interruptibility,
starvation, deadlock risk, etc. relative to the status quo can be made
small enough, then those aren't reasons to
reject the approach.
But I'm skeptical that you're going to be able to accomplish that,
especially without adversely affecting maintainability. I think the
way that you're proposing to use lwlocks here is sufficiently
different from what the rest of the system does that it's going to be
hard to avoid system-wide effects that can't easily be caught during
code review; and like Andres, I don't share your skepticism about
alternative approaches.
For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
you'll error out. In fact, if you get the number of indexes exactly
right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
panic the whole system.
Oh, come on. We can obviously engineer a solution to that problem. I
don't think I've ever seen a table with close to 100 *unique* indexes.
4 or 5 is a very high number. If we just raised an error if someone
tried to do this with more than 10 unique indexes, I would guess
that we'd get exactly zero complaints about it.
That's not a solution; that's a hack.
Undetected deadlock is really not much worse than detected deadlock
here. Either way, it's a bug. And it's something that any kind of
implementation will need to account for. It's not okay to
*unpredictably* deadlock, in a way that the user has no control over.
Today, someone can do an analysis of their application and eliminate
deadlocks if they need to. That might not be terribly practical much
of the time, but it can be done. It certainly is practical to do it in
a localized way. I wouldn't like to compromise that.
I agree that unpredictable deadlocks are bad. I think the fundamental
problem with UPSERT, MERGE, and this proposal is what happens when the
conflicting tuple is present but not visible to your scan, either
because it hasn't committed yet or because it has committed but is not
visible to your snapshot. I'm not clear on how you handle that in
your approach.
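To make the case I'm worried about concrete, consider something like
this, with a hypothetical table t (k int primary key):

-- Session 1:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM t;           -- snapshot taken; key 1 not present
-- Session 2:
INSERT INTO t VALUES (1);         -- commits
-- Session 1:
SELECT * FROM t WHERE k = 1;      -- returns no rows under our snapshot
INSERT INTO t VALUES (1);         -- nevertheless fails: duplicate key

For a plain INSERT the answer is easy enough - raise the unique
violation. For an upsert, the question is what "update the existing
row" is supposed to mean when the existing row is one your snapshot
cannot see.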
If you look at the code, you'll see that I've made very modest
modifications to LWLockRelease only. I would be extremely surprised if
the overhead was not only in the noise, but was completely impossible
to detect through any conventional benchmark. These are the same kind
of very modest changes made for LWLockAcquireOrWait(), and you said
nothing about that at the time. Despite the fact that you now appear
to think that that whole effort was largely a waste of time.
Well, I did have some concerns about the performance impact of that patch:
/messages/by-id/CA+TgmoaPyQKEaoFz8HkDGvRDbOmRpkGo69zjODB5=7Jh3hbPQA@mail.gmail.com
I also discovered, after it was committed, that it didn't help in the
way I expected:
/messages/by-id/CA+TgmoY8P3sD=oUViG+xZjmZk5-phuNV39rtfyzUQxU8hJtZxw@mail.gmail.com
It's true that I didn't raise those concerns contemporaneously with
the commit, but I didn't understand the situation well enough at that
time to realize how narrow the benefit was.
I've wished, on a number of occasions, to be able to add more lwlock
primitives. The problem with that is that if everybody does it, we'll
pretty soon end up with a mess. I attempted to address that with this
proposal:
/messages/by-id/CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com
...but nobody (including me) was very sure that was the right way
forward, and it never went anywhere. However, I think the basic issue
remains. I was sad to discover last week that Heikki handled this
problem for the WAL scalability patch by basically copy-and-pasting
much of the lwlock code and then hacking it up. I think we're well on
our way to an unmaintainable mess already, and I don't want it to get
worse. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-09-17 12:29:51 -0400, Robert Haas wrote:
But I'm skeptical that you're going to be able to accomplish that,
especially without adversely affecting maintainability. I think the
way that you're proposing to use lwlocks here is sufficiently
different from what the rest of the system does that it's going to be
hard to avoid system-wide effects that can't easily be caught during
code review;
I actually think extending lwlocks to allow downgrading an exclusive
lock is a good idea, independent of this patch, and I think there are
some areas of the code where we could use that capability to increase
scalability. Now, that might be because I pretty much suggested using
them in such a way to solve some of the problems :P
I don't think they solve the issue of this patch (holding several nbtree
pages locked across heap operations) though.
I agree that unpredictable deadlocks are bad. I think the fundamental
problem with UPSERT, MERGE, and this proposal is what happens when the
conflicting tuple is present but not visible to your scan, either
because it hasn't committed yet or because it has committed but is not
visible to your snapshot. I'm not clear on how you handle that in
your approach.
Hm. I think it should be handled exactly the way we handle it for unique
indexes today. Wait till it's clear whether you can proceed.
At some point we might want to extend that logic to more cases, but that
should be a separate discussion imo.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 17, 2013 at 6:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
I agree that unpredictable deadlocks are bad. I think the fundamental
problem with UPSERT, MERGE, and this proposal is what happens when the
conflicting tuple is present but not visible to your scan, either
because it hasn't committed yet or because it has committed but is not
visible to your snapshot. I'm not clear on how you handle that in
your approach.
Hm. I think it should be handled exactly the way we handle it for unique
indexes today. Wait till it's clear whether you can proceed.
That's what I do, although getting those details right has been of
secondary concern for obvious reasons.
At some point we might want to extend that logic to more cases, but that
should be a separate discussion imo.
This is essentially why I went and added a row locking component over
your objections. Value locks (regardless of implementation)
effectively stop an insertion from finishing, but not from starting.
ISTM that locking the row with value locks held can cause deadlock.
So, unfortunately, we cannot really discuss value locking and row
locking separately, even though I see the appeal of trying to. Gaining
an actual representative notion of the expense of releasing and
re-acquiring the locks is too tightly coupled with how this is handled
and how frequently we need to restart. Plus there may well be other
issues in the same vein that we've yet to consider.
--
Peter Geoghegan
On 2013-09-18 00:54:38 -0500, Peter Geoghegan wrote:
At some point we might want to extend that logic to more cases, but that
should be a separate discussion imo.
This is essentially why I went and added a row locking component over
your objections.
I didn't object to implementing row level locking. I said that if your
basic algorithm without row level locks is viewed as being broken, it
won't be fixed by implementing row level locking.
What I meant here is just that we shouldn't implement a mode with less
waiting for now even if there might be use cases, because that will open
another can of worms.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 17, 2013 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
Note that today there is no guarantee that the original waiter for a
duplicate-inserting xact to complete will be the first one to get a
second chance
ProcLockWakeup() only wakes as many waiters from the head of the queue
as can all be granted the lock without any conflicts. So I don't
think there is a race condition in that path.
Right, but what about XactLockTableWait() itself? It only acquires a
ShareLock on the xid of the got-there-first inserter that potentially
hasn't yet committed/aborted. There will be no conflicts between
multiple second-chance-seeking blockers trying to acquire this lock
concurrently, and so in fact there is (what I guess you'd consider to
be) a race condition in the current btree insertion code. So my
earlier point about according an upsert implementation license to
optimize ordering of retries across multiple unique indexes -- that it
isn't really inconsistent with the current code when dealing with only
one unique index insertion -- has not been invalidated.
EvalPlanQualFetch() and Do_MultiXactIdWait() also call
XactLockTableWait(), for similar reasons. In my patch, the later row
locking code used by INSERT...ON DUPLICATE KEY LOCK FOR UPDATE calls
XactLockTableWait() too.
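Incidentally, the existing behavior is easy to observe from SQL. With
a throwaway table t (k int primary key) and nothing else running:

-- Session 1:
BEGIN;
INSERT INTO t VALUES (1);         -- not yet committed
-- Session 2:
INSERT INTO t VALUES (1);         -- blocks, via _bt_check_unique() ->
                                  -- XactLockTableWait()
-- Session 3:
SELECT pid, mode, granted
FROM pg_locks WHERE locktype = 'transactionid';
-- session 1 holds ExclusiveLock on its own transactionid; session 2
-- shows up waiting (granted = f) for ShareLock on that same xid

If session 1 commits, session 2 gets a duplicate key error; if session
1 aborts, session 2's insert goes through. Any number of additional
sessions inserting the value 1 queue up on that same ShareLock, and
nothing much dictates which of them gets its second chance first.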
So far it's
been a slippery slope type argument that can be equally well used to
argue against some facet of almost any substantial patch ever
proposed.
I don't completely agree with that characterization, but you do have a
point. Obviously, if the differences in the area of interruptibility,
starvation, deadlock risk, etc. relative to the status quo can be made
small enough, then those aren't reasons to
reject the approach.
That all seems fair to me. That's the standard that I'd apply as a
reviewer myself.
But I'm skeptical that you're going to be able to accomplish that,
especially without adversely affecting maintainability. I think the
way that you're proposing to use lwlocks here is sufficiently
different from what the rest of the system does that it's going to be
hard to avoid system-wide effects that can't easily be caught during
code review;
Fair enough. In case it isn't already totally clear to someone, I
concede that it isn't going to be workable to hold even shared buffer
locks across all these operations. Let's get past that, though.
and like Andres, I don't share your skepticism about
alternative approaches.
Well, I expressed skepticism about one alternative approach in
particular, which is the promise tuples approach. Andres seems to
think that I'm overly concerned about bloat, but I'm not sure he
appreciates why I'm so sensitive to it in this instance. I'll be
particularly sensitive to it if value locks need to be held
indefinitely rather than there being a speculative
grab-the-value-locks attempt (because that increases the window in
which another session can necessitate that we retry at row locking
time quite considerably - see below).
I think the fundamental
problem with UPSERT, MERGE, and this proposal is what happens when the
conflicting tuple is present but not visible to your scan, either
because it hasn't committed yet or because it has committed but is not
visible to your snapshot.
Yeah, you're right. As I mentioned to Andres already, when row locking
happens and there is this kind of conflict, my approach is to retry
from scratch (go right back to before value lock acquisition) in the
sort of scenario that generally necessitates EvalPlanQual() looping,
or to throw a serialization failure where that's appropriate. After an
unsuccessful attempt at row locking there could well be an interim
wait for another xact to finish, before retrying (at read committed
isolation level). This is why I think that value locking/retrying
should be cheap, and should avoid bloat if at all possible.
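For comparison, this is the existing read committed behavior that the
retry is analogous to (a sketch with a hypothetical counters table):

-- Session 1:
BEGIN;
UPDATE counters SET n = n + 1 WHERE id = 1;
-- Session 2 (read committed):
UPDATE counters SET n = n + 1 WHERE id = 1;  -- blocks on session 1's row lock
-- Session 1:
COMMIT;
-- session 2 now re-checks its qual against the new row version (the
-- EvalPlanQual dance) and applies its update on top of a version its
-- snapshot never saw; at repeatable read and above the same sequence
-- instead fails with a serialization error

The difference in my case is that rather than re-verifying a qual
against the new row version, we go back and redo the value locking,
but the spirit - don't error out at read committed, operate on the
latest version - is the same.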
Forgive me if I'm making a leap here, but it seems like what you're
saying is that the semantics of upsert that one might naturally expect
are *arguably* fundamentally impossible, because they entail
potentially locking a row that isn't current to your snapshot, and you
cannot throw a serialization failure at read committed. I respectfully
suggest that that exact definition of upsert isn't a useful one,
because other snapshot isolation/MVCC systems operating within the
same constraints must have the same issues, and yet they manage to
implement something that could be called upsert that people seem happy
with.
I also discovered, after it was committed, that it didn't help in the
way I expected:
/messages/by-id/CA+TgmoY8P3sD=oUViG+xZjmZk5-phuNV39rtfyzUQxU8hJtZxw@mail.gmail.com
Well, at the time you didn't also provide raw commit latency benchmark
results for your hardware using a tool like pg_test_fsync, which I'd
consider absolutely essential to such a discussion. That's mostly or
entirely what the group commit stuff does - amortize that cost among
concurrently flushing transactions. Around this time, the patch was
said by Heikki to just relieve lock contention around WALWriteLock -
the 9.2 release notes say much the same. I never understood it that
way, though Heikki disagreed with that [1].
Certainly, if relieving contention was all the patch did, then you
wouldn't expect the 9.3 commit_delay implementation to help anyone,
but it does: with a slow fsync holding the lock 50% *longer* can
actually help tremendously. So I *always* agreed with you that there
was hardware where group commit would barely help with a moderately
sympathetic benchmark like the pgbench default. Not that it matters
much now.
It's true that I didn't raise those concerns contemporaneously with
the commit, but I didn't understand the situation well enough at that
time to realize how narrow the benefit was.
I've wished, on a number of occasions, to be able to add more lwlock
primitives. The problem with that is that if everybody does it, we'll
pretty soon end up with a mess.
I wouldn't go that far. The number of possible additional primitives
that are useful isn't that high, unless we decide that LWLocks are
going to be a fundamentally different thing, which I consider
unlikely.
/messages/by-id/CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com
...but nobody (including me) was very sure that was the right way
forward, and it never went anywhere. However, I think the basic issue
remains. I was sad to discover last week that Heikki handled this
problem for the WAL scalability patch by basically copy-and-pasting
much of the lwlock code and then hacking it up. I think we're well on
our way to an unmaintainable mess already, and I don't want it to get
worse. :-(
I hear what you're saying about LWLocks. I did follow the FlexLocks
stuff at the time myself. Obviously we aren't going to add new lwlock
operations if they have exactly no clients. However, I think that the
semantics implemented (weakening and strengthening of locks) may well
be handy somewhere else. So while I wouldn't go and commit that stuff
on the off chance that it will be useful, it's worth bearing in mind
going forward that it's quite possible to weaken/strengthen locks.
[1]: /messages/by-id/4FB0A673.7040002@enterprisedb.com
--
Peter Geoghegan
Peter,
* Peter Geoghegan (pg@heroku.com) wrote:
Forgive me if I'm making a leap here, but it seems like what you're
saying is that the semantics of upsert that one might naturally expect
are *arguably* fundamentally impossible, because they entail
potentially locking a row that isn't current to your snapshot, and you
cannot throw a serialization failure at read committed. I respectfully
suggest that that exact definition of upsert isn't a useful one,
I'm not sure I follow this completely- you're saying that a definition
of 'upsert' which includes having to lock rows which aren't in your
current snapshot (for reasons stated) isn't a useful one. Is the
implication that a useful definition of 'upsert' is that it *doesn't*
have to lock rows which aren't in your current snapshot, and if so, then
what would the semantics of that upsert look like?
because other snapshot isolation/MVCC systems operating within the
same constraints must have the same issues, and yet they manage to
implement something that could be called upsert that people seem happy
with.
This I am generally in agreement with, to the extent that 'upsert' is
something we really want and we should figure out a way to get there
from here, but it wouldn't be the first time that we worked out a
better solution than existing implementations. So, another '+1' from me
wrt your working this issue and please don't get too discouraged that
there's a lot of pressure to find a magic bullet - I think part of it is
exactly because everyone wants this and wants it to be better than
what's out there today.
Thanks,
Stephen
Hi Stephen,
On Fri, Sep 20, 2013 at 6:55 PM, Stephen Frost <sfrost@snowman.net> wrote:
I'm not sure I follow this completely- you're saying that a definition
of 'upsert' which includes having to lock rows which aren't in your
current snapshot (for reasons stated) isn't a useful one. Is the
implication that a useful definition of 'upsert' is that it *doesn't*
have to lock rows which aren't in your current snapshot, and if so, then
what would the semantics of that upsert look like?
No, I'm suggesting that the useful semantics are that it does
potentially lock rows not yet visible to our snapshot that have
committed - the latest row version. I see no alternative (we can't
throw a serialization failure at read committed isolation level), and
Andres seemed to agree that this was the way forward. Robert described
problems he saw with this a few years ago [1]. It *is* a problem (we
need to think very carefully about it), but, as I've said, it is a
problem that anyone implementing this feature for a Snapshot
Isolation/MVCC database would have to deal with, and several have.
So, what the patch does right now is (if you squint) analogous to how
SELECT FOR UPDATE uses EvalPlanQual already. However, instead of
re-verifying a qual, we're re-verifying that the value locking has
identified the right tid (there will probably be a different one in
the subsequent iteration, or maybe we *can* go insert this time). We
need consensus across unique indexes to go ahead with insertion, but
once we know that we can't (and have a tid to lock), value locks can
go away - we'll know if anything has changed about the tid's logical
row that we need to care about when row locking. Besides, holding
value locks while row locking has deadlock hazards, and, because value
locks only stop insertions *finishing*, holding on to them is at best
pointless.
The tid we get from locking, that points to a would-be duplicate heap
tuple has always committed - otherwise we'd never return from locking,
because that blocks pending the outcome of a duplicate-inserting-xact
(and only returns the tid when that xact commits). Even though this
tuple is known to be visible, it may be deleted in the interim before
row locking, in which case restarting from before value locking is
appropriate. It might also be updated, which would necessitate locking
a later row version in order to prevent race conditions. But it seems
dangerous, invasive, and maybe even generally impossible to try and
wait for the transaction that updated to commit or abort so that we
can lock that later version the usual way (the usual EvalPlanQual
looping thing) - better to restart value locking.
The fundamental observation about value locking (at least for any
half-way reasonable implementation), that I'd like to emphasize, is
that short of a radical overhaul that would have many downsides, it
can only ever prevent insertion from *finishing*. The big picture of
my design is that it tries to quickly grab value locks, release them
and grab a row lock (or insert heap tuples, index tuples, and then
release value locks). If row locking fails, it waits for the
conflicter xact to finish, and then restarts before the value locking
of the current slot. If you think that's kind of questionable, maybe
you have a point, but consider:
1. How else are you going to handle it if row locking needs to handle
conflicts? You might say "I can re-verify that no unique index columns
were affected instead", and maybe you can, but what if that doesn't
help because they *were* changed? Besides, doesn't this break the
amcanunique contract? Surely judging what's really a duplicate is the
AM's job.
You're back to "I need to throw an error to get out of this but I have
no good excuse to do so at read committed" -- you've lost the usual
duplicate key error "excuse". I don't think you can expect holding the
value locks throughout row locking to help, because, as I've said,
that causes morally indefensible deadlocks, and besides, it doesn't
stop what row locking would consider to be a conflict, it just stops
insertion from *finishing*.
2. In the existing btree index insertion code, the order that retries
occur in the event of unique index tuple insertion finding an
unfinished conflicting xact *is* undefined. Yes, that's right - the
first waiter is not guaranteed to be the first to get a second chance.
It's not even particularly probable! See remarks from my last mail to
Robert for more information.
3. People with a real purist's view on the (re)ordering of value
locking must already think that EvalPlanQual() is completely ghetto
for very similar reasons, and as such should just go use a higher
isolation level. For the rest of us, what concurrency control anomaly
can allowing this cause over and above what's already possible there?
Are lock starvation risks actually appreciably raised at all?
What Andres and Robert seem to expect generally - that value locks
only be released when the locker has a definitive answer - actually
*can* be ensured at the higher isolation levels, where the system has
license to bail out by throwing a serialization failure. The trick
there is just to throw an error if the first *retry* at cross-index
value locking is unsuccessful or blocks on a whole other xact -- a
serialization error (and not a unique constraint violation error, as
would often but not necessarily otherwise occur for non-upserters).
Naturally, it could also happen that at > read committed, row locking
throws a serialization failure (as is always mandated over using
EvalPlanQual() or other monkeying around at higher isolation levels).
This I am generally in agreement with, to the extent that 'upsert' is
something we really want and we should figure out a way to get there
from here, but it wouldn't be the first time that we worked out a
better solution than existing implementations. So, another '+1' from me
wrt your working this issue and please don't get too discouraged that
there's a lot of pressure to find a magic bullet.
Thanks for the encouragement!
[1]: /messages/by-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com
--
Peter Geoghegan
On Sun, Sep 15, 2013 at 8:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I'll find it very difficult to accept any implementation that is going
to bloat things even worse than our upsert looping example.
How would any even halfway sensible example cause *more* bloat than the
upsert looping thing?
I was away in Chicago over the week, and didn't get to answer this.
Sorry about that.
In the average/uncontended case, the subxact example bloats less than
all alternatives to my design proposed to date (including the "unborn
heap tuple" idea Robert mentioned in passing to me in person the other
day, which I think is somewhat similar to a suggestion of Heikki's
[1]). The average case is very important, because in general
contention usually doesn't happen. But you need to also appreciate
that because of the way row locking works and the guarantees value
locking makes, any ON DUPLICATE KEY LOCK FOR UPDATE implementation is
going to have to potentially restart in more places (as compared to
the doc's example), maybe including value locking of each unique index
and certainly including row locking. So the contended case might even
be worse as well.
On average, it is quite likely that either the UPDATE or INSERT will
succeed - there has to be some concurrent activity around the same
values for either to fail, and in general that's quite unlikely. If
the UPDATE doesn't succeed, it won't bloat, and it's then very likely
that the INSERT at the end of the loop will go ahead and succeed
without itself creating bloat.
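For anyone following along, the "upsert looping example"/"subxact
example" under discussion is essentially the retry loop from the
documentation. A minimal sketch, assuming a table db(a int4 primary
key, b text) - the EXCEPTION block implies a subtransaction, which is
where the "subxact" cost comes from:

create or replace function merge_db(key int4, data text) returns void as
$$
begin
    loop
        -- first try to update an existing row
        update db set b = data where a = key;
        if found then
            return;
        end if;
        -- not there, so try to insert; if another session inserts the
        -- same key concurrently, we get a unique_violation and loop
        -- around to retry the update
        begin
            insert into db (a, b) values (key, data);
            return;
        exception when unique_violation then
            -- do nothing, and loop to try the update again
        end;
    end loop;
end;
$$ language plpgsql;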
Going forward with this discussion, I would like us all to take as
read that the buffer locking stuff is a prototype approach to value
locking, to be refined later (subject to my basic design being judged
fundamentally sound). I don't think anyone believes that it's
fundamentally incorrect in that it doesn't do something that it claims
to do (concerns are more around what it might do or prevent that it
shouldn't), and it can still drive discussion in a very useful
direction. So far criticism of this patch has almost entirely been on
aspects of buffer locking, but it would be much more useful for the
time being to simply assume that the buffer locks *are* interruptible.
It's probably okay with me to still be a bit suspicious of
deadlocking, though, because if we refine the buffer locking using a
more granular SLRU value locking approach, that doesn't necessarily
guarantee that it's impossible, even if it does (I guess) prevent
undesirable interactions with other buffer locking.
[1]: /messages/by-id/45E845C4.6030000@enterprisedb.com
--
Peter Geoghegan
On Fri, Sep 20, 2013 at 5:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
ProcLockWakeup() only wakes as many waiters from the head of the queue
as can all be granted the lock without any conflicts. So I don't
think there is a race condition in that path.
Right, but what about XactLockTableWait() itself? It only acquires a
ShareLock on the xid of the got-there-first inserter that potentially
hasn't yet committed/aborted. There will be no conflicts between
multiple second-chance-seeking blockers trying to acquire this lock
concurrently, and so in fact there is (what I guess you'd consider to
be) a race condition in the current btree insertion code.
I should add: README.tuplock says the following:
"""
The protocol for waiting for a tuple-level lock is really
LockTuple()
XactLockTableWait()
mark tuple as locked by me
UnlockTuple()
When there are multiple waiters, arbitration of who is to get the lock next
is provided by LockTuple().
"""
So because this isn't a tuple-level lock - it's really a value-level
lock - LockTuple() is not called by the btree code at all, and so
arbitration of who gets the lock is, as I've said, essentially
undefined.
--
Peter Geoghegan
On Sat, Sep 21, 2013 at 7:22 PM, Peter Geoghegan <pg@heroku.com> wrote:
So because this isn't a tuple-level lock - it's really a value-level
lock - LockTuple() is not called by the btree code at all, and so
arbitration of who gets the lock is, as I've said, essentially
undefined.
Addendum: It isn't even a value-level lock, because the buffer locks
are of course released before the XactLockTableWait() call. It's a
simple attempt to acquire a shared lock on an xid.
--
Peter Geoghegan
Hi,
I don't have time to answer the other emails today (elections,
climbing), but maybe you could clarify the below?
On 2013-09-21 17:07:11 -0700, Peter Geoghegan wrote:
On Sun, Sep 15, 2013 at 8:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I'll find it very difficult to accept any implementation that is going
to bloat things even worse than our upsert looping example.
How would any even halfway sensible example cause *more* bloat than the
upsert looping thing?
I was away in Chicago over the week, and didn't get to answer this.
Sorry about that.
In the average/uncontended case, the subxact example bloats less than
all alternatives to my design proposed to date (including the "unborn
heap tuple" idea Robert mentioned in passing to me in person the other
day, which I think is somewhat similar to a suggestion of Heikki's
[1]). The average case is very important, because in general
contention usually doesn't happen.
I can't follow here. Why does e.g. the promise tuple approach bloat more
than the subxact example?
The protocol is roughly:
1) Insert index pointer containing an xid to be waiting upon instead of
the target tid into all indexes
2) Insert heap tuple, we can be sure there's no conflict now
3) Go through the indexes and repoint the item to point to the tid of the
heaptuple instead of the xid.
There's zero heap or index bloat in the uncontended case. In the
contended case it's just the promise tuples from 1) that are inserted
before the conflict is detected. Those can be marked as dead when the
conflict happened.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Sep 22, 2013 at 2:10 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I can't follow here. Why does e.g. the promise tuple approach bloat more
than the subxact example?
The protocol is roughly:
1) Insert index pointer containing an xid to be waiting upon instead of
the target tid into all indexes
2) Insert heap tuple, we can be sure there's no conflict now
3) Go through the indexes and repoint the item to point to the tid of the
heaptuple instead of the xid.
There's zero heap or index bloat in the uncontended case. In the
contended case it's just the promise tuples from 1) that are inserted
before the conflict is detected. Those can be marked as dead when the
conflict happened.
It depends on your definition of the contended case. You're assuming
that insertion is the most probable outcome, when in fact much of the
time updating is just as likely or even more likely. Many promise
tuples may be inserted before actually seeing a conflict and deciding
to update/lock for update. In order for the example in the docs to
bloat at all, both the UPDATE and the INSERT need to fail within a
tiny temporal window - that's what I mean by uncontended (it is
usually tiny because if the UPDATE blocks, that often means it will
succeed anyway, but if not the INSERT will very probably succeed).
This is because the UPDATE won't bloat when no existing row is seen,
because its subplan will return no rows. The INSERT will only bloat if
it fails, which is generally very unlikely because of the fact that
the UPDATE just did nothing. Contrast that with bloating almost every
time an UPDATE is necessary (I think that bloat that is generally
cleaned up synchronously is still bloat). That's before we even talk
about the additional overhead. Making the locks expensive to
release/clean-up could really hurt, since it appears they'll *have* to
be unlocked before row locking, and during that time concurrent
activity affecting the row to be locked can necessitate a full restart
- that's a window we want to keep as small as possible.
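To spell out when the doc's example can bloat at all - a sketch of one
loop iteration, assuming a table db(a int4 primary key, b text):

update db set b = 'x' where a = 5;      -- matches no row: nothing is written, no bloat
insert into db (a, b) values (5, 'x');  -- bloats only if this now fails with
                                        -- unique_violation (someone else inserted
                                        -- a = 5 in the interim), leaving the
                                        -- already-written heap tuple dead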
I think reviewer time would for now be much better spent discussing
the patch at a higher level (along the lines of my recent mail to
Stephen and Robert). I've been at least as guilty as anyone else in
getting mired in these details. We'll be much better equipped to have
this discussion afterwards, because it isn't clear to us if we really
need or would find it at all useful to have long-lasting value locks,
how frequently we'll need to retry and for what reasons, and so on.
My immediate concern as the patch author is to come up with a better
answer to the problem that Robert described [1], because "hey, I
locked the row -- you take it from here user that might not have any
version of it visible to you" is not good enough. I hope that there
isn't a tension between solving that problem and offering the
flexibility and composability of the proposed syntax.
[1]: /messages/by-id/AANLkTineR-rDFWENeddLg=GrkT+epMHk2j9X0YqpiTY8@mail.gmail.com
--
Peter Geoghegan
On 2013-09-22 12:54:57 -0700, Peter Geoghegan wrote:
On Sun, Sep 22, 2013 at 2:10 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I can't follow here. Why does e.g. the promise tuple approach bloat more
than the subxact example?
The protocol is roughly:
1) Insert index pointer containing an xid to be waiting upon instead of
the target tid into all indexes
2) Insert heap tuple, we can be sure there's no conflict now
3) Go through the indexes and repoint the item to point to the tid of the
heaptuple instead of the xid.
There's zero heap or index bloat in the uncontended case. In the
contended case it's just the promise tuples from 1) that are inserted
before the conflict is detected. Those can be marked as dead when the
conflict happened.
It depends on your definition of the contended case. You're assuming
that insertion is the most probable outcome, when in fact much of the
time updating is just as likely or even more likely. Many promise
tuples may be inserted before actually seeing a conflict and deciding
to update/lock for update.
I still fail to see how that's relevant. For every index there's two
things that can happen:
a) there's a conflicting tuple. In that case we can fail at that
point/convert to an update. No Bloat.
b) there's no conflicting tuple. In that case we will insert a promise
tuple. If there's no conflict in further indexes (i.e. we INSERT), the
promise will converted to a plain tuple. If there *is* a further
conflict, you *still* need the new index tuple because by definition
(the index changed) it cannot be an HOT update. So you convert it as
well. No Bloat.
I think that bloat that is generally cleaned up synchronously is still
bloat
I don't think it's particularly relevant because the above will just
cause bloat in case of rollbacks and such which is nothin new, but:
I fail to see the point of such a position.
I think reviewer time would for now be much better spent discussing
the patch at a higher level (along the lines of my recent mail to
Stephen and Robert).
Yes, I plan to reply to those, I just didn't have time to do so this
weekend. There's other stuff than PG every now and then ;)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Sep 22, 2013 at 1:39 PM, Andres Freund <andres@2ndquadrant.com> wrote:
I still fail to see how that's relevant. For every index there's two
things that can happen:
a) there's a conflicting tuple. In that case we can fail at that
point/convert to an update. No Bloat.
Well, yes - if the conflict is in the first unique index you look at.
b) there's no conflicting tuple. In that case we will insert a promise
tuple.
Yeah, if there is no conflict relating to any of the tuples, the cost
is limited to updating the promise tuples in-place. Not exactly a
trivial additional cost even then though, because you have to
exclusive lock and WAL-log twice per index tuple.
If there's no conflict in further indexes (i.e. we INSERT), the
promise will be converted to a plain tuple.
Sure.
If there *is* a further
conflict, you *still* need the new index tuple because by definition
(the index changed) it cannot be an HOT update.
By definition? What do you mean? This isn't MySQL's REPLACE. This
feature is almost certainly going to tacitly require the user to write
the upsert SQL with a particular unique index in mind (to figure that
out for ourselves, we'd need to somehow ask/infer, which is ugly/very
hard to impossible). The UPDATE, as typically written, probably
*won't* actually update any of the other, incidentally
unique-constrained/value locked columns that we have to check in case
that's what the user really meant, and very probably not the
"interesting" column appearing in the UPDATE qual itself, so it
probably *will* be a HOT update.
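To make the HOT update point concrete - a sketch, assuming a table
db(a int4 primary key, b text), and noting that HOT also requires
enough free space on the same heap page:

update db set b = 'new' where a = 5;  -- no indexed column changes: HOT is possible
update db set a = 6 where a = 5;      -- changes the unique column: cannot be HOT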
So you convert it as
well. No Bloat.
Even if this is a practical possibility, which I doubt, the book
keeping sounds very messy and invasive indeed.
Yes, I plan to reply to those, I just didn't have time to do so this
weekend.
Great, thanks. I cannot strongly emphasize enough how I think that's
the way to frame all of this. So much so that I almost managed to
resist answering the above points. :-)
There's other stuff than PG every now and then ;)
Hope you enjoyed the hike.
--
Peter Geoghegan
On Fri, Sep 20, 2013 at 8:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Sep 17, 2013 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg@heroku.com> wrote:
Note that today there is no guarantee that the original waiter for a
duplicate-inserting xact to complete will be the first one to get a
second chance
ProcLockWakeup() only wakes as many waiters from the head of the queue
as can all be granted the lock without any conflicts. So I don't
think there is a race condition in that path.
Right, but what about XactLockTableWait() itself? It only acquires a
ShareLock on the xid of the got-there-first inserter that potentially
hasn't yet committed/aborted.
That's an interesting point. As you pointed out in later emails, that
case is handled for heap tuple locks, but btree uniqueness conflicts
are a different kettle of fish.
Yeah, you're right. As I mentioned to Andres already, when row locking
happens and there is this kind of conflict, my approach is to retry
from scratch (go right back to before value lock acquisition) in the
sort of scenario that generally necessitates EvalPlanQual() looping,
or to throw a serialization failure where that's appropriate. After an
unsuccessful attempt at row locking there could well be an interim
wait for another xact to finish, before retrying (at read committed
isolation level). This is why I think that value locking/retrying
should be cheap, and should avoid bloat if at all possible.
Forgive me if I'm making a leap here, but it seems like what you're
saying is that the semantics of upsert that one might naturally expect
are *arguably* fundamentally impossible, because they entail
potentially locking a row that isn't current to your snapshot,
Precisely.
and you cannot throw a serialization failure at read committed.
Not sure that's true, but at least it might not be the most desirable behavior.
I respectfully
suggest that that exact definition of upsert isn't a useful one,
because other snapshot isolation/MVCC systems operating within the
same constraints must have the same issues, and yet they manage to
implement something that could be called upsert that people seem happy
with.
Yeah. I wonder how they do that.
I wouldn't go that far. The number of possible additional primitives
that are useful isn't that high, unless we decide that LWLocks are
going to be a fundamentally different thing, which I consider
unlikely.
I'm not convinced, but we can save that argument for another day.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Right, but what about XactLockTableWait() itself? It only acquires a
ShareLock on the xid of the got-there-first inserter that potentially
hasn't yet committed/aborted.
That's an interesting point. As you pointed out in later emails, that
case is handled for heap tuple locks, but btree uniqueness conflicts
are a different kettle of fish.
Right.
It suits my purposes to have the value locks be held for only an
instant, because:
1) It will perform much better and deadlock much less in certain
scenarios if sessions are given leeway to not block each other across
multiple values in multiple unique indexes (i.e. we make them
"considerate", like a person with a huge shopping cart that lets
another person with one thing ahead of them in a queue, and perhaps in
doing so greatly reduces their own wait because the guy with one thing
makes the cashier immediately say 'no' to the person with all those
groceries. Ahem). I don't think that this implies any additional
anomalies at read committed, and I'm reasonably confident that this
doesn't regress things to any degree lock starvation wise - lock
starvation can only come from a bunch of inserters of the same value
that consistently abort, just like the present situation with one
unique index (I think it's better with multiple unique indexes than
with only one - more opportunities for the would-be-starved session to
hear a definitive no answer and give up).
2) It will probably be considerably easier if and when we improve on
the buffer locking stuff (by adding a value locking SLRU) to assume
that they'll be shortly held. For example, maybe it's okay that the
implementation doesn't allow page splits on value-locked pages, and
maybe that makes things much easier to reason about. If you're
determined to have a strict serial ordering of value locking *without
serialization failures*, I think what I've already said about the
interactions between row locking and value locking demonstrates that
that's close to or actually impossible. Plus, it would really suck for
performance if that SLRU had to actually swap value locks to and from
disk, which becomes a real possibility if they're really long held
(mere index scans aren't going to keep the cache warm, so the
worst-case latency for an innocent inserter into some narrow range of
values might be really bad).
Speaking of ease of implementation, how do you guarantee that the
value locking waiters get the right to insert in serial order (if
that's something that you value, which I don't at RC)? You have to fix
the same "race" that already exists when acquiring a ShareLock on an
xid, and blocking on value lock acquisition. The only possible remedy
I can see for that is to integrate heap and btree locking in a much
more intimate and therefore sketchy way. You need something like
LockTuple() to arbitrate ordering, but what, and how, and where, and
with how many buffer locks held?
Most importantly:
3) As I've already mentioned, heavy value locks (like promise tuples
or similar schemes, as opposed to actual heavyweight locks)
concomitantly increase the window in which a conflict can be created
for row locking. Most transactions last but an instant, and so the
fact that other session may already be blocked locking on the
would-be-duplicate row perhaps isn't that relevant. Doing all that
clean-up is going to give other sessions increased opportunity to lock
the row themselves, and ruin our day.
But these points are about long held value locks, not the cost of
making their acquisition relatively expensive or inexpensive (but
still more or less instantaneous), so why mention that at all? Well,
since we're blocking everyone else with our value locks, they get to
have a bad day too. All the while, they're perhaps virtually
pre-destined to find some row to lock, but the window for something to
happen to that row for that to conflict with eventual row locking (to
*unnecessarily* conflict, as for example when an innocuous HOT update
occurs) gets larger and larger as they wait longer and longer on value
locks. Loosely speaking, things get multiplicatively worse - total
gridlock is probably possible, with the deadlock detector only
breaking the gridlock up a bit if we get lucky (unless, maybe, if
value locks last until transaction end...which I think is nigh on
impossible anyway).
The bottom line is that long lasting value locks - value locks that
last the duration of a transaction and are acquired serially, while
guaranteeing that the inserter that gets all the value locks needed
itself gets to insert - have the potential to cascade horribly, in
ways that I can only really begin to reason about. That is, they do
*if* we presume that they have the interactions with row locking that
I believe they do, a belief that no one has taken issue with yet.
Even *considering* this is largely academic, though, because without
some kind of miracle guaranteeing serial ordering, a miracle that
doesn't allow for serialization failures and also doesn't seriously
slow down, for example, updates (by making them care about value locks
*before* they do anything, or in the case of HOT updates *at all*),
all of this is _impossible_. So, I say let's just do the
actually-serial-ordering for value lock acquisition with serialization
failures where we're > read committed. I've seriously considered what
it would take to do it any other way so things would work how you and
Andres expect for read committed, and it makes my head hurt, because
apart from seeming unnecessary to me, it also seems completely
hopeless.
Am I being too negative here? Well, I guess that's possible. The fact
is that it's really hard to judge, because all of this is really hard
to reason about. That's what I really don't like about it.
I respectfully
suggest that that exact definition of upsert isn't a useful one,
because other snapshot isolation/MVCC systems operating within the
same constraints must have the same issues, and yet they manage to
implement something that could be called upsert that people seem happy
with.
Yeah. I wonder how they do that.
My guess is that they have some fancy snapshot type that is used by
the equivalent of ModifyTable subplans, that is appropriately paranoid
about the Halloween problem and so on. How that actually might work is
far from clear, but it's a question that I have begun to consider. As
I said, a concern is that it would be in tension with the generalized,
composable syntax, where we don't explicitly have a "special update".
I'd really like to hear other opinions, though.
--
Peter Geoghegan
On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
and you cannot throw a serialization failure at read committed.
Not sure that's true, but at least it might not be the most desirable behavior.
I'm pretty sure that that's totally true. "You don't have to worry
about serialization failures at read committed, except when you do"
seems kind of weak to me. Especially since none of the usual suspects
say the same thing. That said, it sure would be convenient if it
wasn't true!
--
Peter Geoghegan
Hi,
Various messages are discussing semantics around visibility. By now I
have a hard time keeping track. So let's keep the discussion of the
desired semantics to this thread.
There have been some remarks about serialization failures in read
committed transactions. I agree, those shouldn't occur. But I don't
actually think they are so much of a problem if we follow the path set
by existing uses of the EPQ logic. The scenario described seems to be an
UPSERT conflicting with a row it cannot see in the original snapshot of
the query.
In that case I think we just have to follow the example laid by
ExecUpdate, ExecDelete and heap_lock_tuple. Use the EPQ machinery (or an
alternative approach with similar enough semantics) to get a new
snapshot and follow the ctid chain. When we've found the end of the
chain we try to update that tuple.
That surely isn't free of surprising semantics, but it would follow existing
semantics. Which everybody writing concurrent applications in read
committed should (but doesn't) know. Adding a different set of semantics
seems like a bad idea.
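To be concrete about the existing read committed behavior I mean - a
minimal two-session sketch, with a table db(a int4 primary key, b text)
assumed for illustration:

-- session 1:
begin;
update db set b = 'one' where a = 5;
-- transaction left open for a moment

-- session 2, concurrently:
update db set b = 'two' where a = 5;
-- blocks on session 1's row lock

-- session 1:
commit;

-- session 2 now wakes up: EvalPlanQual re-checks the qual against the
-- latest row version (following the ctid chain) and updates it, even
-- though that version was never visible to the snapshot its UPDATE
-- started out with.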
Robert seems to have been the primary sceptic around this, what scenario
are you actually concerned about?
There are some scenarios that this doesn't trivially answer. But I'd like to
understand the primary concerns first.
Regards,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 23, 2013 at 7:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
It suits my purposes to have the value locks be held for only an
instant, because:[ detailed explanation ]
I don't really disagree with any of that. TBH, I think the question
of how long value locks (as you call them) are held is going to boil
down to a question of how they end up being implemented. As I
mentioned to you at PG Open (going through the details here for those
following along at home), we could optimistically insert the new heap
tuple, then go add index entries for it, and if we find a conflict,
then instead of erroring out, we mark the tuple we were inserting dead
and go try to update the conflicting tuple instead. In that
implementation, if we find that we have to wait for some other
transaction along the way, it's probably not worth reversing out the
index entries already inserted, because getting them into the index in
the first place was a WAL-logged operation, and therefore relatively
expensive, and IMHO it's most likely better to just hope things work
out than to risk having to redo all of that.
On the other hand, if the locks are strictly in-memory, then the cost
of releasing them all before we go to wait, and of reacquiring them
after we finish waiting, is pretty low. There might be some
modularity issues to work through there, but they might not turn out
to be very painful, and the advantages you mention are certainly worth
accruing if it turns out to be fairly straightforward.
Personally, I think that trying to keep it all in-memory is going to
be hard. The problem is that we can't de-optimize regular inserts or
updates to any significant degree to cater to this feature - because
as valuable as this feature is, the number of times it gets used is
still going to be a whole lot smaller than the number of times it
doesn't get used. Also, I tend to think that we might want to define
the operation as a REPLACE-type operation with respect to a certain
set of key columns; and so we'll do the insert-or-update behavior with
respect only to the index on those columns and let the chips fall
where they may with respect to any others. In that case this all
becomes much less urgent.
Even *considering* this is largely academic, though, because without
some kind of miracle guaranteeing serial ordering, a miracle that
doesn't allow for serialization failures and also doesn't seriously
slow down, for example, updates (by making them care about value locks
*before* they do anything, or in the case of HOT updates *at all*),
all of this is _impossible_. So, I say let's just do the
actually-serial-ordering for value lock acquisition with serialization
failures where we're > read committed. I've seriously considered what
it would take to do it any other way so things would work how you and
Andres expect for read committed, and it makes my head hurt, because
apart from seeming unnecessary to me, it also seems completely
hopeless.
Am I being too negative here? Well, I guess that's possible. The fact
is that it's really hard to judge, because all of this is really hard
to reason about. That's what I really don't like about it.
Suppose we define the operation as REPLACE rather than INSERT...ON
DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this:
1. Try to insert a tuple. If no unique index conflicts occur, stop.
2. Note the identity of the conflicting tuple and mark the inserted
heap tuple dead.
3. If the conflicting tuple's inserting transaction is still in
progress, wait for the inserting transaction to end.
4. If the conflicting tuple is dead (e.g. because the inserter
aborted), start over.
5. If the conflicting tuple's key columns no longer match the key
columns of the REPLACE operation, start over.
6. If the conflicting tuple has a valid xmax, wait for the deleting or
locking transaction to end. If xmax is still valid, follow the CTID
chain to the updated tuple, let that be the new conflicting tuple, and
resume from step 5.
7. Update the tuple, even though it may be invisible to our snapshot
(a deliberate MVCC violation!).
While this behavior is admittedly wonky from an MVCC perspective, I
suspect that it would make a lot of people happy.
I respectfully
suggest that that exact definition of upsert isn't a useful one,
because other snapshot isolation/MVCC systems operating within the
same constraints must have the same issues, and yet they manage to
implement something that could be called upsert that people seem happy
with.
Yeah. I wonder how they do that.
My guess is that they have some fancy snapshot type that is used by
the equivalent of ModifyTable subplans, that is appropriately paranoid
about the Halloween problem and so on. How that actually might work is
far from clear, but it's a question that I have begun to consider. As
I said, a concern is that it would be in tension with the generalized,
composable syntax, where we don't explicitly have a "special update".
I'd really like to hear other opinions, though.
The tension here feels fairly fundamental to me; I don't think our
implementation is to blame. I think the problem isn't so much to
figure out a clever trick that will make this all work in a truly
elegant fashion as it is to decide exactly how we're going to
compromise MVCC semantics in the least blatant way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 24, 2013 at 5:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Various messages are discussing semantics around visibility. I by now
have a hard time keeping track. So let's keep the discussion of the
desired semantics to this thread.
There have been some remarks about serialization failures in read
committed transactions. I agree, those shouldn't occur. But I don't
actually think they are so much of a problem if we follow the path set
by existing uses of the EPQ logic. The scenario described seems to be an
UPSERT conflicting with a row it cannot see in the original snapshot of
the query.
In that case I think we just have to follow the example laid by
ExecUpdate, ExecDelete and heap_lock_tuple. Use the EPQ machinery (or an
alternative approach with similar enough semantics) to get a new
snapshot and follow the ctid chain. When we've found the end of the
chain we try to update that tuple.
That surely isn't free of surprising semantics, but it would follow existing
semantics. Which everybody writing concurrent applications in read
committed should (but doesn't) know. Adding a different set of semantics
seems like a bad idea.
Robert seems to have been the primary sceptic around this, what scenario
are you actually concerned about?
I'm not skeptical about offering it as an option; in fact, I just
suggested basically the same thing on the other thread, before reading
this. Nonetheless it IS an MVCC violation; the chances that someone
will be able to demonstrate serialization anomalies that can't occur
today with this new facility seem very high to me. I feel it's
perfectly fine to respond to that by saying: yep, we know that's
possible, if it's a concern in your environment then don't use this
feature. But it should be clearly documented.
I do think that it will be easier to get this to work if we
define the operation as REPLACE, bundling all of the magic inside a
single SQL command. If the user issues an INSERT first and then must
try an UPDATE afterwards if the INSERT doesn't actually insert, then
you're going to have problems if the UPDATE can't see the tuple with
which the INSERT conflicted, and you're going to need some kind of a
loop in case the UPDATE itself fails. Even if we can work out all the
details, a single command that does insert-or-update seems like it
will be easier to use and more efficient. You might also want to
insert multiple tuples using INSERT ... VALUES (...), (...), (...);
figuring out which ones were inserted and which ones must now be
updated seems like a chore better avoided.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 24, 2013 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't really disagree with any of that. TBH, I think the question
of how long value locks (as you call them) are held is going to boil
down to a question of how they end up being implemented.
Well, I think we can rule out value locks that are held for the
duration of a transaction right away. That's just not going to fly.
As I mentioned to you at PG Open (going through the details here for those
following along at home), we could optimistically insert the new heap
tuple, then go add index entries for it, and if we find a conflict,
then instead of erroring out, we mark the tuple we were inserting dead
and go try to update the conflicting tuple instead. In that
implementation, if we find that we have to wait for some other
transaction along the way, it's probably not worth reversing out the
index entries already inserted, because getting them into the index in
the first place was a WAL-logged operation, and therefore relatively
expensive, and IMHO it's most likely better to just hope things work
out than to risk having to redo all of that.
I'm afraid that there are things that concern me about this design. It
does have one big advantage over promise-tuples, which is that the
possibility of index-only bloat, and even the possible need to freeze
indexes separately from their heap relation is averted (or are you
going to have recovery do promise clean-up instead? Does recovery look
for an eventual successful insertion relating to the promise? How far
does it look?). However, while I'm just as concerned as you that
backing out is too expensive, I'm equally concerned that there is no
reasonable alternative to backing out, which is why cheap, quick
in-memory value locks are so compelling to me. See my remarks below.
On the other hand, if the locks are strictly in-memory, then the cost
of releasing them all before we go to wait, and of reacquiring them
after we finish waiting, is pretty low. There might be some
modularity issues to work through there, but they might not turn out
to be very painful, and the advantages you mention are certainly worth
accruing if it turns out to be fairly straightforward.
It's certainly a difficult situation to judge.
Personally, I think that trying to keep it all in-memory is going to
be hard. The problem is that we can't de-optimize regular inserts or
updates to any significant degree to cater to this feature - because
as valuable as this feature is, the number of times it gets used is
still going to be a whole lot smaller than the number of times it
doesn't get used.
Right - I don't think that anyone would argue that any other standard
should be applied. Fortunately, I'm reasonably confident that it can
work. The last part of index tuple insertion, where we acquire an
exclusive lock on a buffer, needs to look out for a page header bit
(on pages considered for insertion of its value). The cost of that to
anyone not using this feature is likely to be infinitesimally small.
We can leave clean-up of that bit to the next inserter, who needs the
exclusive lock anyway and doesn't find a corresponding SLRU entry. But
really, that's a discussion for another day. I think we'd want to
track value locks per pinned-by-upserter buffer, to localize any
downsides on concurrency. If we forbid page-splits in respect of a
value-locked page, we can still have a stable value (buffer number) to
use within a shared memory hash table, or something along those lines.
We're still going to want to minimize the duration of locking under
this scheme, by doing TOASTing before locking values and so on, which
is quite possible.
If we're really lucky, maybe the value locking stuff can be
generalized or re-used as part of a btree index insertion buffer
feature.
Also, I tend to think that we might want to define
the operation as a REPLACE-type operation with respect to a certain
set of key columns; and so we'll do the insert-or-update behavior with
respect only to the index on those columns and let the chips fall
where they may with respect to any others. In that case this all
becomes much less urgent.
Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT,
not try an INSERT, then maybe mark the heap tuple if there's a unique
index dup and then go UPDATE the conflicting tuple. I mention this
only because the term REPLACE has a certain baggage, and I feel it's
important to be careful about such things.
The only way that's going to work is if you say "use this unique
index", which will look pretty gross in DML. That might actually be
okay with me if we had somewhere to go from there in a future release,
but I doubt that's the case. Another issue is that I'm not sure that
this helps Andres much (or rather, clients of the logical changeset
generation infrastructure that need to do conflict resolution), and
that matters a lot to me here.
Suppose we define the operation as REPLACE rather than INSERT...ON
DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this:
1. Try to insert a tuple. If no unique index conflicts occur, stop.
2. Note the identity of the conflicting tuple and mark the inserted
heap tuple dead.
3. If the conflicting tuple's inserting transaction is still in
progress, wait for the inserting transaction to end.
Sure, this is basically what the code does today (apart from marking a
just-inserted tuple dead).
4. If the conflicting tuple is dead (e.g. because the inserter
aborted), start over.
Start over from where? I presume you mean the index tuple insertion,
as things are today. Or do you mean the very start?
5. If the conflicting tuple's key columns no longer match the key
columns of the REPLACE operation, start over.
What definition of equality or inequality? I think you're going to
have to consider stashing information about the btree operator class,
which seems not ideal - a modularity violation beyond what we already
do in, say, execQual.c, I think. I think in general we have to worry
about the distinction between a particular btree operator class's idea
of equality (doesn't have to be = operator), that exists for a
particular index, and some qual's idea of equality. It would probably
be quite invasive to fix this, which I for one would find hard to
justify.
I think my scheme is okay here while yours isn't, because mine
involves row locking only, and hoping that nothing gets updated in
that tiny window after transaction commit - if it doesn't, that's good
enough for us, because we know that the btree code's opinion still
holds - if I'm not mistaken, *nothing* can have changed to the logical
row without us hearing about it (i.e. without heap_lock_tuple()
returning HeapTupleUpdated). On the other hand, you're talking about
concluding that something is not a duplicate in a way that needs to
satisfy btree unique index equality (so whatever operator is
associated with btree strategy number 3, equality, for some particular
unique index with some particular operator class) and not necessarily
just a qual written with a potentially distinct notion of equality in
respect of the relevant unique-constrained datums.
Maybe you can solve this one problem, but the fact remains that to do
so would be a pretty bad modularity violation, even by the standards
of the existing btree code. That's the basic reason why I'm averse to
using EvalPlanQual() in this fashion, or in a similar fashion. Even if
you solve all the problems for btree, I can't imagine what type of
burden it puts on amcanunique AM authors generally - I know at least
one person who won't be happy with that. :-)
6. If the conflicting tuple has a valid xmax, wait for the deleting or
locking transaction to end. If xmax is still valid, follow the CTID
chain to the updated tuple, let that be the new conflicting tuple, and
resume from step 5.
So you've arbitrarily restricted us to one value lock and one row lock
per REPLACE slot processed, which sort of allows us to avoid solving
the basic problem of value locking, because it isn't too bad now - no
need to backtrack across indexes. Clean-up (marking the heap tuple
dead) is much more expensive than releasing locks in memory (although
much less expensive than promise tuple killing), but needing to
clean-up is maybe less likely because conflicts can only come from one
unique index. Has this really bought us anything, though? Consider
that conflicts are generally only expected on one unique index anyway.
Plus you still have the disconnect between value and row locking, as
far as I can tell - "start from scratch" remains a possible step until
very late, except you pay a lot more for clean-up - avoiding that
expensive clean-up is the major benefit of introducing an SLRU-based
shadow value locking scheme to the btree code. I don't see that there
is a way to deal with the value locking/row locking disconnect other
than to live with it in a smart way.
Anyway, your design probably avoids the worst kind of gridlock. Let's
assume that it works out -- my next question has to be, where can we
go from there?
7. Update the tuple, even though it may be invisible to our snapshot
(a deliberate MVCC violation!).
I realize that you just wanted to sketch a design, but offhand I think
that the basic problem with what you describe is that it isn't
accepting of the inevitability of there being a disconnect between
value and row locking. Also, this doesn't fit with any roadmap for
getting a real upsert, and compromises the conceptual integrity of the
AM in a way that isn't likely to be accepted, and, at the risk of
saying too much before you've defended your design, perhaps even
necessitates invasive changes to the already extremely complicated row
locking code.
While this behavior is admittedly wonky from an MVCC perspective, I
suspect that it would make a lot of people happy.
"Wonky from an MVCC perspective" is the order of the day here. :-)
My guess is that they have some fancy snapshot type that is used by
the equivalent of ModifyTable subplans, that is appropriately paranoid
about the Halloween problem and so on. How that actually might work is
far from clear, but it's a question that I have begun to consider. As
I said, a concern is that it would be in tension with the generalized,
composable syntax, where we don't explicitly have a "special update".
I'd really like to hear other opinions, though.
The tension here feels fairly fundamental to me; I don't think our
implementation is to blame. I think the problem isn't so much to
figure out a clever trick that will make this all work in a truly
elegant fashion as it is to decide exactly how we're going to
compromise MVCC semantics in the least blatant way.
Yeah, I totally understand the problem that way. I think it would be a
bit of a pity to give up the composability, which I liked, but it's
something that we'll have to consider. On the other hand, perhaps we
can get away with it - we simply don't know enough yet.
--
Peter Geoghegan
On Sat, Sep 21, 2013 at 05:07:11PM -0700, Peter Geoghegan wrote:
In the average/uncontended case, the subxact example bloats less than
all alternatives to my design proposed to date (including the "unborn
heap tuple" idea Robert mentioned in passing to me in person the other
day, which I think is somewhat similar to a suggestion of Heikki's
[1]). The average case is very important, because in general
contention usually doesn't happen.
This thread had a lot of discussion about bloating. I wonder, does the
code check to see if there is a matching row _before_ adding any data?
Our test-and-set code first checks to see if the lock is free, then if
it is, it locks the bus and does a test-and-set. Couldn't we easily
check the indexes for matches before doing any locking? It seems that
would avoid bloat in most cases, and allow for a simpler implementation.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
This thread had a lot of discussion about bloating. I wonder, does the
code check to see if there is a matching row _before_ adding any data?
That's pretty much what the patch does.
Our test-and-set code first checks to see if the lock is free, then if
it is, it locks the bus and does a test-and-set. Couldn't we easily
check the indexes for matches before doing any locking? It seems that
would avoid bloat in most cases, and allow for a simpler implementation.
The value locks are only really necessary for getting consensus across
unique indexes on whether or not to go forward, and to ensure that
insertion can *finish* unhindered once we're sure that's appropriate.
Once we've committed to insertion, we hold them across heap tuple
insertion and release each value lock as part of something close to
conventional btree index tuple insertion (with an index tuple with an
ordinary heap pointer inserted). I believe that all schemes proposed
to date have some variant of what could be described as value locking,
such as ordinary index tuples inserted speculatively.
Value locks are *not* held during row locking, and an attempt at row
locking is essentially opportunistic for various reasons (it boils
down to the fact that re-verifying uniqueness outside of the btree
code is very unappealing, and in any case would naturally sometimes be
insufficient - what happens if key values change across row
versions?). This might sound a bit odd, but is in a sense no different
to the current state of affairs, where the first waiter on a blocking
xact that inserted a would-be duplicate is not guaranteed to be the
first to get a second chance at inserting. I don't believe that there
are any significant additional lock starvation hazards.
In the simple case where there is a conflicting tuple that's already
committed, value locks above and beyond what the btree code does today
are unnecessary (provided the attempt to acquire a row lock is
eventually successful, which mostly means that no one else has
updated/deleted - otherwise we try again).
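To illustrate the race that merely checking for a match first cannot
close on its own (the conflict is only caught inside the unique index)
- a sketch, assuming a table db(a int4 primary key, b text):

-- session 1:
begin;
select 1 from db where a = 5;    -- no row found
insert into db values (5, 'x');  -- proceeds

-- session 2, before session 1 commits:
select 1 from db where a = 5;    -- still no row visible
insert into db values (5, 'y');  -- blocks on session 1's uncommitted insert
                                 -- (the unique index finds the conflicting
                                 -- entry), then raises unique_violation once
                                 -- session 1 commits

-- session 1:
commit;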
--
Peter Geoghegan
On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote:
On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
This thread had a lot of discussion about bloating. I wonder, does the
code check to see if there is a matching row _before_ adding any data?
That's pretty much what the patch does.
So, I guess my question is if we are only bloating on a contended
operation, do we expect that to happen so much that bloat is a problem?
I think the big objection to the patch is the additional code complexity
and the potential to slow down other sessions. If it is only bloating
on a contended operation, are these two downsides worth avoiding the
bloat?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, Sep 26, 2013 at 07:43:15AM -0400, Bruce Momjian wrote:
On Wed, Sep 25, 2013 at 08:48:11PM -0700, Peter Geoghegan wrote:
On Wed, Sep 25, 2013 at 8:19 PM, Bruce Momjian <bruce@momjian.us> wrote:
This thread had a lot of discussion about bloating. I wonder, does the
code check to see if there is a matching row _before_ adding any data?
That's pretty much what the patch does.
So, I guess my question is if we are only bloating on a contended
operation, do we expect that to happen so much that bloat is a problem?
I think the big objection to the patch is the additional code complexity
and the potential to slow down other sessions. If it is only bloating
on a contended operation, are these two downsides worth avoiding the
bloat?
Also, this isn't like the case where we are incrementing sequences --- I
am unclear what workload is going to cause a lot of contention. If two
sessions try to insert the same key, there will be bloat, but later
upsert operations will already see the insert and not cause any bloat.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, Sep 26, 2013 at 4:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
So, I guess my question is if we are only bloating on a contended
operation, do we expect that to happen so much that bloat is a problem?
Maybe I could have done a better job of explaining the nature of my
concerns around bloat.
I am specifically concerned about bloat and the clean-up of bloat that
occurs between (or during) value locking and eventual row locking,
because of the necessarily opportunistic nature of the way we go from
one to the other. Bloat, and the obligation to clean it up
synchronously, make row lock conflicts more likely. Conflicts make
bloat more likely, because a conflict implies that another iteration,
complete with more bloat, is necessary.
When you consider that the feature will frequently be used with the
assumption that updating is a much more likely outcome, it becomes
clear that we need to be careful about this sort of interplay.
Having said all that, I would have no objection to some reasonable,
bound amount of bloat occurring elsewhere if that made sense. For
example, I'd certainly be happy to consider the question of whether or
not it's worth doing a kind of speculative heap insertion before
acquiring value locks, because that doesn't need to happen again and
again in the same, critical place, in the interim between value
locking and row locking. The advantage of doing that particular thing
would be to reduce the duration that value locks are held - the
disadvantages would be the *usual* disadvantages of bloat. However,
this is obviously a premature discussion to have now, because the
eventual exact nature of value locks is not known.
I think the big objection to the patch is the additional code complexity
and the potential to slow down other sessions. If it is only bloating
on a contended operation, are these two downsides worth avoiding the
bloat?
I believe that all other schemes proposed have some degree of bloat
even in the uncontended case, because they optimistically assume than
an insert will occur, when in general an update is perhaps just as
likely, and will bloat just the same. So, as I've said before,
definition of uncontended is important here.
There is no reason to assume that alternative proposals will affect
concurrency any less than my proposal - the buffer locking thing
certainly isn't essential to my design. You need to weigh things like
WAL-logging multiple times, which other proposals have. You're right
to say that all of this is complex, but I really think that quite
apart from anything else, my design is simpler than others. For
example, the design that Robert sketched would introduce a fairly
considerable modularity violation, per my recent remarks to him, and
actually plastering over that would be a considerable undertaking.
Now, you might counter, "but those other designs haven't been worked
out enough". That's true, but then my efforts to work them out further
by pointing out problems with them haven't gone very far. I have
sincerely tried to see a way to make them work.
--
Peter Geoghegan
On Tue, Sep 24, 2013 at 10:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
Well, I think we can rule out value locks that are held for the
duration of a transaction right away. That's just not going to fly.
I think I agree with that. I don't think I remember hearing that proposed.
If we're really lucky, maybe the value locking stuff can be
generalized or re-used as part of a btree index insertion buffer
feature.
Well, that would be nifty.
Also, I tend to think that we might want to define
the operation as a REPLACE-type operation with respect to a certain
set of key columns; and so we'll do the insert-or-update behavior with
respect only to the index on those columns and let the chips fall
where they may with respect to any others. In that case this all
becomes much less urgent.
Well, MySQL's REPLACE does zero or more DELETEs followed by an INSERT,
not try an INSERT, then maybe mark the heap tuple if there's a unique
index dup and then go UPDATE the conflicting tuple. I mention this
only because the term REPLACE has a certain baggage, and I feel it's
important to be careful about such things.
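To make the baggage concrete, here is a hypothetical illustration
(table and values invented for the example) of how a single REPLACE
can take out more than one row:
create table t (a int primary key, b int, unique (b));
insert into t values (1, 10), (2, 20);
replace into t values (1, 20);
-- (1, 10) conflicts on a and (2, 20) conflicts on b, so MySQL deletes
-- both rows and then inserts (1, 20)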
I see. Well, we could try to mimic their semantics, I suppose. Those
semantics seem like a POLA violation to me; who would have thought
that a REPLACE could delete multiple tuples? But what do I know?
The only way that's going to work is if you say "use this unique
index", which will look pretty gross in DML. That might actually be
okay with me if we had somewhere to go from there in a future release,
but I doubt that's the case. Another issue is that I'm not sure that
this helps Andres much (or rather, clients of the logical changeset
generation infrastructure that need to do conflict resolution), and
that matters a lot to me here.
Yeah, it's kind of awful.
Suppose we define the operation as REPLACE rather than INSERT...ON
DUPLICATE KEY LOCK FOR UPDATE. Then we could do something like this:
1. Try to insert a tuple. If no unique index conflicts occur, stop.
2. Note the identity of the conflicting tuple and mark the inserted
heap tuple dead.
3. If the conflicting tuple's inserting transaction is still in
progress, wait for the inserting transaction to end.
Sure, this is basically what the code does today (apart from marking a
just-inserted tuple dead).
4. If the conflicting tuple is dead (e.g. because the inserter
aborted), start over.
Start over from where? I presume you mean the index tuple insertion,
as things are today. Or do you mean the very start?
Yes, that's what I meant.
5. If the conflicting tuple's key columns no longer match the key
columns of the REPLACE operation, start over.
What definition of equality or inequality?
Binary equality, same as we'd use to decide whether an update can be done HOT.
7. Update the tuple, even though it may be invisible to our snapshot
(a deliberate MVCC violation!).
I realize that you just wanted to sketch a design, but offhand I think
that the basic problem with what you describe is that it isn't
accepting of the inevitability of there being a disconnect between
value and row locking. Also, this doesn't fit with any roadmap for
getting a real upsert,
Well, there are two separate issues here: what to do about MVCC, and
how to do the locking. From an MVCC perspective, I can think of only
two behaviors when the conflicting tuple is committed but invisible:
roll back, or update it despite it being invisible. If you're saying
you don't like either of those choices, I couldn't agree more, but I
don't have a third idea. If you do, I'm all ears.
In terms of how to do the locking, what I'm mostly saying is that we
could try to implement this in a way that invents as few new concepts
as possible. No promise tuples, no new SLRU, no new page-level bits,
just index tuples and heap tuples and so on. Ideally, we don't even
change the WAL format, although step 2 might require a new record
type. To the extent that what I actually described was at variance
with that goal, consider it a defect in my explanation rather than an
intent to vary. I think there's value in considering such an
implementation because each new thing that we have to introduce in
order to get this feature is a possible reason for it to be rejected -
for modularity reasons, or because it hurts performance elsewhere, or
because it's more code we have to maintain, or whatever.
Now, what I hear you saying is, gee, the performance of that might be
terrible. I'm not sure that I believe that, but it's possible that
you're right. Much seems to depend on what you think the frequency of
conflicts will be, and perhaps I'm assuming it will be low while
you're assuming a higher value. Regardless, if the performance of the
sort of implementation I'm talking about would be terrible (under some
agreed-upon definition of what terrible means in this context), then
that's a good argument for not doing it that way. I'm just not
convinced that's the case.
Basically, if there's a way we can do this without changing the
on-disk format (even in a backward-compatible way), I'd be strongly
inclined to go that route unless we have a really compelling reason to
believe it's going to suck (or be outright impossible).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
When you consider that the feature will frequently be used with the
assumption that updating is a much more likely outcome, it becomes
clear that we need to be careful about this sort of interplay.
I think one thing that's pretty clear at this point is that almost any
version of this feature could be optimized for either the insert case
or the update case. For example, my proposal could be modified to
search for a conflicting tuple first, potentially wasting an index
probe (or multiple index probes, if you want to search for potential
conflicts in multiple indexes) if we're inserting, but winning heavily
in the update case. As written, it's optimized for the insert case.
In fact, I don't know how to know which of these things we should
optimize for. I wrote part of the code for an EDB proprietary feature
that can do insert-or-update loads about 6 months ago[1], and we
optimized it for updates. That was not, however, a matter of
principle; it just turned out to be easier to implement that way. In
fact, I would have assumed that the insert-mostly case was more
likely, but I think the real answer is that some environments will be
insert-mostly and some will be update-mostly and some will be a mix.
If we really want to squeeze out every last drop of possible
performance, we might need two modes: one that assumes we'll mostly
insert, and another that assumes we'll mostly update. That seems a
frustrating amount of detail to have to expose to the user; an
implementation that was efficient in both cases would be very
desirable, but I do not have a good idea how to get there.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[1]: In case you're wondering, attempting to use that feature to upsert an invisible tuple will result in the load failing with a unique index violation.
On Thu, Sep 26, 2013 at 03:33:34PM -0400, Robert Haas wrote:
On Thu, Sep 26, 2013 at 3:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
When you consider that the feature will frequently be used with the
assumption that updating is a much more likely outcome, it becomes
clear that we need to be careful about this sort of interplay.
I think one thing that's pretty clear at this point is that almost any
version of this feature could be optimized for either the insert case
or the update case. For example, my proposal could be modified to
search for a conflicting tuple first, potentially wasting an index
probe (or multiple index probes, if you want to search for potential
conflicts in multiple indexes) if we're inserting, but winning heavily
in the update case. As written, it's optimized for the insert case.
I assumed the code was going to do the index lookups first without a
lock, and take the appropriate action, insert or update, with fallbacks
for guessing wrong.
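For example, roughly this (table and column names purely illustrative):
select 1 from tab where k = 1;
-- row found: update it
update tab set v = 'new' where k = 1;
-- no row found: insert it
insert into tab (k, v) values (1, 'new');
-- fallback: if the insert still fails with a unique violation because
-- another session inserted k = 1 in the meantime, retry as an update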
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, I think we can rule out value locks that are held for the
duration of a transaction right away. That's just not going to fly.
I think I agree with that. I don't think I remember hearing that proposed.
I think I might have been unclear - I mean locks that are held for the
duration of *another* transaction, not our own, as we wait for that
other transaction to commit/abort. I think that earlier remarks from
yourself and Andres implied that this would be necessary. Perhaps I'm
mistaken. Your most recent design proposal doesn't do this, but I
think that that's only because it restricts the user to a single
unique index - it would otherwise be necessary to sit on the earlier
value locks (index tuples belonging to an unfinished transaction)
pending the completion of some other conflicting transaction, which
has numerous disadvantages (as described in my "it suits my purposes
to have the value locks be held for only an instant" mail to you [1]).
If we're really lucky, maybe the value locking stuff can be
generalized or re-used as part of a btree index insertion buffer
feature.Well, that would be nifty.
Yes, it would. I think, based on a conversation with Rob Wultsch, that
it's another area where MySQL still does quite a bit better.
I see. Well, we could try to mimic their semantics, I suppose. Those
semantics seem like a POLA violation to me; who would have thought
that a REPLACE could delete multiple tuples? But what do I know?
I think that it's fairly widely acknowledged to not be very good.
Every MySQL user uses INSERT...ON DUPLICATE KEY UPDATE instead.
The only way that's going to work is if you say "use this unique
index", which will look pretty gross in DML.
Yeah, it's kind of awful.
It is.
What definition of equality or inequality?
Binary equality, same as we'd use to decide whether an update can be done HOT.
I guess that's acceptable in theory, because binary equality is
necessarily a *stricter* condition than equality according to some
operator that is an equivalence relation. But the fact remains that
you're just ameliorating the problem by making it happen less often
(both through this kind of trick, but also by restricting us to one
unique index), not actually fixing it.
Well, there are two separate issues here: what to do about MVCC, and
how to do the locking.
Totally agreed. Fortunately, unlike the different aspects of value and
row locking, I think that these two questions can be reasonably
considered independently.
From an MVCC perspective, I can think of only
two behaviors when the conflicting tuple is committed but invisible:
roll back, or update it despite it being invisible. If you're saying
you don't like either of those choices, I couldn't agree more, but I
don't have a third idea. If you do, I'm all ears.
I don't have another idea either. In fact, I'd go so far as to say
that doing any third thing that's better than those two to any
reasonable person is obviously impossible. But I'd add that we simply
cannot roll back at read committed, so we're just going to have to hold
our collective noses and do strange things with visibility.
FWIW, I'm tentatively looking at doing something like this:
*************** HeapTupleSatisfiesMVCC(HeapTuple htup, S
*** 958,963 ****
--- 959,975 ----
   * By here, the inserting transaction has committed - have to check
   * when...
   */
+
+ /*
+  * Not necessarily visible to snapshot under conventional MVCC rules, but
+  * still locked by our xact and not updated -- importantly, normal MVCC
+  * semantics apply when we update the row, so only one version will be
+  * visible at once
+  */
+ if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+     TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+     return true;
+
  if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
      return false;    /* treat as still in progress */
This is something that I haven't given remotely enough thought yet, so
please take it with a big grain of salt.
In terms of how to do the locking, what I'm mostly saying is that we
could try to implement this in a way that invents as few new concepts
as possible. No promise tuples, no new SLRU, no new page-level bits,
just index tuples and heap tuples and so on. Ideally, we don't even
change the WAL format, although step 2 might require a new record
type. To the extent that what I actually described was at variance
with that goal, consider it a defect in my explanation rather than an
intent to vary. I think there's value in considering such an
implementation because each new thing that we have to introduce in
order to get this feature is a possible reason for it to be rejected -
for modularity reasons, or because it hurts performance elsewhere, or
because it's more code we have to maintain, or whatever.
There is certainly value in considering that, and you're right to take
that tack - it is generally valuable to have a patch be minimally
invasive. However, ultimately that's just one aspect of any given
design, an aspect that needs to be weighed against others where there
is a tension. Obviously in this instance I believe, rightly or
wrongly, that doing more - adding more infrastructure than might be
considered strictly necessary - is the least worst thing. Also,
sometimes the apparent similarity of a design to what we have today is
illusory - certainly, I think you'd at least agree that the problems
that bloating during the interim between value locking and row locking
present are qualitatively different to other problems that bloat
presents in all existing scenarios.
FWIW I'm not doing things this way because I'm ambitious, and am
willing to risk not having my work accepted if that means I might get
something that performs better, or has more features (like not
requiring the user to specify a unique index in DML). Rather, I'm
doing things this way because I sincerely believe that on balance mine
is the best, most forward-thinking design proposed to date, and
therefore the design most likely to ultimately be accepted (even
though I do of course accept that there are numerous aspects that need
to be worked out still). If the whole design is ultimately not
accepted, that's something that I'll have to deal with, but understand
that I don't see any way to play it safe here (except, I suppose, to
give up now).
Now, what I hear you saying is, gee, the performance of that might be
terrible. I'm not sure that I believe that, but it's possible that
you're right.
I think that the average case will be okay, but not great. I think
that the worst case performance may well be unforgivably bad, and it's
a fairly plausible worst case. Even if someone disputes its
likelihood, and demonstrates that it isn't actually that likely, that
isn't necessarily very re-assuring - getting all the details right is
pretty subtle, especially compared to just not bloating, and just
deferring to the btree code whose responsibilities include enforcing
uniqueness.
Much seems to depend on what you think the frequency of
conflicts will be, and perhaps I'm assuming it will be low while
you're assuming a higher value. Regardless, if the performance of the
sort of implementation I'm talking about would be terrible (under some
agreed-upon definition of what terrible means in this context), then
that's a good argument for not doing it that way. I'm just not
convinced that's the case.
All fair points. Forgive me for repeating myself, but the word
"conflict" needs to be used carefully here, because there are two
basic ways of interpreting it - something that happens due to
concurrent xact activity around the same values, and something that
happens due to there already being some row there with a conflicting
value from some time ago (or that our xact inserted, even). Indeed,
the former *is* generally much less likely than the latter, so the
distinction is important. You could also further differentiate between
value level and row level conflicts, or at least I think that you
should, and that we should allow for value level conflicts.
Let me try and explain myself better, with reference to a concrete
example. Suppose we have a table with a primary key column, A, and a
unique constraint column, B, and we lock the pk value first and the
unique constraint value second. I'm assuming your design, but allowing
for multiple unique indexes because I don't think doing anything less
will be accepted - promise tuples have some of the same problems, as
well as some other idiosyncratic ones (see my earlier remarks on
recovery/freezing [2] for examples of those).
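For concreteness, picture a schema like this (hypothetical, invented
only for this example):
create table upsert_target (
    a serial primary key,  -- conflicts here should be rare
    b text unique          -- conflicts here are common in the usage of interest
);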
So there is a fairly high probability that the pk value on A will be
unique, and a fairly low probability that the unique constraint value
on B will be unique, at least in this usage pattern of interest, where
the user is mostly going to end up updating. Mostly, we insert a
speculative regular index tuple (that points to a speculative heap
tuple that we might decide to kill) into the pk column, A, right away,
and then maybe block pending the resolution of a conflicting
transaction on the unique constraint column B. I don't think we have
any reasonable way of not blocking on A - if we go clean it up for the
wait, that's going to bloat quite dramatically, *and* we have to WAL
log. In any case you seemed to accept that cleaning up bloat
synchronously like that was just going to be too expensive. So I
suppose that rules that out. That just leaves sitting on the "value
lock" (that the pk index tuple already inserted effectively is)
indefinitely, pending the outcome of the first transaction.
What are the consequences of sitting on that value lock indefinitely?
Well, xacts are going to block on the pk value much more frequently,
by simple virtue of the fact that the value locks there are held for a
long time - they just needed to hear a "no" answer, which the unique
constraint was in most cases happy to immediately give, so this is
totally unnecessary. Contention is now in a certain sense almost as
bad for every unique index as it is for the weakest link. That's only
where the problems begin, though, and it isn't necessary for there to
be bad contention on more than one unique index (the pk could just be
on a serial column, say) to see bad effects.
So your long-running xact that's blocking all the other sessions on
its proposed value for a (or maybe even b) - that finally gets to
proceed. Regardless of whether it commits or aborts, there will be a
big bloat race. This is because when the other sessions get the
go-ahead to proceed, they'll all run to get the row lock (one guy
might insert instead). Only one will be successful, but they'll all
kill their heap tuple on the assumption that they'll probably lock the
row, which is only true in the average case. Now, maybe you can teach
them to not bother killing the heap tuple when there are no index
tuples actually inserted to ruin things, but then maybe not, and maybe
it wouldn't help in this instance if you did teach them (because
there's a third, otherwise irrelevant constraint or whatever).
Realize you can generally only kill the heap tuple *before* you have
the row lock, because otherwise a totally innocent non-HOT update (may
not update any unique indexed columns at all) will deadlock with your
session, which I don't think is defensible, and will probably happen
often if allowed to (after all, this is upsert - users are going to
want to update their locked rows!).
So in this scenario, each of the original blockers will simultaneously
try again and again to get the row lock as one transaction proceeds
with locking and then probably commits. For every blocker's iteration
(there will be n_blockers - 1 iterations, with each iteration
resolving things for one blocker only), each blocker bloats. We're
talking about creating duplicates in unique indexes for each and every
iteration, for each and every blocker, and we all know duplicates in
btree indexes are, in a word, bad. I can imagine one or two
ridiculously bloated indexes in this scenario. It's even degenerative
in another direction - the more aggregate bloat we have, the longer
the jump from value to row locking takes, the more likely conflicts
are, the more likely bloat is.
Contrast this with my design, where re-ordering of would-be
conflicters across unique indexes (or serialization failures) can
totally nip this in the bud *if* the contention can be re-ordered
around, but if not, at least there is no need to worry about
aggregating bloat at all, because it creates no bloat.
Now, you're probably thinking "but I said I'll reverify the row for
conflicts across versions, and it'll be fine - there's generally no
need to iterate and bloat again provided no unique-indexed column
changed, even if that is more likely to occur due to the clean-up pre
row locking". Maybe I'm being unfair, but apart from requiring a
considerable amount of additional infrastructure of its own (a new
EvalPlanQual()-like thing that cares about binary equality in respect
of some columns only across row versions), I think that this is likely
to turn out to be subtly flawed in some way, simply because of the
modularity violation, so I haven't given you the benefit of the doubt
about your ability to frequently avoid repeatedly asking the index +
btree code what to do. For example, partial unique indexes - maybe
something that looked okay before because you simply didn't have cause
to insert into that unique index has to be considered in light of the
fact that it changed across row versions - are you going to stash that
knowledge too, and is it likely to affect someone who might otherwise
not have these issues really badly because we have to assume the worst
there? Do you want to do a value verification thing for that too, as
we do when deciding to insert into partial indexes in the first place?
Even if this new nothing-changed-across-versions infrastructure works,
will it work often enough in practice to be worth it -- have you ever
tracked the proportion of updates that were HOT updates in a
production DB? It isn't uncommon for it to not be great, and I think
that we can take that as a proxy for how well this will work. It could
be totally legitimate for the UPDATE portion to alter a unique indexed
column all the time.
Basically, if there's a way we can do this without changing the
on-disk format (even in a backward-compatible way), I'd be strongly
inclined to go that route unless we have a really compelling reason to
believe it's going to suck (or be outright impossible).
I don't believe that anything that I have proposed needs to break our
on-disk format - I hadn't considered what the implications might be in
this area for other proposals, but it's possible that that's an
additional advantage of doing value locking all in-memory.
[1]: /messages/by-id/CAM3SWZRV0F-DjgpXu-WxGoG9eEcLawNrEiO5+3UKRp2e5s=TSg@mail.gmail.com
[2]: /messages/by-id/CAM3SWZQUUuYYcGksVytmcGqACVMkf1ui1uvfJekM15YkWZpzhw@mail.gmail.com
--
Peter Geoghegan
On Thu, Sep 26, 2013 at 12:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think one thing that's pretty clear at this point is that almost any
version of this feature could be optimized for either the insert case
or the update case. For example, my proposal could be modified to
search for a conflicting tuple first, potentially wasting an index
probe (or multiple index probes, if you want to search for potential
conflicts in multiple indexes) if we're inserting, but winning heavily
in the update case.
I don't think that's really the case.
In what sense could my design really be said to prioritize either the
INSERT or the UPDATE case? I'm pretty sure that it's still necessary
to get all the value locks per unique index needed up until the first
one with a conflict even if you know that you're going to UPDATE for
*some* reason, in order for things to be well defined (which is
important, because there might be more than one conflict, and which
one is locked matters - maybe we could add DDL to let unique indexes
have a checking priority or something like that).
The only appreciable downside of my design for updates that I can
think of is that there has to be another index scan, to find the
locked-for-update row to update. However, that's probably worth it,
since it is at least relatively rare, and allows the user the
flexibility of using a more complex UPDATE predicate than "apply to
conflicter", which is something that the MySQL syntax effectively
limits users to.
--
Peter Geoghegan
On Thu, Sep 26, 2013 at 11:58 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, I think we can rule out value locks that are held for the
duration of a transaction right away. That's just not going to fly.
I think I agree with that. I don't think I remember hearing that proposed.
I think I might have been unclear - I mean locks that are held for the
duration of *another* transaction, not our own, as we wait for that
other transaction to commit/abort. I think that earlier remarks from
yourself and Andres implied that this would be necessary. Perhaps I'm
mistaken. Your most recent design proposal doesn't do this, but I
think that that's only because it restricts the user to a single
unique index - it would otherwise be necessary to sit on the earlier
value locks (index tuples belonging to an unfinished transaction)
pending the completion of some other conflicting transaction, which
has numerous disadvantages (as described in my "it suits my purposes
to have the value locks be held for only an instant" mail to you [1]).
OK, now I understand what you are saying. I don't think I agree with it.
I don't have another idea either. In fact, I'd go so far as to say
that doing any third thing that's better than those two to any
reasonable person is obviously impossible. But I'd add that we simply
cannot roll back at read committed, so we're just going to have to hold
our collective noses and do strange things with visibility.
I don't accept that as a general principle. We're writing the code;
we can make it behave any way we think best.
This is something that I haven't given remotely enough thought yet, so
please take it with a big grain of salt.
I doubt that any change to HeapTupleSatisfiesMVCC() will be
acceptable. This feature needs to restrain itself to behavior changes
that only affect users of this feature, I think.
There is certainly value in considering that, and you're right to take
that tack - it is generally valuable to have a patch be minimally
invasive. However, ultimately that's just one aspect of any given
design, an aspect that needs to be weighed against others where there
is a tension. Obviously in this instance I believe, rightly or
wrongly, that doing more - adding more infrastructure than might be
considered strictly necessary - is the least worst thing. Also,
sometimes the apparent similarity of a design to what we have today is
illusory - certainly, I think you'd at least agree that the problems
that bloating during the interim between value locking and row locking
present are qualitatively different to other problems that bloat
presents in all existing scenarios.
TBH, no, I don't think I agree with that. See further below.
Let me try and explain myself better, with reference to a concrete
example. Suppose we have a table with a primary key column, A, and a
unique constraint column, B, and we lock the pk value first and the
unique constraint value second. I'm assuming your design, but allowing
for multiple unique indexes because I don't think doing anything less
will be accepted - promise tuples have some of the same problems, as
well as some other idiosyncratic ones (see my earlier remarks on
recovery/freezing [2] for examples of those).
OK, so far I'm right with you.
So there is a fairly high probability that the pk value on A will be
unique, and a fairly low probability that the unique constraint value
on B will be unique, at least in this usage pattern of interest, where
the user is mostly going to end up updating. Mostly, we insert a
speculative regular index tuple (that points to a speculative heap
tuple that we might decide to kill) into the pk column, A, right away,
and then maybe block pending the resolution of a conflicting
transaction on the unique constraint column B. I don't think we have
any reasonable way of not blocking on A - if we go clean it up for the
wait, that's going to bloat quite dramatically, *and* we have to WAL
log. In any case you seemed to accept that cleaning up bloat
synchronously like that was just going to be too expensive. So I
suppose that rules that out. That just leaves sitting on the "value
lock" (that the pk index tuple already inserted effectively is)
indefinitely, pending the outcome of the first transaction.
Agreed.
What are the consequences of sitting on that value lock indefinitely?
Well, xacts are going to block on the pk value much more frequently,
by simple virtue of the fact that the value locks there are held for a
long time - they just needed to hear a "no" answer, which the unique
constraint was in most cases happy to immediately give, so this is
totally unnecessary. Contention is now in a certain sense almost as
bad for every unique index as it is for the weakest link. That's only
where the problems begin, though, and it isn't necessary for there to
be bad contention on more than one unique index (the pk could just be
on a serial column, say) to see bad effects.
Here's where I start to lose faith. It's unclear to me what those
other transactions are doing. If they're trying to insert a record
that conflicts with the primary key of the tuple we're inserting,
they're probably doomed, but not necessarily; we might roll back. If
they're also upserting, it's absolutely essential that they wait until
we get done before deciding what to do.
So your long-running xact that's blocking all the other sessions on
its proposed value for a (or maybe even b) - that finally gets to
proceed. Regardless of whether it commits or aborts, there will be a
big bloat race. This is because when the other sessions get the
go-ahead to proceed, they'll all run to get the row lock (one guy
might insert instead). Only one will be successful, but they'll all
kill their heap tuple on the assumption that they'll probably lock the
row, which is only true in the average case. Now, maybe you can teach
them to not bother killing the heap tuple when there are no index
tuples actually inserted to ruin things, but then maybe not, and maybe
it wouldn't help in this instance if you did teach them (because
there's a third, otherwise irrelevant constraint or whatever).
Supposing they are all upserters, it seems to me that what will
probably happen is that one of them will lock the row and update it,
and then commit. Then the next one will lock the row and update it,
and then commit. And so on. It's probably important to avoid having
them keep recreating speculative tuples and then killing them as long
as a candidate tuple is available, so that they don't create a dead
tuple per iteration. But that seems doable.
Realize you can generally only kill the heap tuple *before* you have
the row lock, because otherwise a totally innocent non-HOT update (may
not update any unique indexed columns at all) will deadlock with your
session, which I don't think is defensible, and will probably happen
often if allowed to (after all, this is upsert - users are going to
want to update their locked rows!).
I must be obtuse; I don't see why that would deadlock.
A bigger problem that I've just realized, though, is that once
somebody else has blocked on a unique index insertion, they'll be
stuck there until end of transaction even if we kill the tuple,
because they're waiting on the xid, not the index itself. That might
be fatal to my proposed design, or at least require the use of some
more clever locking regimen.
Contrast this with my design, where re-ordering of would-be
conflicters across unique indexes (or serialization failures) can
totally nip this in the bud *if* the contention can be re-ordered
around, but if not, at least there is no need to worry about
aggregating bloat at all, because it creates no bloat.
Yeah, possibly.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't have another idea either. In fact, I'd go so far as to say
that doing any third thing that's better than those two to any
reasonable person is obviously impossible. But I'd add that we simply
cannot roll back at read committed, so we're just going to have to hold
our collective noses and do strange things with visibility.
I don't accept that as a general principle. We're writing the code;
we can make it behave any way we think best.
I presume you're referring to the principle that we cannot throw
serialization failures at read committed. I'd suggest that letting
that happen would upset a lot of people, because it's so totally
unprecedented. A large segment of our user base would just consider
that to be Postgres randomly throwing errors, and would be totally
dismissive of the need to do so, and not without some justification -
no one else does the same. The reality is that the majority of our
users don't even know what an isolation level is. I'm not just talking
about people that use Postgres more casually, such as Heroku
customers. I've personally talked to people who didn't even know what
a transaction isolation level was, that were in a position where they
really, really should have known.
I doubt that any change to HeapTupleSatisfiesMVCC() will be
acceptable. This feature needs to restrain itself to behavior changes
that only affect users of this feature, I think.
I agree with the principle of what you're saying, but I'm not aware
that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
changes for those not using the feature. Certainly, the standard
regression tests and isolation tests still pass, for what it's worth.
Having said that, I have not thought about it enough to be willing to
actually defend that bit of code. Though I must admit that I am a
little encouraged by the fact that it passes casual inspection.
I am starting to wonder if it's really necessary to have a "blessed"
update that can see the locked, not-otherwise-visible tuple. Doing
that certainly has its disadvantages, both in terms of code complexity
and in terms of being arbitrarily restrictive. We're going to have to
allow the user to see the locked row after it's updated (the new row
version that we create will naturally be visible to its creating xact)
- is it really any worse that the user can see it before an update (or
a delete)? The user could decide to effectively make the update change
nothing, and see the same thing anyway.
I get why you're averse to doing odd things to visibility - I was too.
I just don't see that we have a choice if we want this feature to work
acceptably with read committed. In addition, as it happens I just
don't see that the general situation is made any worse by the fact
that the user might be able to see the row before an update/delete.
Isn't it also weird to update or delete something you cannot see?
Couldn't EvalPlanQual() be said to be an MVCC violation on similar
grounds? It also "reaches into the future". Locking a row isn't really
that distinct from updating it in terms of the code footprint, but
also from a logical perspective.
It's probably important to avoid having
them keep recreating speculative tuples and then killing them as long
as a candidate tuple is available, so that they don't create a dead
tuple per iteration. But that seems doable.
I'm not so sure.
Realize you can generally only kill the heap tuple *before* you have
the row lock, because otherwise a totally innocent non-HOT update (may
not update any unique indexed columns at all) will deadlock with your
session, which I don't think is defensible, and will probably happen
often if allowed to (after all, this is upsert - users are going to
want to update their locked rows!).
I must be obtuse; I don't see why that would deadlock.
If you don't see it, then you aren't being obtuse in asking for
clarification. It's really easy to be wrong about this kind of thing.
If the non-HOT update updates some random row, changing the key
columns, it will lock that random row version. It will then proceed
with "value locking" (i.e. inserting index tuples in the usual way, in
this case with entirely new values). It might then block on one of the
index tuples we, the upserter, have already inserted (these are our
"value locks" under your scheme). Meanwhile, we (the upserter) might
have *already* concluded that the *old* heap row that the regular
updater is in the process of rendering invisible is to blame in
respect of some other value in some later unique index, and that *it*
must be locked. Deadlock. This seems very possible if the key values
are somewhat correlated, which is probably generally quite common.
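To make the interleaving concrete, here is an invented timeline
(unique indexes on columns a and b, one existing row (a = 1, b = 2);
the exact values don't matter):
upserter: proposes (a = 2, b = 2); inserts its index tuple ("value
          lock") for a = 2 into the index on a -- no conflict there.
updater:  update t set a = 2 where a = 1;
          -- locks the heap row (1, 2), then goes to insert an index
          -- entry for its new value a = 2, and blocks on the
          -- upserter's already-inserted entry.
upserter: finds that b = 2 conflicts with the old row (1, 2) and tries
          to lock that row -- but the updater already holds the row
          lock. Deadlock.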
The important observation here is that an updater, in effect, locks
both the old and new sets of values (for non-HOT updates). And as I've
already noted, any practical "value locking" implementation isn't
going to be able to prevent the update from immediately locking the
old, because that doesn't touch an index. Hence, there is an
irresolvable disconnect between value and row locking.
Are we comfortable with this? Before you answer, consider that there
were lots of bugs (their words) in the MySQL implementation of this
same basic idea surrounding excessive deadlocking - I heard through
the grapevine that they fixed a number of bugs along these lines, and
that their implementation has historically had lots of deadlocking
problems.
I think that the way to deal with weird, unprincipled deadlocking is
to simply not hold value locks at the same time as row locks - it is
my contention that the lock starvation hazards that avoiding being
smarter about this may present aren't actually an issue, unless you
have some kind of highly implausible perfect storm of read-committed
aborters inserting around the same values - only one of those needs to
commit to remedy the situation - the first "no" answer is all we need
to give up.
To repeat myself, that's really the essential nature of my design: it
is accepting of the inevitability of there being a disconnect between
value and row locking. Value locks that are implemented in a sane way
can do very little; they can only prevent a conflicting insertion from
*finishing*, and not from causing a conflict for row locking.
A bigger problem that I've just realized, though, is that once
somebody else has blocked on a unique index insertion, they'll be
stuck there until end of transaction even if we kill the tuple,
because they're waiting on the xid, not the index itself. That might
be fatal to my proposed design, or at least require the use of some
more clever locking regimen.
Well, it's really fatal to your proposed design *because* it implies
that others will be blocked on earlier value locks, which is what I
was trying to say (in saying this, I'm continuing to hold your design
to the same standard as my own, which is that it must work across
multiple unique indexes - I believe that you yourself accept this
standard based on your remarks here).
For the benefit of others who may not get what we're talking about: in
my patch, that isn't a problem, because when we block on acquiring an
xid ShareLock pending value conflict resolution, that means that the
other guy actually did insert (and did not merely think about it), and
so with that design it's entirely appropriate that we wait for his
xact to end.
Contrast this with my design, where re-ordering of would-be
conflicters across unique indexes (or serialization failures) can
totally nip this in the bud *if* the contention can be re-ordered
around, but if not, at least there is no need to worry about
aggregating bloat at all, because it creates no bloat.
Yeah, possibly.
I think that re-ordering is an important property of any design where
we cannot bail out with serialization failures. I know it seems weird,
because it seems like an MVCC violation to have our behavior altered
as a result of a transaction that committed that isn't even visible to
us. As I think you appreciate, on a certain level that's just the
nature of the beast. This might sound stupid, but: you can say the
same thing about unique constraint violations! I do not believe that
this introduces any anomalies that read committed doesn't already
permit according to the standard.
--
Peter Geoghegan
On Tue, Sep 24, 2013 at 2:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Various messages are discussing semantics around visibility. I by now
have a hard time keeping track. So let's keep the discussion of the
desired semantics to this thread.
Yes, it's pretty complicated.
I meant to comment on this here, but ended up saying some stuff to
Robert about this in the main thread, so I should probably direct you
to that. You were probably right to start a new thread, because I
think we can usefully discuss this topic in parallel, but that's just
what ended up happening.
--
Peter Geoghegan
On Fri, Sep 27, 2013 at 8:36 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't have another idea either. In fact, I'd go so far as to say
that doing any third thing that's better than those two to any
reasonable person is obviously impossible. But I'd add that we simply
cannot roll back at read committed, so we're just going to have to hold
our collective noses and do strange things with visibility.
I don't accept that as a general principle. We're writing the code;
we can make it behave any way we think best.
I presume you're referring to the principle that we cannot throw
serialization failures at read committed. I'd suggest that letting
that happen would upset a lot of people, because it's so totally
unprecedented. A large segment of our user base would just consider
that to be Postgres randomly throwing errors, and would be totally
dismissive of the need to do so, and not without some justification -
no one else does the same. The reality is that the majority of our
users don't even know what an isolation level is. I'm not just talking
about people that use Postgres more casually, such as Heroku
customers. I've personally talked to people who didn't even know what
a transaction isolation level was, that were in a position where they
really, really should have known.
Yes, it might not be a good idea. But I'm just saying, we get to decide.
I doubt that any change to HeapTupleSatisfiesMVCC() will be
acceptable. This feature needs to restrain itself to behavior changes
that only affect users of this feature, I think.
I agree with the principle of what you're saying, but I'm not aware
that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
changes for those not using the feature. Certainly, the standard
regression tests and isolation tests still pass, for what it's worth.
Having said that, I have not thought about it enough to be willing to
actually defend that bit of code. Though I must admit that I am a
little encouraged by the fact that it passes casual inspection.
Well, at a minimum, it's a performance worry. Those functions are
*hot*. Single branches do matter there.
I am starting to wonder if it's really necessary to have a "blessed"
update that can see the locked, not-otherwise-visible tuple. Doing
that certainly has its disadvantages, both in terms of code complexity
and in terms of being arbitrarily restrictive. We're going to have to
allow the user to see the locked row after it's updated (the new row
version that we create will naturally be visible to its creating xact)
- is it really any worse that the user can see it before an update (or
a delete)? The user could decide to effectively make the update change
nothing, and see the same thing anyway.
If we're not going to just error out over the invisible tuple the user
needs some way to interact with it. The details are negotiable.
I get why you're averse to doing odd things to visibility - I was too.
I just don't see that we have a choice if we want this feature to work
acceptably with read committed. In addition, as it happens I just
don't see that the general situation is made any worse by the fact
that the user might be able to see the row before an update/delete.
Isn't it also weird to update or delete something you cannot see?
Couldn't EvalPlanQual() be said to be an MVCC violation on similar
grounds? It also "reaches into the future". Locking a row isn't really
that distinct from updating it in terms of the code footprint, but
also from a logical perspective.
Yes, EvalPlanQual() is definitely an MVCC violation.
Realize you can generally only kill the heap tuple *before* you have
the row lock, because otherwise a totally innocent non-HOT update (may
not update any unique indexed columns at all) will deadlock with your
session, which I don't think is defensible, and will probably happen
often if allowed to (after all, this is upsert - users are going to
want to update their locked rows!).
I must be obtuse; I don't see why that would deadlock.
If you don't see it, then you aren't being obtuse in asking for
clarification. It's really easy to be wrong about this kind of thing.
If the non-HOT update updates some random row, changing the key
columns, it will lock that random row version. It will then proceed
with "value locking" (i.e. inserting index tuples in the usual way, in
this case with entirely new values). It might then block on one of the
index tuples we, the upserter, have already inserted (these are our
"value locks" under your scheme). Meanwhile, we (the upserter) might
have *already* concluded that the *old* heap row that the regular
updater is in the process of rendering invisible is to blame in
respect of some other value in some later unique index, and that *it*
must be locked. Deadlock. This seems very possible if the key values
are somewhat correlated, which is probably generally quite common.
OK, I see.
The important observation here is that an updater, in effect, locks
both the old and new sets of values (for non-HOT updates). And as I've
already noted, any practical "value locking" implementation isn't
going to be able to prevent the update from immediately locking the
old, because that doesn't touch an index. Hence, there is an
irresolvable disconnect between value and row locking.
This part I don't follow. "locking the old"? What irresolvable
disconnect? I mean, they're different things; I get *that*.
Are we comfortable with this? Before you answer, consider that there
were lots of bugs (their words) in the MySQL implementation of this
same basic idea surrounding excessive deadlocking - I heard through
the grapevine that they fixed a number of bugs along these lines, and
that their implementation has historically had lots of deadlocking
problems.
I think that the way to deal with weird, unprincipled deadlocking is
to simply not hold value locks at the same time as row locks - it is
my contention that the lock starvation hazards that avoiding being
smarter about this may present aren't actually an issue, unless you
have some kind of highly implausible perfect storm of read-committed
aborters inserting around the same values - only one of those needs to
commit to remedy the situation - the first "no" answer is all we need
to give up.
OK, I take your point, I think. The existing system already acquires
value locks when a tuple lock is held, during an UPDATE, and we can't
change that.
Contrast this with my design, where re-ordering of would-be
conflicters across unique indexes (or serialization failures) can
totally nip this in the bud *if* the contention can be re-ordered
around, but if not, at least there is no need to worry about
aggregating bloat at all, because it creates no bloat.
Yeah, possibly.
I think that re-ordering is an important property of any design where
we cannot bail out with serialization failures. I know it seems weird,
because it seems like an MVCC violation to have our behavior altered
as a result of a transaction that committed that isn't even visible to
us. As I think you appreciate, on a certain level that's just the
nature of the beast. This might sound stupid, but: you can say the
same thing about unique constraint violations! I do not believe that
this introduces any anomalies that read committed doesn't already
permit according to the standard.
I worry about the behavior being confusing and hard to understand in
the presence of multiple unique indexes and reordering. Perhaps I
simply don't understand the problem domain well enough yet.
From a user perspective, I would really think people would want to
specify a set of key columns and then update if a match is found on
those key columns. Suppose there's a unique index on (a, b) and
another on (c), and the user passes in (a,b,c)=(1,1,1). It's hard for
me to imagine that the user will be happy to update either (1,1,2) or
(2,2,1), whichever exists. In what situation would that be the
desired behavior?
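Spelled out (hypothetical table, invented for the example):
create table t (a int, b int, c int, unique (a, b), unique (c));
insert into t values (1, 1, 2);  -- or, in the other scenario, (2, 2, 1)
-- An upsert of (a, b, c) = (1, 1, 1) conflicts on (a, b) in the first
-- case and on (c) in the second; silently updating whichever row
-- happens to conflict is unlikely to be what the user had in mind.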
Also, under such a programming model, if somebody drops a unique index
or adds a new one, the behavior of someone's application can
completely change. I have a hard time swallowing that. It's an
established precedent that dropping a unique index can make some other
operation fail (e.g. ADD FOREIGN KEY, and more recently CREATE VIEW ..
GROUP BY), and of course it can cause performance or plan changes.
But overturning the semantics is, I think, something new, and it
doesn't feel like a good direction.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 30, 2013 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I doubt that any change to HeapTupleSatisfiesMVCC() will be
acceptable. This feature needs to restrain itself to behavior changes
that only affect users of this feature, I think.
I agree with the principle of what you're saying, but I'm not aware
that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
changes for those not using the feature.
Well, at a minimum, it's a performance worry. Those functions are
*hot*. Single branches do matter there.
Well, that certainly is a reasonable concern. Offhand, I suspect that
branch prediction helps immensely. But even if it doesn't, couldn't it
be the case that returning earlier there actually helps? Where we have
a real xid (so TransactionIdIsCurrentTransactionId() must do more than
a single test of a scalar variable), and the row is locked *only*
(which is already very cheap to check - it's another scalar variable
that we already test in a few other places in that function), isn't
there on average a high chance that the tuple ought to be visible to
our snapshot anyway?
I am starting to wonder if it's really necessary to have a "blessed"
update that can see the locked, not-otherwise-visible tuple.
If we're not going to just error out over the invisible tuple the user
needs some way to interact with it. The details are negotiable.
I think that we will error out over an invisible tuple with higher
isolation levels. Certainly, what we do there today instead of
EvalPlanQual() looping is consistent with that behavior.
Couldn't EvalPlanQual() be said to be an MVCC violation on similar
grounds? It also "reaches into the future". Locking a row isn't really
that distinct from updating it in terms of the code footprint, but
also from a logical perspective.
Yes, EvalPlanQual() is definitely an MVCC violation.
So I think that you can at least see why I'd consider that the two (my
tweaks to HeapTupleSatisfiesMVCC() and EvalPlanQual()) are isomorphic.
It just becomes the job of this new locking infrastructure to worry
about the would-be invisibility of the locked tuple, and raise a
serialization error accordingly at higher isolation levels.
The important observation here is that an updater, in effect, locks
both the old and new sets of values (for non-HOT updates). And as I've
already noted, any practical "value locking" implementation isn't
going to be able to prevent the update from immediately locking the
old, because that doesn't touch an index. Hence, there is an
irresolvable disconnect between value and row locking.
This part I don't follow. "locking the old"? What irresolvable
disconnect? I mean, they're different things; I get *that*.
Well, if you update a row, the old row version's values are locked, in
the sense that any upserter interested in inserting the same values as
the old version is going to have to block pending the outcome of the
updating xact.
The disconnect is that any attempt at a clever dance, to interplay
value and row locking such that this definitely just works first time
seems totally futile - I'm emphasizing this because it's the obvious
way to approach this basic problem. It turns out that it could only be
done at great expense, in a way that would immediately be dismissed as
totally outlandish.
OK, I take your point, I think. The existing system already acquires
value locks when a tuple lock is held, during an UPDATE, and we can't
change that.
Right.
I think that re-ordering is an important property of any design where
we cannot bail out with serialization failures.
I worry about the behavior being confusing and hard to understand in
the presence of multiple unique indexes and reordering. Perhaps I
simply don't understand the problem domain well enough yet.
It's only confusing if you are worried about what concurrent sessions
do with respect to each other at this low level. In which case, just
use a higher isolation level and pay the price. I'm not introducing
any additional anomalies described and prohibited by the standard by
doing this, and indeed the order of retrying in the event of a
conflict today is totally undefined, so this line of thinking is not
inconsistent with how things work today. Today, strictly speaking some
unique constraint violations might be more appropriate as
serialization failures. So with this new functionality, when used,
they're going to be actual serialization failures where that's
appropriate, where we'd otherwise go do something else other than
error. Why burden read committed like that? (Actually, fwiw I suspect
that currently the SSI guarantees *can* be violated with unique retry
re-ordering, but that's a whole other story, and is pretty subtle).
Let me come right out and say it: Yes, part of the reason that I'm
taking this line is because it's convenient to my implementation from
a number of different perspectives. But one of those perspectives is
that it will help performance in the face of contention immensely,
without violating any actual precept held today (by us or by the
standard or by anyone else AFAIK), and besides, my basic design is
informed by sincerely-held beliefs about what will actually work
within the constraints presented.
From a user perspective, I would really think people would want to
specify a set of key columns and then update if a match is found on
those key columns. Suppose there's a unique index on (a, b) and
another on (c), and the user passes in (a,b,c)=(1,1,1). It's hard for
me to imagine that the user will be happy to update either (1,1,2) or
(2,2,1), whichever exists. In what situation would that be the
desired behavior?
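For concreteness, that scenario might look something like this
(illustrative only; the table and index names are made up):

create table tab(a int4, b int4, c int4);
create unique index tab_a_b_idx on tab(a, b);
create unique index tab_c_idx on tab(c);

With an existing row of (1,1,2) or (2,2,1), inserting (1,1,1)
conflicts on (a, b) in the first case and on (c) in the second, so
either row could end up being the one that is locked.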
You're right - that isn't desirable. The reason that we go to all this
trouble with locking multiple values concurrently boils down to
preventing the user from having to specify a constraint name - it's
usually really obvious *to users* that understand their schema, so why
bother them with that esoteric detail? The user *is* more or less
required to have a particular constraint in mind when writing their
DML (for upsert). It could be that that constraint has a 1:1
correlation with another constraint in practice, which would also work
out fine - they'd specify one or the other constrained column (maybe
both) in the subsequent update's predicate. But generally, yes,
they're out of luck here, until we get around to implementing MERGE in
its full generality, which I think what I've proposed is a logical
stepping stone towards (again, because it involves locking values
across unique indexes simultaneously).
Now, at least what I've proposed has the advantage of allowing the
user to add some controls in their update's predicate. So if they only
had updating (1,1,2) in mind, they could put WHERE a = 1 AND b = 1 in
there too (I'm imagining the wCTE pattern is used). They'd then be
able to inspect the value of the FOUND pseudo-variable or whatever.
Now, I'm assuming that we'd somehow be able to tell that the insert
hasn't succeeded (i.e. it locked), and maybe that doesn't accord very
well with these kinds of facilities as they exist today, but it
doesn't seem like too much extra work (MySQL would consider that both
the locked and updated rows were affected, which might help us here).
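To sketch that with a hypothetical table tab that has unique indexes
on (a, b) and on (c) (illustrative only, not tested):

with r as (
insert into tab(a, b, c)
values (1, 1, 1)
on duplicate key lock for update
returning rejects *
)
update tab set c = r.c from r
where tab.a = r.a and tab.b = r.b;

Joining only on (a, b) is how the statement expresses that the
conflict is expected to come from (a, b); if it actually came from the
index on (c), the conflicting row won't be matched by the update's
predicate, and the application can notice that nothing was updated.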
MySQL's INSERT...ON DUPLICATE KEY UPDATE has nothing like this - there
is no guidance as to why you went to update, and you cannot have a
separate update qual. Users better just get it right!
Maybe what's really needed here is INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE RETURNING LOCKED... . You can see what was actually locked, and
act on *that* as appropriate. Though you don't get to see the actual
value of default expressions and so on, which is a notable
disadvantage over RETURNING REJECTS... .
The advantage of RETURNING LOCKED would be you could check if it
LOCKED for the reason you thought it should have. If it didn't, then
surely what you'd prefer would be a unique constraint violation, so
you can just go throw an error in application code (or maybe consider
another value for the columns that surprised you).
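That might look something like this (entirely speculative syntax, just
to show the shape):

with r as (
insert into foo(a, b)
values (5, '!')
on duplicate key lock for update
returning locked *
)
select * from r;

The application could then check whether the row that got locked is
the one it expected before going on to update or delete it.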
What do others think?
Also, under such a programming model, if somebody drops a unique index
or adds a new one, the behavior of someone's application can
completely change. I have a hard time swallowing that. It's an
established precedent that dropping a unique index can make some other
operation fail (e.g. ADD FOREIGN KEY, and more recently CREATE VIEW ..
GROUP BY), and of course it can cause performance or plan changes.
But overturning the semantics is, I think, something new, and it
doesn't feel like a good direction.
In what sense is causing, or preventing an error (the current state of
affairs) not a behavioral change? I'd have thought it a very
significant one. If what you're saying here is true, wouldn't that
mandate that we specify the name of a unique index inline, within DML?
I thought we were in agreement that that wasn't desirable.
If you think it's a bit odd that we lock every value while the user
essentially has one constraint in mind when writing their DML,
consider:
1) We need this for MERGE anyway.
2) Don't underestimate the intellectual overhead for developers and
operations personnel of adding an application-defined significance to
unique indexes that they don't otherwise have. It sure would suck if a
refactoring effort to normalize unique index naming style had the
effect of breaking a slew of application code. Certainly, everyone
else seems to have reached the same conclusion in their own
implementation of upsert, because they don't require that a unique
index be specified, even when that could have unexpected results.
3) The problems that getting the details wrong present can be
ameliorated by developers who feel it might be a problem for them, as
already described. I think in the vast majority of cases it just
obviously won't be a problem to begin with.
--
Peter Geoghegan
On Mon, Sep 30, 2013 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
If you think it's a bit odd that we lock every value while the user
essentially has one constraint in mind when writing their DML,
consider:
I should add to that list:
4) Locking all the values at once is necessary for the behavior of the
locking to be well-defined -- I feel we need to know that some exact
tuple is to blame (according to our well defined ordering for checking
unique indexes for conflicts) for at least one instant in time.
Given that we need to be the first to change the row, before anything
else has altered it, this ought to be sufficient. If you think it's
bad that some other session can come in and insert a tuple that would
have caused us to decide differently (before *our* transaction commits
but *after* we've inserted), now you're into blaming the *wrong* tuple
in the future, and I can't get excited about that - we always prefer a
tuple normally visible to our snapshot, but if forced to (if there is
none) we just throw a serialization failure (where appropriate). So
for read committed you can have no *principled* beef with this, but
for serializable you're going to naturally prefer the
currently-visible tuple generally (that's the only correct behavior
there that won't error - there *better* be something visible).
Besides, the way the user tacitly has to use the feature with one
particular constraint in mind kind of implies that this cannot
happen...
--
Peter Geoghegan
On Mon, Sep 30, 2013 at 9:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Mon, Sep 30, 2013 at 3:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
If you think it's a bit odd that we lock every value while the user
essentially has one constraint in mind when writing their DML,
consider:
I should add to that list:
4) Locking all the values at once is necessary for the behavior of the
locking to be well-defined -- I feel we need to know that some exact
tuple is to blame (according to our well defined ordering for checking
unique indexes for conflicts) for at least one instant in time.
Given that we need to be the first to change the row, before anything
else has altered it, this ought to be sufficient. If you think it's
bad that some other session can come in and insert a tuple that would
have caused us to decide differently (before *our* transaction commits
but *after* we've inserted), now you're into blaming the *wrong* tuple
in the future, and I can't get excited about that - we always prefer a
tuple normally visible to our snapshot, but if forced to (if there is
none) we just throw a serialization failure (where appropriate). So
for read committed you can have no *principled* beef with this, but
for serializable you're going to naturally prefer the
currently-visible tuple generally (that's the only correct behavior
there that won't error - there *better* be something visible).
Besides, the way the user tacitly has to use the feature with one
particular constraint in mind kind of implies that this cannot
happen...
This patch is still marked as "Needs Review" in the CommitFest
application. There's no reviewer, but in fact Andres and I both spent
quite a lot of time providing design feedback (probably more than I
spent on any other CommitFest patch). I think it's clear that the
patch as submitted is not committable, so as far as the CommitFest
goes I'm going to mark it Returned with Feedback.
I think there are still some design considerations to work out here,
but honestly I'm not totally sure what the remaining points of
disagreement are. It would be nice to hear the opinions of a few more
people on the concurrency issues, but beyond that I think that a lot
of this is going to boil down to whether the details of the value
locking can be made to seem palatable enough and sufficiently
low-overhead in the common case. I don't believe we can comment on
that in the abstract.
There's still some question in my mind as to what the semantics ought
to be. I do understand Peter's point that having to specify a
particular index would be grotty, but I'm not sure it invalidates my
point that having to work across multiple indexes could lead to
surprising results in some scenarios. I'm not going to stand here and
hold my breath, though: if that's the only thing that makes me nervous
about the final patch, I'll not object to it on that basis.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 9, 2013 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
This patch is still marked as "Needs Review" in the CommitFest
application. There's no reviewer, but in fact Andres and I both spent
quite a lot of time providing design feedback (probably more than I
spent on any other CommitFest patch).
Right, thank you both.
I think there are still some design considerations to work out here,
but honestly I'm not totally sure what the remaining points of
disagreement are. It would be nice to hear the opinions of a few more
people on the concurrency issues, but beyond that I think that a lot
of this is going to boil down to whether the details of the value
locking can be made to seem palatable enough and sufficiently
low-overhead in the common case. I don't believe we can comment on
that in the abstract.
I agree that we cannot comment on it in the abstract. I am optimistic
that we can make the value locking work better without regressing the
common cases (especially if we're only concerned about not regressing
users that never use the feature, as opposed to having some
expectation for regular inserters inserting values into the same
ranges as an upserter). That's not my immediate concern, though - my
immediate concern is getting the concurrency and visibility issues
scrutinized.
What would it take to get the patch into a committable state if the
value locking had essentially the same properties (they were held
instantaneously), but were perfect? There is no point in giving the
value locking implementation too much further consideration unless
that question can be answered. In the past I've said that row locking
and value locking cannot be considered separately, but that was when
it was generally assumed that value locks had to persist for a long
time in a way that I don't think is feasible (and I think Robert would
now agree that it's at the very least very hard). Persisting value
locks basically make not regressing the general case hard, when you
think about the implementation. As Robert remarked, regular btree
index insertion blocks on an xid, not a value, and cannot be easily
made to appreciate that the "value lock" that a would-be duplicate
index tuple represents may just be held for a short time, and not the
entire duration of their inserter's transaction.
There's still some question in my mind as to what the semantics ought
to be. I do understand Peter's point that having to specify a
particular index would be grotty, but I'm not sure it invalidates my
point that having to work across multiple indexes could lead to
surprising results in some scenarios. I'm not going to stand here and
hold my breath, though: if that's the only thing that makes me nervous
about the final patch, I'll not object to it on that basis.
I should be so lucky! :-)
Unfortunately, I have a very busy schedule in the month ahead,
including travelling to Ireland and Japan, so I don't think I'm going
to get the opportunity to work on this too much. I'll try and produce
a V4 that formally proposes some variant of my ideas around visibility
of locked tuples.
Here are some things you might not like about this patch, if we're
still assuming that the value locks are prototype and it's useful to
defer discussion around their implementation:
* The lock starvation hazards around going from value locking to row
locking, and retrying if it doesn't work out (i.e. if the row and its
descendant rows cannot be locked without what would ordinarily
necessitate using EvalPlanQual()). I don't see what we could do about
those, other than checking for changes in the row's unique index
values, which would be complex. I understand the temptation to do
that, but the fact is that that isn't going to work all the time -
some unique index value may well change every time. By doing that
you've already accepted whatever hazard may exist, and it becomes a
question of degree. Which is fine, but I don't see that the current
degree is actually much of a problem in the real world.
* Reordering of value locks generally. I still need to ensure this
will behave reasonably at higher isolation levels (i.e. they'll get a
serialization failure). I think that Robert accepts that this isn't
inconsistent with read committed's documented behavior, and that it is
useful, and maybe even essential.
* The basic question of whether or not it's possible to lock values
and rows at the same time, and if that matters (because it turns out
what looks like that isn't, because deleters will effectively lock
values without even touching an index). I think Robert saw the
difficulty of doing this, but it would be nice to get a definitive
answer. I think that any MERGE implementation worth its salt will not
deadlock without the potential for multiple rows to be locked in an
inconsistent order, so this shouldn't either, and as I believe I
demonstrated, value locks and row locks should not be held at the same
time for at least that reason. Right?
* The syntax. I like the composability, and the way it's likely to
become idiomatic to combine it with wCTEs. Others may not.
* The visibility hacks that V4 is likely to have. The fact that
preserving the composable syntax may imply changes to
HeapTupleSatisfiesMVCC() so that rows locked but with no currently
visible version (under conventional rules) are visible to our snapshot
by virtue of having been locked all the same (this only matters at
read committed).
So I think that what this patch really could benefit from is lots of
scrutiny around the concurrency issues. It would be unfair to ask for
that before at least producing a V4, so I'll clean up what I already
have and post it, probably on Sunday.
--
Peter Geoghegan
On Wed, Oct 9, 2013 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
* The lock starvation hazards around going from value locking to row
locking, and retrying if it doesn't work out (i.e. if the row and its
descendant rows cannot be locked without what would ordinarily
necessitate using EvalPlanQual()). I don't see what we could do about
those, other than checking for changes in the row's unique index
values, which would be complex. I understand the temptation to do
that, but the fact is that that isn't going to work all the time -
some unique index value may well change every time. By doing that
you've already accepted whatever hazard may exist, and it becomes a
question of degree. Which is fine, but I don't see that the current
degree is actually much of a problem in the real world.
Some of the decisions we make here may end up being based on measured
performance rather than theoretical analysis.
* Reordering of value locks generally. I still need to ensure this
will behave reasonably at higher isolation levels (i.e. they'll get a
serialization failure). I think that Robert accepts that this isn't
inconsistent with read committed's documented behavior, and that it is
useful, and maybe even essential.
I think there's a sentence missing here, or something. Obviously, the
behavior at higher isolation levels is neither consistent nor
inconsistent with read committed's documented behavior; it's another
issue entirely.
* The basic question of whether or not it's possible to lock values
and rows at the same time, and if that matters (because it turns out
what looks like that isn't, because deleters will effectively lock
values without even touching an index). I think Robert saw the
difficulty of doing this, but it would be nice to get a definitive
answer. I think that any MERGE implementation worth its salt will not
deadlock without the potential for multiple rows to be locked in an
inconsistent order, so this shouldn't either, and as I believe I
demonstrated, value locks and row locks should not be held at the same
time for at least that reason. Right?
Right.
* The syntax. I like the composability, and the way it's likely to
become idiomatic to combine it with wCTEs. Others may not.
I've actually lost track of what syntax you're proposing.
* The visibility hacks that V4 is likely to have. The fact that
preserving the composable syntax may imply changes to
HeapTupleSatisfiesMVCC() so that rows locked but with no currently
visible version (under conventional rules) are visible to our snapshot
by virtue of having been locked all the same (this only matters at
read committed).
I continue to think this is a bad idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 9, 2013 at 5:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
* Reordering of value locks generally. I still need to ensure this
will behave reasonably at higher isolation levels (i.e. they'll get a
serialization failure). I think that Robert accepts that this isn't
inconsistent with read committed's documented behavior, and that it is
useful, and maybe even essential.
I think there's a sentence missing here, or something. Obviously, the
behavior at higher isolation levels is neither consistent nor
inconsistent with read committed's documented behavior; it's another
issue entirely.
Here, "this" referred to the reordering concept generally. So I was
just saying that I'm not actually introducing any anomaly that is
described by the standard at read committed, and that at repeatable
read+, we can have actual serial ordering of value locks without
requiring them to last a long time, because we can throw serialization
failures, and can even do so when not strictly logically necessary.
* The basic question of whether or not it's possible to lock values
and rows at the same time, and if that matters (because it turns out
what looks like that isn't, because deleters will effectively lock
values without even touching an index). I think Robert saw the
difficulty of doing this, but it would be nice to get a definitive
answer. I think that any MERGE implementation worth its salt will not
deadlock without the potential for multiple rows to be locked in an
inconsistent order, so this shouldn't either, and as I believe I
demonstrated, value locks and row locks should not be held at the same
time for at least that reason. Right?
Right.
I'm glad we're on the same page with that - it's a very important
consideration to my mind.
* The syntax. I like the composability, and the way it's likely to
become idiomatic to combine it with wCTEs. Others may not.
I've actually lost track of what syntax you're proposing.
I'm continuing to propose:
INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
with a much less interesting variant that could be jettisoned:
INSERT...ON DUPLICATE KEY IGNORE
I'm also proposing extended RETURNING to make it work with this. So
the basic idea is that within Postgres, the idiomatic way to correctly
do upsert becomes something like:
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
* The visibility hacks that V4 is likely to have. The fact that
preserving the composable syntax may imply changes to
HeapTupleSatisfiesMVCC() so that rows locked but with no currently
visible version (under conventional rules) are visible to our snapshot
by virtue of having been locked all the same (this only matters at
read committed).
I continue to think this is a bad idea.
Fair enough.
Is it just because of performance concerns? If so, that's probably not
that hard to address. It either has a measurable impact on performance
for a very unsympathetic benchmark or it doesn't. I guess that's the
standard that I'll be held to, which is probably fair.
Do you see the appeal of the composable syntax?
I appreciate that it's odd that serializable transactions now have to
worry about seeing something they shouldn't have seen (when they
conclusively have to go lock a row version not current to their
snapshot). But that's simpler than any of the alternatives that I see.
Does there really need to be a new snapshot type with one tiny
difference that apparently doesn't actually affect conventional
clients of MVCC snapshots?
--
Peter Geoghegan
On Wed, Oct 9, 2013 at 9:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
* The syntax. I like the composability, and the way it's likely to
become idiomatic to combine it with wCTEs. Others may not.
I've actually lost track of what syntax you're proposing.
I'm continuing to propose:
INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
with a much less interesting variant that could be jettisoned:
INSERT...ON DUPLICATE KEY IGNORE
I'm also proposing extended RETURNING to make it work with this. So
the basic idea is that within Postgres, the idiomatic way to correctly
do upsert becomes something like:
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
I can't claim to be enamored of this syntax.
* The visibility hacks that V4 is likely to have. The fact that
preserving the composable syntax may imply changes to
HeapTupleSatisfiesMVCC() so that rows locked but with no currently
visible version (under conventional rules) are visible to our snapshot
by virtue of having been locked all the same (this only matters at
read committed).
I continue to think this is a bad idea.
Fair enough.
Is it just because of performance concerns? If so, that's probably not
that hard to address. It either has a measurable impact on performance
for a very unsympathetic benchmark or it doesn't. I guess that's the
standard that I'll be held to, which is probably fair.
That's part of it; but I also think that HeapTupleSatisfiesMVCC() is a
pretty fundamental bit of the system that I am loathe to tamper with.
We can try to talk ourselves into believing that the definition change
will only affect this case, but I'm wary that there will be
unanticipated consequences, or simply that we'll find, after it's far
too late to do anything about it, that we don't particularly care for
the new semantics. It's probably an overstatement to say that I'll
oppose anything whatsoever that touches the semantics of that function, but
not by much.
Do you see the appeal of the composable syntax?
To some extent. It seems to me that what we're designing is a giant
grotty hack, albeit a convenient one. But if we're not really going
to get MERGE, I'm not sure how much good it is to try to pretend we've
got something general.
I appreciate that it's odd that serializable transactions now have to
worry about seeing something they shouldn't have seen (when they
conclusively have to go lock a row version not current to their
snapshot).
Surely that's never going to be acceptable. At read committed,
locking a version not current to the snapshot might be acceptable if
we hold our nose, but at any higher level I think we have to fail with
a serialization complaint.
But that's simpler than any of the alternatives that I see.
Does there really need to be a new snapshot type with one tiny
difference that apparently doesn't actually affect conventional
clients of MVCC snapshots?
I think that's the wrong way of thinking about it. If you're
introducing a new type of snapshot, or tinkering with the semantics of
an existing one, I think that's a reason to reject the patch straight
off. We should be looking for a design that doesn't require that. If
we can't find one, I'm not sure we should do this at all.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-10-11 08:43:43 -0400, Robert Haas wrote:
I appreciate that it's odd that serializable transactions now have to
worry about seeing something they shouldn't have seen (when they
conclusively have to go lock a row version not current to their
snapshot).
Surely that's never going to be acceptable. At read committed,
locking a version not current to the snapshot might be acceptable if
we hold our nose, but at any higher level I think we have to fail with
a serialization complaint.
I think an UPSERTish action in RR/SERIALIZABLE that notices a concurrent
update should and has to *ALWAYS* raise a serialization
failure. Anything else will cause violations of the given guarantees.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Oct 11, 2013 at 10:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-10-11 08:43:43 -0400, Robert Haas wrote:
I appreciate that it's odd that serializable transactions now have to
worry about seeing something they shouldn't have seen (when they
conclusively have to go lock a row version not current to their
snapshot).
Surely that's never going to be acceptable. At read committed,
locking a version not current to the snapshot might be acceptable if
we hold our nose, but at any higher level I think we have to fail with
a serialization complaint.
I think an UPSERTish action in RR/SERIALIZABLE that notices a concurrent
update should and has to *ALWAYS* raise a serialization
failure. Anything else will cause violations of the given guarantees.
Sorry, this was just a poor choice of words on my part. I totally
agree with you here. Although I wasn't even talking about noticing a
concurrent update - I was talking about noticing that a tuple that
it's necessary to lock isn't visible to a serializable snapshot in the
first place (which should also fail).
What I actually meant was that it's odd that that one case (reason for
returning) added to HeapTupleSatisfiesMVCC() will always obligate
Serializable transactions to throw a serialization failure. Though
that isn't strictly true; the modifications to
HeapTupleSatisfiesMVCC() that I'm likely to propose also redundantly
work for other cases where, if I'm not mistaken, that's okay (today,
if you've exclusively locked a tuple and it hasn't been
updated/deleted, why shouldn't it be visible to your snapshot?). The
onus is on the executor-level code to notice this
should-be-invisibility for non-read-committed, probably immediately
after returning from value locking.
--
Peter Geoghegan
On Fri, Oct 11, 2013 at 5:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
* The visibility hacks that V4 is likely to have. The fact that
preserving the composable syntax may imply changes to
HeapTupleSatisfiesMVCC() so that rows locked but with no currently
visible version (under conventional rules) are visible to our snapshot
by virtue of having been locked all the same (this only matters at
read committed).
I continue to think this is a bad idea.
Is it just because of performance concerns?
That's part of it; but I also think that HeapTupleSatisfiesMVCC() is a
pretty fundamental bit of the system that I am loathe to tamper with.
We can try to talk ourselves into believing that the definition change
will only affect this case, but I'm wary that there will be
unanticipated consequences, or simply that we'll find, after it's far
too late to do anything about it, that we don't particularly care for
the new semantics. It's probably an overstatement to say that I'll
oppose anything whatsoever that touches the semantics of that function, but
not by much.
A tuple that is exclusively locked by our transaction and not updated
or deleted being visible on that basis alone isn't *that* hard to
reason about. Granted, we need to be very careful here, but we're
talking about 3 lines of code.
Do you see the appeal of the composable syntax?
To some extent. It seems to me that what we're designing is a giant
grotty hack, albeit a convenient one. But if we're not really going
to get MERGE, I'm not sure how much good it is to try to pretend we've
got something general.
Well, to be fair perhaps all of the things that you consider grotty
hacks seem like inherent requirements to me, for any half-way
reasonable upsert implementation on any system, that has the essential
property of upsert: an atomic insert-or-update (or maybe serialization
failure).
But that's simpler than any of the alternatives that I see.
Does there really need to be a new snapshot type with one tiny
difference that apparently doesn't actually affect conventional
clients of MVCC snapshots?
I think that's the wrong way of thinking about it. If you're
introducing a new type of snapshot, or tinkering with the semantics of
an existing one, I think that's a reason to reject the patch straight
off. We should be looking for a design that doesn't require that. If
we can't find one, I'm not sure we should do this at all.
I'm confused by this. We need to lock a row not visible to our
snapshot under conventional rules. I think we can rule out
serialization failures at read committed. That just leaves changing
something about the visibility rules of an existing snapshot type, or
creating a new snapshot type, no?
It would also be unacceptable to update a tuple, and not have the new
row version (which of course will still have "information from the
future") visible to our snapshot - what would regular RETURNING
return? So what do you have in mind? I don't think that locking a row
and updating it are really that distinct anyway. The benefit of
locking is that we don't have to update. We can delete, for example.
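For instance, a sketch of the delete flavor of the same wCTE pattern
(same illustrative foo table as before):

with r as (
insert into foo(a, b)
values (5, '!')
on duplicate key lock for update
returning rejects *
)
delete from foo using r where foo.a = r.a;

That's only reasonable because the conflicting row is already locked
by the time the delete runs.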
Perhaps I've totally missed your point here, but to me it sounds like
you're saying that certain properties must always be preserved that
are fundamentally in tension with upsert working in the way people
expect, and the way it is bound to actually work in numerous other
systems.
--
Peter Geoghegan
On Wed, Oct 9, 2013 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
Unfortunately, I have a very busy schedule in the month ahead,
including travelling to Ireland and Japan, so I don't think I'm going
to get the opportunity to work on this too much. I'll try and produce
a V4 that formally proposes some variant of my ideas around visibility
of locked tuples.
V4 is attached.
Most notably, this adds the modifications to HeapTupleSatisfiesMVCC(),
though they're neater than in the snippet I sent earlier.
There is also some clean-up around row-level locking. That code has
been simplified. I also try and handle serialization failures in a
better way, though that really needs the attention of a subject matter
expert.
There are a few additional XXX comments highlighting areas of concern,
particularly around serializable behavior. I've deferred making higher
isolation levels care about wrongfully relying on the special
HeapTupleSatisfiesMVCC() exception (e.g. they won't throw a
serialization failure, mostly because I couldn't decide on where to do
the test in time, prior to travelling tomorrow).
I've added code to do heap_prepare_insert before value locks are held.
Whatever our eventual value locking implementation, that's going to be
a useful optimization. Though unfortunately I ran out of time to give
this the scrutiny it really deserves, I suppose that it's something
that we can return to later.
I ask that reviewers continue to focus on concurrency issues and broad
design issues, and continue to defer discussion about an eventual
value locking implementation. I continue to think that that's the most
useful way of proceeding for the time being. My earlier points about
probable areas of concern [1] remain a good place for reviewers to
start.
[1]: /messages/by-id/CAM3SWZSvSrTzPhjNPjahtJ0rFfS-gJFhU86Vpewf+eO8GwZXNQ@mail.gmail.com
--
Peter Geoghegan
On Fri, Oct 11, 2013 at 2:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
But that's simpler than any of the alternatives that I see.
Does there really need to be a new snapshot type with one tiny
difference that apparently doesn't actually affect conventional
clients of MVCC snapshots?
I think that's the wrong way of thinking about it. If you're
introducing a new type of snapshot, or tinkering with the semantics of
an existing one, I think that's a reason to reject the patch straight
off. We should be looking for a design that doesn't require that. If
we can't find one, I'm not sure we should do this at all.
I'm confused by this. We need to lock a row not visible to our
snapshot under conventional rules. I think we can rule out
serialization failures at read committed. That just leaves changing
something about the visibility rules of an existing snapshot type, or
creating a new snapshot type, no?
It would also be unacceptable to update a tuple, and not have the new
row version (which of course will still have "information from the
future") visible to our snapshot - what would regular RETURNING
return? So what do you have in mind? I don't think that locking a row
and updating it are really that distinct anyway. The benefit of
locking is that we don't have to update. We can delete, for example.
Well, the SQL standard way of doing this type of operation is MERGE.
The alternative we know exists in other databases is REPLACE; there's
also INSERT .. ON DUPLICATE KEY update. In all of those cases,
whatever weirdness exists around MVCC is confined to that one command.
I tend to think we should do similarly, with the goal that
HeapTupleSatisfiesMVCC need not change at all.
I don't have the only vote here, of course, but my feeling is that
that's more likely to be a good route.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 15, 2013 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, the SQL standard way of doing this type of operation is MERGE.
The alternative we know exists in other databases is REPLACE; there's
also INSERT .. ON DUPLICATE KEY update. In all of those cases,
whatever weirdness exists around MVCC is confined to that one command.
I tend to think we should do similarly, with the goal that
HeapTupleSatisfiesMVCC need not change at all.
I don't think that it's very pragmatic to define success in terms of
not modifying a single visibility function. I feel it would be more
useful to define it as providing acceptable, non-surprising semantics,
while not regressing performance in other areas.
The fact remains that you're going to have to create a new snapshot
type even for this special case, so I don't see any win as regards
managing invasiveness here. Quite the contrary, in fact.
I don't have the only vote here, of course, but my feeling is that
that's more likely to be a good route.
Naturally we all want MERGE. It seems self-defeating to insist on
something significantly harder that there is significantly less demand
for, though. I thought that there was at least informal agreement that
this sort of approach was preferable to MERGE in its full generality,
based on feedback at the 2012 developer meeting. I really don't think
that what I've done here is any worse than INSERT...ON DUPLICATE KEY
UPDATE in any of the areas you express concern about here. REPLACE has
some serious problems, and I just don't see it as a viable alternative
at all - just ask any MySQL user.
MERGE is of course more flexible than what I have here in some ways, but
actually less flexible in other ways. I think that the real point of
MERGE is that it's defined in a way that serves data warehousing use
cases very well: the semantics constrain things such that the executor
only has to execute a single ModifyTable node that does inserts,
updates and deletes in a single scan. That's great, but what if it's
useful to do that CRUD (yes, this can include selects) to entirely
different tables? Or what if the relevant DML will only come in a
later statement in the same transaction?
--
Peter Geoghegan
On Tue, Oct 15, 2013 at 8:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
Naturally we all want MERGE. It seems self-defeating to insist on
something significantly harder that there is significant less demand
for, though.
I hasten to add: which is not to imply that you're insisting rather
than expressing a sentiment.
--
Peter Geoghegan
On Tue, Oct 15, 2013 at 11:07 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Oct 15, 2013 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, the SQL standard way of doing this type of operation is MERGE.
The alternative we know exists in other databases is REPLACE; there's
also INSERT .. ON DUPLICATE KEY update. In all of those cases,
whatever weirdness exists around MVCC is confined to that one command.
I tend to think we should do similarly, with the goal that
HeapTupleSatisfiesMVCC need not change at all.
I don't think that it's very pragmatic to define success in terms of
not modifying a single visibility function. I feel it would be more
useful to define it as providing acceptable, non-surprising semantics,
while not regressing performance in other areas.
The fact remains that you're going to have to create a new snapshot
type even for this special case, so I don't see any win as regards
managing invasiveness here. Quite the contrary, in fact.
Well, we might have to agree to disagree.
I don't have the only vote here, of course, but my feeling is that
that's more likely to be a good route.
Naturally we all want MERGE. It seems self-defeating to insist on
something significantly harder that there is significantly less demand
for, though. I thought that there was at least informal agreement that
this sort of approach was preferable to MERGE in its full generality,
based on feedback at the 2012 developer meeting. I really don't think
that what I've done here is any worse than INSERT...ON DUPLICATE KEY
UPDATE in any of the areas you express concern about here. REPLACE has
some serious problems, and I just don't see it as a viable alternative
at all - just ask any MySQL user.
MERGE is of course more flexible than what I have here in some ways, but
actually less flexible in other ways. I think that the real point of
MERGE is that it's defined in a way that serves data warehousing use
cases very well: the semantics constrain things such that the executor
only has to execute a single ModifyTable node that does inserts,
updates and deletes in a single scan. That's great, but what if it's
useful to do that CRUD (yes, this can include selects) to entirely
different tables? Or what if the relevant DML will only come in a
later statement in the same transaction?
I'm not saying "go implement MERGE". I'm saying, make the
insert-or-update operation a single statement, using some syntax TBD,
instead of requiring the use of a new insert statement that makes
invisible rows visible as a side effect, so that you can wrap that in
a CTE and feed it to an update statement. That's complex and, AFAICS,
unlike how any other database product handles this.
Again, other people can have different opinions on this, and that's
fine. I'm just giving you mine.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-10-15 11:11:24 -0400, Robert Haas wrote:
I'm not saying "go implement MERGE". I'm saying, make the
insert-or-update operation a single statement, using some syntax TBD,
instead of requiring the use of a new insert statement that makes
invisible rows visible as a side effect, so that you can wrap that in
a CTE and feed it to an update statement. That's complex and, AFAICS,
unlike how any other database product handles this.
I think we most definitely should provide a single statement
variant. That's the one users yearn for.
I also would like a variant where I can lock a row on conflict, for
multimaster scenarios, but that doesn't necessarily have to be exposed
to SQL.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 15, 2013 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not saying "go implement MERGE". I'm saying, make the
insert-or-update operation a single statement, using some syntax TBD,
instead of requiring the use of a new insert statement that makes
invisible rows visible as a side effect, so that you can wrap that in
a CTE and feed it to an update statement. That's complex and, AFAICS,
unlike how any other database product handles this.
Well, lots of other databases have their own unique way of doing this
- apart from MySQL's INSERT...ON DUPLICATE KEY UPDATE, there is a
variant within Teradata, Sybase and SQLite. They're all different. And
in the case of Teradata, it was an interim feature towards MERGE which
came in a much later release, which is how I see this.
No other database system even has writeable CTEs, of course. It's a
fairly recent idea.
Again, other people can have different opinions on this, and that's
fine. I'm just giving you mine.
I will defer to the majority opinion here. But you also expressed
concern about surprising results due to the wrong unique constraint
violation being the source of a conflict. Couldn't this syntax (with
the wCTE upsert pattern) help with that, by naming the constant
inserted in the update too? It would be pretty simple to expose that,
and far less grotty than naming a unique index in DML.
--
Peter Geoghegan
On Tue, Oct 15, 2013 at 11:34 AM, Peter Geoghegan <pg@heroku.com> wrote:
Again, other people can have different opinions on this, and that's
fine. I'm just giving you mine.
I will defer to the majority opinion here. But you also expressed
concern about surprising results due to the wrong unique constraint
violation being the source of a conflict. Couldn't this syntax (with
the wCTE upsert pattern) help with that, by naming the constant
inserted in the update too? It would be pretty simple to expose that,
and far less grotty than naming a unique index in DML.
Well, I don't know that any of us can claim to have a lock on what the
syntax should look like. I think we need to hear some proposals.
You've heard my gripe about the current syntax (which Andres appears
to share), but I shan't attempt to prejudice you in favor of my
preferred alternative, because I don't have one yet. There could be
other ways of avoiding that problem, though. Here's an example:
UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET
(nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln)
That's pretty ugly on multiple levels, and I'm definitely not
proposing that exact thing, but the idea is: look for a record that
matches on the key columns/values; if found, update the non-key
columns with the corresponding values; if not found, construct a new
row with both the key and nonkey column sets and insert it. If no
matching unique index exists we'll have to fail, but we stop short of
having to mention the name of that index.
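With the foo table from upthread, that shape might come out as
something like this (purely illustrative, not a concrete proposal):

UPSERT foo (a) = (5) SET (b) = ('!');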
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 15, 2013 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, I don't know that any of us can claim to have a lock on what the
syntax should look like.
Sure. But it's not just syntax. We're talking about functional
differences too, since you're talking about mandating an update, which
is not the same as an "update locked row only conditionally", or a
delete.
I get that it's a little verbose, but then this is ORM plumbing for
many of those that would prefer a more succinct syntax. Those people
would also benefit from having their ORM do something much more
powerful for them when needed.
I think we need to hear some proposals.
Agreed.
You've heard my gripe about the current syntax (which Andres appears
to share), but I shan't attempt to prejudice you in favor of my
preferred alternative, because I don't have one yet.
FWIW, I sincerely see very real advantages to what I've proposed here.
To me, the fact that it's convenient to implement is beside the point.
There could be
other ways of avoiding that problem, though. Here's an example:
UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET
(nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln)
That's pretty ugly on multiple levels, and I'm definitely not
proposing that exact thing, but the idea is: look for a record that
matches on the key columns/values; if found, update the non-key
columns with the corresponding values; if not found, construct a new
row with both the key and nonkey column sets and insert it. If no
matching unique index exists we'll have to fail, but we stop short of
having to mention the name of that index.
What if you want to update the key columns - either the potential
conflict-causing one, or another? What about composite unique
constraints? MySQL certainly supports all that, for example.
--
Peter Geoghegan
On 2013-10-15 10:19:17 -0700, Peter Geoghegan wrote:
On Tue, Oct 15, 2013 at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Well, I don't know that any of us can claim to have a lock on what the
syntax should look like.
Sure. But it's not just syntax. We're talking about functional
differences too, since you're talking about mandating an update, which
is not the same as an "update locked row only conditionally", or a
delete.
I think anything that only works by breaking visibility rules that way
is a nonstarter. Doing that from the C level is one thing, exposing it
this way seems a bad idea.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I think anything that only works by breaking visibility rules that way
is a nonstarter. Doing that from the C level is one thing, exposing it
this way seems a bad idea.
What visibility rule is that? Upsert *has* to do effectively the same
thing as what I've proposed - there is no getting away from it. So
maybe the visibility rulebook (which as far as I can tell is "the way
things work today") needs to be updated. If we did, say, INSERT...ON
DUPLICATE KEY UPDATE, we'd have to update a row with potentially no
visible-to-snapshot version *at all*, and make a new version of that
visible. That's just what it takes. What's the difference between that
and just locking? If the only difference is that it isn't necessary to
modify tqual.c because you're passing a tid directly, that isn't a
user-visible difference - the "rule" has been broken just the same.
Arguably, it's even more of a hack, since it's a special, out-of-band
visibility exception. I'm happy to have total scrutiny of changes to
tqual.c, but I'm surprised that the mere fact of it having been
modified is being weighed so heavily.
Another thing that I'm not clear on is how an update can be backed out
of if the row is modified by another xact. As I think I've already
illustrated, the row locking that takes place has to be kind of
opportunistic. I'm sure you could do it, but it would probably be
quite invasive.
--
Peter Geoghegan
On 10/15/2013 08:11 AM, Robert Haas wrote:
I'm not saying "go implement MERGE". I'm saying, make the
insert-or-update operation a single statement, using some syntax TBD,
instead of requiring the use of a new insert statement that makes
invisible rows visible as a side effect, so that you can wrap that in
a CTE and feed it to an update statement. That's complex and, AFAICS,
unlike how any other database product handles this.
Hmmm. Is the plan NOT to eventually get to a single-statement upsert?
If not, then I'm not that keen on this feature. I can't say that
anybody I know who's migrating from MySQL would use a 2-statement
version of upsert; if they were prepared for that, then they'd be
prepared to just rewrite their stuff as proper insert/updates anyway.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Oct 15, 2013 at 10:58 AM, Josh Berkus <josh@agliodbs.com> wrote:
Hmmm. Is the plan NOT to eventually get to a single-statement upsert?
If not, then I'm not that keen on this feature.
See the original e-mail in the thread for what I imagine idiomatic
usage will look like.
/messages/by-id/CAM3SWZThwrKtvurf1aWAiH8qThGNMZAfyDcNw8QJu7pqHk5AGQ@mail.gmail.com
--
Peter Geoghegan
On Tue, Oct 15, 2013 at 11:05 AM, Peter Geoghegan <pg@heroku.com> wrote:
See the original e-mail in the thread for what I imagine idiomatic
usage will look like.
/messages/by-id/CAM3SWZThwrKtvurf1aWAiH8qThGNMZAfyDcNw8QJu7pqHk5AGQ@mail.gmail.com
Note also that this doesn't preclude a variant with a more direct
update part (not that I think that's all that compelling). Doing
things this way was motivated by:
1) Serving the needs of logical changeset generation plugins, even if
Andres doesn't think that needs to be exposed through SQL. He and I
both want something that does this with low overhead (in particular,
no subtransactions).
2) Getting something effective into the next release. MERGE-like
flexibility seems like a very desirable thing. And the
implementation's infrastructure can be used by an eventual MERGE
implementation.
3) Being simple enough that huge bike shedding over syntax might not
be necessary. Making insert statements grow an update tumor is likely
to get messy fast. I know because I tried it myself.
--
Peter Geoghegan
Peter,
Note also that this doesn't preclude a variant with a more direct
update part (not that I think that's all that compelling). Doing
things this way was motivated by:
I can see the value in the CTE format for this for existing PostgreSQL
users.
(although, AFAICT it doesn't allow for the implementation of one of my
personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases
where updates are expected to occur 95% of the time, but that's another
topic. Unless "rejects" for an Update could be the leftover rows, but
then we're getting into full MERGE.).
I'm just pointing out that this doesn't do much for the MySQL migration
case; the rewrite is too complex to automate. I'd been assuming that we
had some plans to implement a MySQL-friendly syntax for 9.5, and this
version was a stepping stone to that.
Does this version make a distinction between PRIMARY KEY constraints and
UNIQUE indexes? If not, how does it pick among keys? If so, what about
tables with no PRIMARY KEY for various reasons (like unique GiST indexes?)
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Oct 15, 2013 at 11:23 AM, Josh Berkus <josh@agliodbs.com> wrote:
(although, AFAICT it doesn't allow for the implementation of one of my
personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases
where updates are expected to occur 95% of the time, but that's another
topic. Unless "rejects" for an Update could be the leftover rows, but
then we're getting into full MERGE.).
This isn't really all that inefficient for that case. Certainly, the
balance in cost between mostly-insert cases and mostly-update cases is
a strength of my basic approach over others.
Does this version make a distinction between PRIMARY KEY constraints and
UNIQUE indexes? If not, how does it pick among keys? If so, what about
tables with no PRIMARY KEY for various reasons (like unique GiST indexes?)
We thought about prioritizing where to look (mostly as a performance
optimization), but right now no. It works with amcanunique methods,
which in practice means btrees. There is no such thing as a GiST
unique index, so I guess you're referring to an exclusion constraint
on an equality operator. That doesn't work with this, but why would
you want it to? As for generalizing this to work with exclusion
constraints, which I guess you might have also meant, that's a much
more difficult and much less compelling proposition, in my opinion.
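For clarity, "an exclusion constraint on an equality operator" means
something like the following (table name is purely illustrative), which
behaves much like a unique constraint but is not one as far as amcanunique
is concerned:

create table t (a int, exclude using btree (a with =));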
--
Peter Geoghegan
On 10/15/2013 11:38 AM, Peter Geoghegan wrote:
We thought about prioritizing where to look (mostly as a performance
optimization), but right now no. It works with amcanunique methods,
which in practice means btrees. There is no such thing as a GiST
unique index, so I guess you're referring to an exclusion constraint
on an equality operator. That doesn't work with this, but why would
you want it to? As for generalizing this to work with exclusion
constraints, which I guess you might have also meant, that's a much
more difficult and much less compelling proposition, in my opinion.
Yeah, that was one thing I was thinking of.
Also, because you can't INDEX CONCURRENTLY a PK, I've been building a
lot of databases which have no PKs, only UNIQUE indexes. Historically,
this hasn't been an issue because aside from wonky annoyances (like the
CONCURRENTLY case), Postgres doesn't distinguish between UNIQUE indexes
and PRIMARY KEYs -- as, indeed, it shouldn't, since they're both keys,
and the whole concept of a "primary key" is a legacy of index-organized
databases, which PostgreSQL is not.
However, it does seem like the new syntax could be extended with an
optional "USING unique_index_name" in the future (9.5), no?
I'm just checking that we're not painting ourselves into a corner with
this particular implementation. It's OK if it doesn't implement most
things now; it's bad if it is impossible to build on and we have to
support it forever.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Oct 15, 2013 at 11:55 AM, Josh Berkus <josh@agliodbs.com> wrote:
However, it does seem like the new syntax could be extended with an
optional "USING unique_index_name" in the future (9.5), no?
There is no reason why we couldn't do that and just consider that one
unique index. Whether we should is another question - I certainly
think that mandating it would be very bad.
I'm just checking that we're not painting ourselves into a corner with
this particular implementation. It's OK if it doesn't implement most
things now; it's bad if it is impossible to build on and we have to
support it forever.
I don't believe it does. In essence this simply inserts a row,
and rather than throwing a unique constraint violation, locks the row
that prevented insertion from proceeding in respect of any tuple
proposed for insertion where it does not. That's all. You can build
lots of things with it that you can't today. Or you can not use it at
all. So that covers semantics, I'd say.
As for implementation: I believe that the implementation is by far the
most forward thinking (in terms of building infrastructure for a
proper MERGE) of any proposal to date.
--
Peter Geoghegan
On 10/15/2013 12:03 PM, Peter Geoghegan wrote:
On Tue, Oct 15, 2013 at 11:55 AM, Josh Berkus <josh@agliodbs.com> wrote:
However, it does seem like the new syntax could be extended with an
optional "USING unique_index_name" in the future (9.5), no?
There is no reason why we couldn't do that and just consider that one
unique index. Whether we should is another question -
What's the "shouldn't" argument, if any?
I certainly
think that mandating it would be very bad.
Agreed. If there is a PK, we should allow the user to use it implicitly.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 2013-10-15 10:53:35 -0700, Peter Geoghegan wrote:
On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I think anything that only works by breaking visibility rules that way
is a nonstarter. Doing that from the C level is one thing, exposing it
this way seems a bad idea.
What visibility rule is that?
The early return you added to HTSMVCC.
At the very least it opens you to lots of halloween problem like
scenarios.
Upsert *has* to do effectively the same thing as what I've proposed -
there is no getting away from it. So maybe the visibility rulebook
(which as far as I can tell is "the way things work today") needs to
be updated. If we did, say, INSERT...ON DUPLICATE KEY UPDATE, we'd
have to update a row with potentially no visible-to-snapshot version
*at all*, and make a new version of that visible. That's just what it
takes. What's the difference between that and just locking? If the
only difference is that it isn't necessary to modify tqual.c because
you're passing a tid directly, that isn't a user-visible difference -
the "rule" has been broken just the same. Arguably, it's even more of
a hack, since it's a special, out-of-band visibility exception.
No, doing it in special case code is fundamentally different since those
locations deal only with one row at a time. There's no scans that can
pass over that row.
That's why I think exposing the "on conflict lock" logic to anything but
C isn't going to fly btw.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-15 11:23:44 -0700, Josh Berkus wrote:
(although, AFAICT it doesn't allow for the implementation of one of my
personal desires, which is UPDATE ... ON NOT FOUND INSERT, for cases
where updates are expected to occur 95% of the time, but that's another
topic. Unless "rejects" for an Update could be the leftover rows, but
then we're getting into full MERGE.).
FWIW I can't see the above syntax as something working very well - you
fundamentally have to SET every column and it only makes sense in
UPDATEs that provably affect only one row.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-10-15 11:55:06 -0700, Josh Berkus wrote:
Also, because you can't INDEX CONCURRENTLY a PK, I've been building a
lot of databases which have no PKs, only UNIQUE indexes.
You know that you can add prebuilt primary keys using ALTER TABLE
... ADD CONSTRAINT ... PRIMARY KEY (...) USING indexname?
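Concretely, with illustrative table and index names, that pattern is:

create unique index concurrently t_id_key on t (id);
alter table t add constraint t_pkey primary key using index t_id_key;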
Postgres doesn't distinguish between UNIQUE indexes
and PRIMARY KEYs -- as, indeed, it shouldn't, since they're both keys,
and the whole concept of a "primary key" is a legacy of index-organized
databases, which PostgreSQL is not.
There are some other differences: for one, primary keys are automatically
picked up by foreign keys if the referenced columns aren't specified;
for another, we do not yet automatically recognize NOT NULL UNIQUE
columns in GROUP BY.
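For example (illustrative tables):

create table parent (id int primary key, name text);
create table child (parent_id int references parent);  -- FK columns default to parent's primary key
select id, name from parent group by id;                -- allowed only because id is the primary key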
However, it does seem like the new syntax could be extended with an
optional "USING unique_index_name" in the future (9.5), no?
Yes.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 10/15/2013 02:31 PM, Andres Freund wrote:
On 2013-10-15 11:55:06 -0700, Josh Berkus wrote:
Also, because you can't INDEX CONCURRENTLY a PK, I've been building a
lot of databases which have no PKs, only UNIQUE indexes.
You know that you can add prebuilt primary keys using ALTER TABLE
... ADD CONSTRAINT ... PRIMARY KEY (...) USING indexname?
That still requires an ACCESS EXCLUSIVE lock, and then can't be dropped
using DROP INDEX CONCURRENTLY.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Oct 15, 2013 at 2:25 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-10-15 10:53:35 -0700, Peter Geoghegan wrote:
On Tue, Oct 15, 2013 at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I think anything that only works by breaking visibility rules that way
is a nonstarter. Doing that from the C level is one thing, exposing it
this way seems a bad idea.
What visibility rule is that?
The early return you added to HTSMVCC.
At the very least it opens you to lots of halloween problem like
scenarios.
The term "visibility rule" as you've used it here is suggestive of
some authoritative rule that should obviously never even be bent. I'd
suggest that what Postgres does isn't very useful as an authority on
this matter, because Postgres doesn't have upsert. Besides, today
Postgres doesn't just bend the rules (that is, some kind of classic
notion of MVCC as described in "Concurrency Control in Distributed
Database Systems" or something), it totally breaks them, at least in
READ COMMITTED mode (and what I've proposed here just occurs in RC
mode).
It is not actually in evidence that this approach introduces Halloween
problems. In order for HTSMVCC to controversially indicate visibility
under my scheme, it is not sufficient for the row version to just be
exclusive locked by our xact without otherwise being visible - it must
also *not be updated*. Now, I'll freely admit that this could still be
problematic - there might have been a subtlety I missed. But since an
actual example of where this is problematic hasn't been forthcoming, I
take it that it isn't obvious to either yourself or Robert that it
actually is. Any scheme that involves playing cute tricks with
visibility (which is to say, any credible upsert implementation) needs
very careful thought.
--
Peter Geoghegan
On Tue, Oct 15, 2013 at 1:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
There could be
other ways of avoiding that problem, though. Here's an example:
UPSERT table (keycol1, ..., keycoln) = (keyval1, ..., keyvaln) SET
(nonkeycol1, ..., nonkeycoln) = (nonkeyval1, ..., nonkeyvaln)
That's pretty ugly on multiple levels, and I'm definitely not
proposing that exact thing, but the idea is: look for a record that
matches on the key columns/values; if found, update the non-key
columns with the corresponding values; if not found, construct a new
row with both the key and nonkey column sets and insert it. If no
matching unique index exists we'll have to fail, but we stop short of
having to mention the name of that index.
What if you want to update the key columns - either the potential
conflict-causing one, or another?
I'm not sure what that means in the context of an UPSERT operation.
If the update case is, when a = 1 then make a = 2, then which value
goes in column a when we insert, 1 or 2? But I suppose if you can
work that out it's just a matter of mentioning the column as both a
key column and a non-key column.
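In terms of the sketch above, that might look something like this
(hypothetical, not proposed as-is):

UPSERT foo (a) = (1) SET (a, b) = (2, 'x');

That is, match on a = 1 and, on a match, set a = 2 and b = 'x'; which value
goes into a when no match exists is exactly the ambiguity raised above.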
What about composite unique
constraints? MySQL certainly supports all that, for example.
That's why it allows you to specify N key columns rather than
restricting you to just one.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 14.10.2013 07:12, Peter Geoghegan wrote:
On Wed, Oct 9, 2013 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
Unfortunately, I have a very busy schedule in the month ahead,
including travelling to Ireland and Japan, so I don't think I'm going
to get the opportunity to work on this too much. I'll try and produce
a V4 that formally proposes some variant of my ideas around visibility
of locked tuples.
V4 is attached.
Most notably, this adds the modifications to HeapTupleSatisfiesMVCC(),
though they're neater than in the snippet I sent earlier.
There is also some clean-up around row-level locking. That code has
been simplified. I also try and handle serialization failures in a
better way, though that really needs the attention of a subject matter
expert.
There are a few additional XXX comments highlighting areas of concern,
particularly around serializable behavior. I've deferred making higher
isolation levels care about wrongfully relying on the special
HeapTupleSatisfiesMVCC() exception (e.g. they won't throw a
serialization failure, mostly because I couldn't decide on where to do
the test on time prior to travelling tomorrow).
I've added code to do heap_prepare_insert before value locks are held.
Whatever our eventual value locking implementation, that's going to be
a useful optimization. Though unfortunately I ran out of time to give
this the scrutiny it really deserves, I suppose that it's something
that we can return to later.
I ask that reviewers continue to focus on concurrency issues and broad
design issues, and continue to defer discussion about an eventual
value locking implementation. I continue to think that that's the most
useful way of proceeding for the time being. My earlier points about
probable areas of concern [1] remain a good place for reviewers to
start.
I think it's important to recap the design goals of this. I don't think
these have been listed before, so let me try:
* It should be usable and perform well for both large batch updates and
small transactions.
* It should perform well both when there are no duplicates, and when
there are lots of duplicates
And from that follows some finer requirements:
* Performance when there are no duplicates should be close to raw INSERT
performance.
* Performance when all rows are duplicates should be close to raw UPDATE
performance.
* We should not leave behind large numbers of dead tuples in either case.
Anything else I'm missing?
What about exclusion constraints? I'd like to see this work for them as
well. Currently, exclusion constraints are checked after the tuple is
inserted, and you abort if the constraint was violated. We could still
insert the heap and index tuples first, but instead of aborting on
violation, we would kill the heap tuple we already inserted and retry.
There are some complications there, like how to wake up any other
backends that are waiting to grab a lock on the tuple we just killed,
but it seems doable.
That would, however, perform badly and leave garbage behind if there are
duplicates. A refinement of that would be to first check for constraint
violations, then insert the tuple, and then check again. That would
avoid the garbage in most cases, but would perform much more poorly when
there are no duplicates, because it needs two index scans for every
insertion. A further refinement would be to keep track of how many
duplicates there have been recently, and switch between the two
strategies based on that.
That cost of doing two scans could be alleviated by using
markpos/restrpos to do the second scan. That is presumably cheaper than
starting a whole new scan with the same key. (markpos/restrpos don't
currently work for non-MVCC snapshots, so that'd need to be fixed, though)
And that detour with exclusion constraints takes me back to the current
patch :-). What if you implemented the unique check in a similar fashion
too (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a
conflicting key, and mark the position. Then do the insertion to that
position. If the insertion fails because of a duplicate key (which was
inserted after we did the first scan), mark the heap tuple as dead, and
start over. The indexam changes would be quite similar to the changes
you made in your patch, but instead of keeping the page locked, you'd
only hold a pin on the target page (if even that). The first indexam
call would check that the key doesn't exist, and remember the insert
position. The second call would re-find the previous position, and
insert the tuple, checking again that there really wasn't a duplicate
key violation. The locking aspects would be less scary than your current
patch.
I'm not sure if that would perform as well as your current patch. I must
admit your current approach is pretty optimal performance-wise. But I'd
like to see it, and that would be a solution for exclusion constraints
in any case.
One limitation with your current approach is that the number of
lwlocks you can hold simultaneously is limited (MAX_SIMUL_LWLOCKS ==
100). Another limitation is that the minimum for shared_buffers is only
16. Neither of those is a serious problem in real applications - no-one
runs with shared_buffers=16 and no sane schema has a hundred unique
indexes, but it's still something to consider.
- Heikki
On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I think it's important to recap the design goals of this.
Seems reasonable to list them out.
* It should be usable and perform well for both large batch updates and
small transactions.
I think that that's a secondary goal, a question to be considered but
perhaps deferred during this initial effort. I agree that it certainly
is important.
* It should perform well both when there are no duplicates, and when there
are lots of duplicates
I think this is very important.
And from that follows some finer requirements:
* Performance when there are no duplicates should be close to raw INSERT
performance.
* Performance when all rows are duplicates should be close to raw UPDATE
performance.
* We should not leave behind large numbers of dead tuples in either case.
I agree with all that.
Anything else I'm missing?
I think so, yes. I'll add:
* Should not deadlock unreasonably.
If the UPDATE case is to work and perform almost as well as a regular
UPDATE, that must mean that it has essentially the same
characteristics as plain UPDATE. In particular, I feel fairly strongly
that it is not okay for upserts to deadlock with each other unless the
possibility of each transaction locking multiple rows (in an
inconsistent order) exists. I don't want to repeat the mistakes of
MySQL here. This is a point that I stressed to Robert on a previous
occasion [1]. It's why value locks and row locks cannot be held at the
same time. Incidentally, that implies that all alternative schemes
involving bloat will bloat once per attempt, I believe.
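For illustration, the multiple-rows-in-inconsistent-order case where a
deadlock remains acceptable looks like this, using the patch's syntax and
the foo table from upthread:

-- session 1:
insert into foo(a, b) values (1, 'x'), (2, 'x') on duplicate key lock for update;
-- session 2, concurrently, with the rows in the opposite order:
insert into foo(a, b) values (2, 'y'), (1, 'y') on duplicate key lock for update;

Each session can end up holding a row lock on one pre-existing row while
waiting for the other's, which is the same hazard that plain UPDATEs issued
in inconsistent orders already have.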
I'll also add:
* Should anticipate a day when Postgres needs plumbing for SQL MERGE,
which is still something we want, particularly for batch operations. I
realize that the standard doesn't strictly require MERGE to handle the
concurrency issues, but even still I don't think that an
implementation that doesn't is practicable - does such an
implementation currently exist in any other system?
What about exclusion constraints? I'd like to see this work for them as
well. Currently, exclusion constraints are checked after the tuple is
inserted, and you abort if the constraint was violated. We could still
insert the heap and index tuples first, but instead of aborting on
violation, we would kill the heap tuple we already inserted and retry. There
are some complications there, like how to wake up any other backends that
are waiting to grab a lock on the tuple we just killed, but it seems doable.
I agree that it's at least doable.
That would, however, perform badly and leave garbage behind if there are
duplicates. A refinement of that would be to first check for constraint
violations, then insert the tuple, and then check again. That would avoid
the garbage in most cases, but would perform much more poorly when there are
no duplicates, because it needs two index scans for every insertion. A
further refinement would be to keep track of how many duplicates there have
been recently, and switch between the two strategies based on that.
Seems like an awful lot of additional mechanism.
That cost of doing two scans could be alleviated by using markpos/restrpos
to do the second scan. That is presumably cheaper than starting a whole new
scan with the same key. (markpos/restrpos don't currently work for non-MVCC
snapshots, so that'd need to be fixed, though)
Well, it seems like we could already use a "pick up where you left
off" mechanism in the case of regular btree index tuple insertions
into unique indexes -- after all, we don't do that in the event of
blocking pending the outcome of the other transaction (that inserted a
duplicate that we need to conclusively know has or has not committed)
today. The fact that this doesn't already exist leaves me less than
optimistic about the prospect of making it work to facilitate a scheme
such as the one you describe here. (Today we still need to catch a
committed version of the tuple that would make our tuple a duplicate
from a fresh index scan, only *after* waiting for a transaction to
commit/abort at the end of our original index scan). So we're already
pretty naive about this, even though it would pay to not be.
Making something like markpos work for the purposes of an upsert
implementation seems not only hard, but also like a possible
modularity violation. Are we not unreasonably constraining the
implementation going forward? My patch respects the integrity of the
am abstraction, and doesn't really add any knowledge to the core
system about how amcanunique index methods might go about implementing
the new "amlock" method. The core system worries a bit about the "low
level locks" (as it naively refers to value locks), and doesn't
consider that it has the right to hold on to them for more than an
instant, but that's about it. Plus we don't have to worry about
whether something does or does not work for a certain snapshot type
with my approach, because as with the current unique index btree
coding, it operates at a lower level than that, and does not need to
consider visibility as such.
The markpos and restrpos am methods are only called for regular index
(only) scans, that don't need to worry about things that are not
visible. Of course, upsert needs to worry about
invisible-but-conclusively-live things. This seems much harder, and
basically implies value locking of some kind, if I'm not mistaken. So
have you really gained anything?
So what I've done, aside from being, as you say below, close to
optimal, is in a sense defined in terms of existing, well-established
abstractions. I feel it's easier to reason about the implications of
holding value locks (whatever the implementation) for longer and
across multiple operations than it is to do all this instead. What
I've done with locking is scary, but not as scary as the worst case of
alternative implementations.
And that detour with exclusion constraints takes me back to the current
patch :-). What if you implemented the unique check in a similar fashion too
(when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key,
and mark the position. Then do the insertion to that position. If the
insertion fails because of a duplicate key (which was inserted after we did
the first scan), mark the heap tuple as dead, and start over. The indexam
changes would be quite similar to the changes you made in your patch, but
instead of keeping the page locked, you'd only hold a pin on the target page
(if even that). The first indexam call would check that the key doesn't
exist, and remember the insert position. The second call would re-find the
previous position, and insert the tuple, checking again that there really
wasn't a duplicate key violation. The locking aspects would be less scary
than your current patch.
I'm not sure if that would perform as well as your current patch. I must
admit your current approach is pretty optimal performance-wise. But I'd like
to see it, and that would be a solution for exclusion constraints in any
case.
I'm certainly not opposed to making something like this work for
exclusion constraints. Certainly, I want this to be as general as
possible. But I don't think that it needs to be a blocker, and I don't
think we gain anything in code footprint by addressing that by being
as general as possible in our approach to the basic concurrency issue.
After all, we're going to have to repeat the basic pattern in multiple
modules.
With exclusion constraints, we'd have to worry about a single slot
proposed for insertion violating (and therefore presumably obliging us
to lock) every row in the table. Are we going to have a mechanism for
spilling a tid array potentially sized in gigabytes to disk (relating
to just one slot proposed for insertion)? Is it principled to have
that one slot project out rejects consisting of (say) the entire
table? Is it even useful to lock multiple rows if we can't really
update them, because they'll overlap each other when all updated with
the one value? These are complicated questions, and frankly I don't
have the bandwidth to answer them too soon. I just want to implement a
feature that there is obviously huge pent up demand for, that has in
the past put Postgres at a strategic disadvantage. I don't think it is
unsound to define ON DUPLICATE KEY in terms of unique indexes. That's
how we represent uniques...it isn't spelt ON OVERLAPPING or whatever.
That seems like an addition, a nice-to-have, and maybe not even that,
because exclusion-constrained columns *aren't* keys, and people aren't
likely to want to upsert details of a booking (the typical exclusion
constraint use-case) with the booking range in the UPDATE part's
predicate. They'd just do it by key, because they'd already have a
booking number PK value or whatever.
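For concreteness, the booking case described would look something like this
(schema invented for illustration):

create extension if not exists btree_gist;
create table booking (
    booking_no int primary key,
    room       int,
    during     tsrange,
    exclude using gist (room with =, during with &&)
);

An upsert against such a table would naturally key on booking_no via the
primary key, not on the range column covered by the exclusion constraint.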
Making this perform as well as possible is an important consideration.
All alternative approaches that involve bloat concern me, and for
reasons that I'm not sure were fully appreciated during earlier
discussion on this thread: I'm worried about the worst case, not the
average case. I am worried about a so-called "thundering herd"
scenario. You need something like LockTuple() to arbitrate ordering,
which seems complex, and like a further modularity violation. If this
is to perform well when there are lots of existing tuples to be
updated (with contention that wouldn't be considered unreasonable for
plain updates), the amount of bloat generated by a thundering herd
could be really really bad (once per attempt per "head of
cattle"/upserter) . It's hard to say for sure how much of a problem
this is, but I think it needs to be considered. It's a problem that
I'm not sure we have the tools to analyze ahead of time. It's easier
to pin down and reason about the conventional value locking stuff,
because we know how deadlocks work. We know how to do analysis of
deadlock hazards, and the surface area actually turns out to be not
too large there.
One limitation with your current approach is that the number of
lwlocks you can hold simultaneously is limited (MAX_SIMUL_LWLOCKS == 100).
Another limitation is that the minimum for shared_buffers is only 16.
Neither of those is a serious problem in real applications - no-one runs
with shared_buffers=16 and no sane schema has a hundred unique indexes, but
it's still something to consider.
I was under the impression, based on previous feedback, that what
I've done with LWLocks was unlikely to be accepted. I proceeded under
the assumption that we'll be able to ameliorate these problems, as for
example by implementing an alternative value locking mechanism (an
SLRU?) that is similar to what I've done to date (in particular, very
cheap and fast), but without all the down-sides that concerned Robert
and Andres, and now you. As I said, I still think that's easier and
safer than all alternative approaches described to date. It just so
happens that I also believe it will perform a lot better in the
average case too, but that isn't a key advantage to my mind.
You're right that the value locking is scary. I think we need to very
carefully consider it, once I have buy-in on the basic approach. I
really do think it's the least-worst approach described to date. It
isn't like we can't discuss making it inherently less scary, but I
hesitate to do that now, given that I don't know if that discussion
will go anywhere.
Thanks for your efforts on reviewing my work here! Do you think it
would be useful at this juncture to write a patch to make the order of
locking across unique indexes well-defined? I think it may well have
independent value to get the insertion into unique indexes (that can
throw errors) out of the way when doing a regular slot insertion.
Better to abort the transaction as soon as possible.
[1]: /messages/by-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9M7_g@mail.gmail.com
--
Peter Geoghegan
On Mon, Nov 18, 2013 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
You're right that the value locking is scary. I think we need to very
carefully consider it, once I have buy-in on the basic approach. I
really do think it's the least-worst approach described to date. It
isn't like we can't discuss making it inherently less scary, but I
hesitate to do that now, given that I don't know if that discussion
will go anywhere.
One possible compromise would be "promise tuples" where we know we'll
be able to keep our promise. In other words:
1. We lock values in the first phase, in more or less the manner of
the extant patch.
2. When a consensus exists that heap tuple insertion proceeds, we
proceed with insertion of these promise index tuples (and probably
keep just a pin on the relevant pages).
3. Proceed with insertion of the heap tuple (with no "value locks" of
any kind held).
4. Go back to the unique indexes, update the heap tid and unset the
index tuple flag (that indicates that the tuples are in this promise
state). Probably we can even be bright about re-finding the existing
promise tuples with their proper heap tid (e.g. maybe we can avoid
doing a regular index scan at least some of the time - chances are
pretty good that the index tuple is on the same page as before, so
it's generally well worth a shot looking there first). As with the
earlier promise tuple proposals, we store our xid in the ItemPointer.
5. Finally, insertion of non-unique index tuples occurs in the regular manner.
Obviously the big advantage here is that we don't have to worry about
value locking across heap tuple insertion at all, and yet we don't
have to worry about bloating, because we really do know that insertion
proper will proceed when inserting *this* type of promise index tuple.
Maybe that even makes it okay to just use buffer locks, if we think
some more about the other edge cases. Regular index scans take the
aforementioned flag as a kind of visibility hint, perhaps, so we don't
have to worry about them. And VACUUM would kill any dead promise
tuples - this would be much less of a concern than with the earlier
promise tuple proposals, because it is extremely non routine. Maybe
it's fine to not make autovacuum concerned about a whole new class of
(index-only) bloat, which seemed like a big problem with those earlier
proposals, simply because crashes within this tiny window are
hopefully so rare that it couldn't possibly amount to much bloat in
the grand scheme of things (at least before a routine VACUUM - UPDATEs
tend to necessitate those). If you have 50 upserting backends in this
tiny window during a crash, that would be only 50 dead index tuples.
Given the window is so tiny, I doubt it would be much of a problem at
all - even 50 seems like a very high number. The track_counts counts
that drive autovacuum here are already not crash safe, so I see no
regression.
Now, you still have to value lock across multiple btree unique
indexes, and I understand there are reservations about this. But the
surface area is made significantly smaller at reasonably low cost.
Furthermore, doing TOASTing out-of-line and so on ceases to be
necessary.
The LOCK FOR UPDATE case is the same as before. Nothing else changes.
FWIW, without presuming anything about value locking implementation,
I'm not too worried about making the implementation scale to very
large numbers of unique indexes, with very low shared_buffer settings.
We already have a fairly similar situation with
max_locks_per_transaction and so on, no?
--
Peter Geoghegan
On 19.11.2013 02:37, Peter Geoghegan wrote:
On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
* It should be usable and perform well for both large batch updates and
small transactions.
I think that that's a secondary goal, a question to be considered but
perhaps deferred during this initial effort. I agree that it certainly
is important.
Ok. Which use case are you targeting during this initial effort, batch
updates or small OLTP transactions?
Anything else I'm missing?
I think so, yes. I'll add:
* Should not deadlock unreasonably.
If the UPDATE case is to work and perform almost as well as a regular
UPDATE, that must mean that it has essentially the same
characteristics as plain UPDATE. In particular, I feel fairly strongly
that it is not okay for upserts to deadlock with each other unless the
possibility of each transaction locking multiple rows (in an
inconsistent order) exists.
Agreed.
What about exclusion constraints? I'd like to see this work for them as
well. Currently, exclusion constraints are checked after the tuple is
inserted, and you abort if the constraint was violated. We could still
insert the heap and index tuples first, but instead of aborting on
violation, we would kill the heap tuple we already inserted and retry. There
are some complications there, like how to wake up any other backends that
are waiting to grab a lock on the tuple we just killed, but it seems doable.
I agree that it's at least doable.
That would, however, perform badly and leave garbage behind if there are
duplicates. A refinement of that would be to first check for constraint
violations, then insert the tuple, and then check again. That would avoid
the garbage in most cases, but would perform much more poorly when there are
no duplicates, because it needs two index scans for every insertion. A
further refinement would be to keep track of how many duplicates there have
been recently, and switch between the two strategies based on that.
Seems like an awful lot of additional mechanism.
Not really. Once you have the code in place to do the
kill-inserted-tuple dance on a conflict, all you need is to do an extra
index search before it. And once you have that, it's not hard to add
some kind of a heuristic to either do the pre-check or skip it.
That cost of doing two scans could be alleviated by using markpos/restrpos
to do the second scan. That is presumably cheaper than starting a whole new
scan with the same key. (markpos/restrpos don't currently work for non-MVCC
snapshots, so that'd need to be fixed, though)
Well, it seems like we could already use a "pick up where you left
off" mechanism in the case of regular btree index tuple insertions
into unique indexes -- after all, we don't do that in the event of
blocking pending the outcome of the other transaction (that inserted a
duplicate that we need to conclusively know has or has not committed)
today. The fact that this doesn't already exist leaves me less than
optimistic about the prospect of making it work to facilitate a scheme
such as the one you describe here. (Today we still need to catch a
committed version of the tuple that would make our tuple a duplicate
from a fresh index scan, only *after* waiting for a transaction to
commit/abort at the end of our original index scan). So we're already
pretty naive about this, even though it would pay to not be.
We just haven't bothered to optimize for the case that you have to wait.
That's going to be slow anyway. Also, after sleeping, the insertion
position might've moved right a lot, if a lot of insertions happened
during the sleep, so it might be best to do a new scan anyway.
Making something like markpos work for the purposes of an upsert
implementation seems not only hard, but also like a possible
modularity violation. Are we not unreasonably constraining the
implementation going forward? My patch respects the integrity of the
am abstraction, and doesn't really add any knowledge to the core
system about how amcanunique index methods might go about implementing
the new "amlock" method. The core system worries a bit about the "low
level locks" (as it naively refers to value locks), and doesn't
consider that it has the right to hold on to them for more than an
instant, but that's about it. Plus we don't have to worry about
whether something does or does not work for a certain snapshot type
with my approach, because as with the current unique index btree
coding, it operates at a lower level than that, and does not need to
consider visibility as such.
The markpos and restrpos am methods are only called for regular index
(only) scans, that don't need to worry about things that are not
visible. Of course, upsert needs to worry about
invisible-but-conclusively-live things. This seems much harder, and
basically implies value locking of some kind, if I'm not mistaken. So
have you really gained anything?
I probably shouldn't have mentioned markpos/restrpos, you're right that
it's not a good idea to conflate that with index insertion.
Nevertheless, some kind of an API for doing a duplicate-key check prior
to insertion, and remembering the location for the actual insert later,
seems sensible. It's certainly no more of a modularity violation than
the value-locking scheme you're proposing.
What I'm thinking is a new indexam function, let's call it "pre-insert".
The pre-insert function checks for any possible unique key violations,
just like insertion, but doesn't modify the index. Also, as an
optimization, it can remember the position where the insertion will go
to later, and return an opaque token to represent that. That token can
be passed to the insert-function later, which can use it to quickly
re-find the insert position. In other words, very similar to the
index_lock function you're proposing, but it doesn't keep the page locked.
And that detour with exclusion constraints takes me back to the current
patch :-). What if you implemented the unique check in a similar fashion too
(when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key,
and mark the position. Then do the insertion to that position. If the
insertion fails because of a duplicate key (which was inserted after we did
the first scan), mark the heap tuple as dead, and start over. The indexam
changes would be quite similar to the changes you made in your patch, but
instead of keeping the page locked, you'd only hold a pin on the target page
(if even that). The first indexam call would check that the key doesn't
exist, and remember the insert position. The second call would re-find the
previous position, and insert the tuple, checking again that there really
wasn't a duplicate key violation. The locking aspects would be less scary
than your current patch.
I'm not sure if that would perform as well as your current patch. I must
admit your current approach is pretty optimal performance-wise. But I'd like
to see it, and that would be a solution for exclusion constraints in any
case.
I'm certainly not opposed to making something like this work for
exclusion constraints. Certainly, I want this to be as general as
possible. But I don't think that it needs to be a blocker, and I don't
think we gain anything in code footprint by addressing that by being
as general as possible in our approach to the basic concurrency issue.
After all, we're going to have to repeat the basic pattern in multiple
modules.
Well, I don't know what to say. I *do* have a hunch that we'd gain much
in code footprint by making this general. I don't understand what
pattern you'd need to repeat in multiple modules.
Here's a patch, implementing a rough version of the scheme I'm trying to
explain. It's not as polished as yours, but it ought to be enough to
evaluate the code footprint and performance. It doesn't make any changes
to the indexam API, and it works the same with exclusion constraints and
unique constraints. As it stands, it doesn't leave bloat behind, except
when a concurrent insertion with a conflicting key happens between the
first "pre-check" and the actual insertion. That should be rare in practice.
What have you been using to performance test this?
With exclusion constraints, we'd have to worry about a single slot
proposed for insertion violating (and therefore presumably obliging us
to lock) every row in the table. Are we going to have a mechanism for
spilling a tid array potentially sized in gigabytes to disk (relating
to just one slot proposed for insertion)? Is it principled to have
that one slot project out rejects consisting of (say) the entire
table? Is it even useful to lock multiple rows if we can't really
update them, because they'll overlap each other when all updated with
the one value?
Hmm. I think what you're referring to is the case where you try to
insert a row so that it violates an exclusion constraint, and in a way
that it conflicts with a large number of existing tuples. For example,
if you have a calendar application with a constraint that two
reservations must not overlap, and you try to insert a new reservation
that covers, say, a whole decade.
That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the
conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we
would need to handle a large TID array. Or maybe we can arrange it so
that the tuples are locked as we scan them, without having to collect
them all in a large array.
(the attached patch only locks the first existing tuple that conflicts;
that needs to be fixed)
RETURNING REJECTS is not an issue here, as that just returns the
rejected rows we were about to insert, not the existing rows in the table.
- Heikki
Attachments:
insert_on_dup-kill-on-conflict-1.patch (text/x-diff)
*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
***************
*** 1418,1423 **** JumbleQuery(pgssJumbleState *jstate, Query *query)
--- 1418,1424 ----
JumbleRangeTable(jstate, query->rtable);
JumbleExpr(jstate, (Node *) query->jointree);
JumbleExpr(jstate, (Node *) query->targetList);
+ APP_JUMB(query->specClause);
JumbleExpr(jstate, (Node *) query->returningList);
JumbleExpr(jstate, (Node *) query->groupClause);
JumbleExpr(jstate, query->havingQual);
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 2541,2551 **** compute_infobits(uint16 infomask, uint16 infomask2)
* (the last only for HeapTupleSelfUpdated, since we
* cannot obtain cmax from a combocid generated by another transaction).
* See comments for struct HeapUpdateFailureData for additional info.
*/
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2541,2555 ----
* (the last only for HeapTupleSelfUpdated, since we
* cannot obtain cmax from a combocid generated by another transaction).
* See comments for struct HeapUpdateFailureData for additional info.
+ *
+ * If 'kill' is true, we're killing a tuple we just inserted in the same
+ * command. Instead of the normal visibility checks, we check that the tuple
+ * was inserted by the current transaction and given command id.
*/
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd, bool kill)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***************
*** 2601,2607 **** heap_delete(Relation relation, ItemPointer tid,
tp.t_self = *tid;
l1:
! result = HeapTupleSatisfiesUpdate(&tp, cid, buffer);
if (result == HeapTupleInvisible)
{
--- 2605,2620 ----
tp.t_self = *tid;
l1:
! if (!kill)
! result = HeapTupleSatisfiesUpdate(&tp, cid, buffer);
! else
! {
! if (tp.t_data->t_choice.t_heap.t_xmin != xid ||
! tp.t_data->t_choice.t_heap.t_field3.t_cid != cid)
! elog(ERROR, "attempted to kill a tuple inserted by another transaction or command");
! result = HeapTupleMayBeUpdated;
! }
!
if (result == HeapTupleInvisible)
{
***************
*** 2870,2876 **** simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
! &hufd);
switch (result)
{
case HeapTupleSelfUpdated:
--- 2883,2889 ----
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
! &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 1644,1651 **** BuildIndexInfo(Relation index)
ii->ii_ExclusionStrats = NULL;
}
/* other info */
- ii->ii_Unique = indexStruct->indisunique;
ii->ii_ReadyForInserts = IndexIsReady(indexStruct);
/* initialize index-build state to default */
--- 1644,1692 ----
ii->ii_ExclusionStrats = NULL;
}
+ /*
+ * fetch info for checking unique constraints. (this is currently only
+ * used by ExecCheckIndexConstraints(), for INSERT ... ON DUPLICATE KEY.
+ * In regular insertions, the index AM handles the unique check itself.
+ * Might make sense to do this lazily, only when needed)
+ */
+ if (indexStruct->indisunique)
+ {
+ int ncols = index->rd_rel->relnatts;
+
+ if (index->rd_rel->relam != BTREE_AM_OID)
+ elog(ERROR, "only b-tree indexes are supported for foreign keys");
+
+ ii->ii_UniqueOps = (Oid *) palloc(sizeof(Oid) * ncols);
+ ii->ii_UniqueProcs = (Oid *) palloc(sizeof(Oid) * ncols);
+ ii->ii_UniqueStrats = (uint16 *) palloc(sizeof(uint16) * ncols);
+
+ /*
+ * We have to look up the operator's strategy number. This
+ * provides a cross-check that the operator does match the index.
+ */
+ /* We need the func OIDs and strategy numbers too */
+ for (i = 0; i < ncols; i++)
+ {
+ ii->ii_UniqueStrats[i] = BTEqualStrategyNumber;
+ ii->ii_UniqueOps[i] =
+ get_opfamily_member(index->rd_opfamily[i],
+ index->rd_opcintype[i],
+ index->rd_opcintype[i],
+ ii->ii_UniqueStrats[i]);
+ ii->ii_UniqueProcs[i] = get_opcode(ii->ii_UniqueOps[i]);
+ }
+ ii->ii_Unique = true;
+ }
+ else
+ {
+ ii->ii_UniqueOps = NULL;
+ ii->ii_UniqueProcs = NULL;
+ ii->ii_UniqueStrats = NULL;
+ ii->ii_Unique = false;
+ }
+
/* other info */
ii->ii_ReadyForInserts = IndexIsReady(indexStruct);
/* initialize index-build state to default */
***************
*** 2566,2575 **** IndexCheckExclusion(Relation heapRelation,
/*
* Check that this tuple has no conflicts.
*/
! check_exclusion_constraint(heapRelation,
indexRelation, indexInfo,
&(heapTuple->t_self), values, isnull,
! estate, true, false);
}
heap_endscan(scan);
--- 2607,2616 ----
/*
* Check that this tuple has no conflicts.
*/
! check_exclusion_or_unique_constraint(heapRelation,
indexRelation, indexInfo,
&(heapTuple->t_self), values, isnull,
! estate, true, false, true, NULL);
}
heap_endscan(scan);
*** a/src/backend/commands/constraint.c
--- b/src/backend/commands/constraint.c
***************
*** 170,178 **** unique_key_recheck(PG_FUNCTION_ARGS)
* For exclusion constraints we just do the normal check, but now it's
* okay to throw error.
*/
! check_exclusion_constraint(trigdata->tg_relation, indexRel, indexInfo,
&(new_row->t_self), values, isnull,
! estate, false, false);
}
/*
--- 170,178 ----
* For exclusion constraints we just do the normal check, but now it's
* okay to throw error.
*/
! check_exclusion_or_unique_constraint(trigdata->tg_relation, indexRel, indexInfo,
&(new_row->t_self), values, isnull,
! estate, false, false, true, NULL);
}
/*
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 2284,2290 **** CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate);
/* AFTER ROW INSERT Triggers */
ExecARInsertTriggers(estate, resultRelInfo, tuple,
--- 2284,2290 ----
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate, false);
/* AFTER ROW INSERT Triggers */
ExecARInsertTriggers(estate, resultRelInfo, tuple,
***************
*** 2391,2397 **** CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
! estate);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
recheckIndexes);
--- 2391,2397 ----
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
! estate, false);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
recheckIndexes);
*** a/src/backend/executor/execUtils.c
--- b/src/backend/executor/execUtils.c
***************
*** 990,996 **** ExecCloseIndices(ResultRelInfo *resultRelInfo)
*
* This returns a list of index OIDs for any unique or exclusion
* constraints that are deferred and that had
! * potential (unconfirmed) conflicts.
*
* CAUTION: this must not be called for a HOT update.
* We can't defend against that here for lack of info.
--- 990,997 ----
*
* This returns a list of index OIDs for any unique or exclusion
* constraints that are deferred and that had
! * potential (unconfirmed) conflicts. (if noErrorOnDuplicate == true,
! * the same is done for non-deferred constraints)
*
* CAUTION: this must not be called for a HOT update.
* We can't defend against that here for lack of info.
***************
*** 1000,1006 **** ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
! EState *estate)
{
List *result = NIL;
ResultRelInfo *resultRelInfo;
--- 1001,1008 ----
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
! EState *estate,
! bool noErrorOnDuplicate)
{
List *result = NIL;
ResultRelInfo *resultRelInfo;
***************
*** 1092,1100 **** ExecInsertIndexTuples(TupleTableSlot *slot,
--- 1094,1107 ----
* For a deferrable unique index, we tell the index AM to just detect
* possible non-uniqueness, and we add the index OID to the result
* list if further checking is needed.
+ *
+ * For a IGNORE/REJECT DUPLICATES insertion, just detect possible
+ * non-uniqueness, and tell the caller if it failed.
*/
if (!indexRelation->rd_index->indisunique)
checkUnique = UNIQUE_CHECK_NO;
+ else if (noErrorOnDuplicate)
+ checkUnique = UNIQUE_CHECK_PARTIAL;
else if (indexRelation->rd_index->indimmediate)
checkUnique = UNIQUE_CHECK_YES;
else
***************
*** 1121,1133 **** ExecInsertIndexTuples(TupleTableSlot *slot,
*/
if (indexInfo->ii_ExclusionOps != NULL)
{
! bool errorOK = !indexRelation->rd_index->indimmediate;
satisfiesConstraint =
! check_exclusion_constraint(heapRelation,
indexRelation, indexInfo,
tupleid, values, isnull,
! estate, false, errorOK);
}
if ((checkUnique == UNIQUE_CHECK_PARTIAL ||
--- 1128,1142 ----
*/
if (indexInfo->ii_ExclusionOps != NULL)
{
! bool errorOK = (!indexRelation->rd_index->indimmediate &&
! !noErrorOnDuplicate);
satisfiesConstraint =
! check_exclusion_or_unique_constraint(heapRelation,
indexRelation, indexInfo,
tupleid, values, isnull,
! estate, false, errorOK, false,
! NULL);
}
if ((checkUnique == UNIQUE_CHECK_PARTIAL ||
***************
*** 1146,1163 **** ExecInsertIndexTuples(TupleTableSlot *slot,
return result;
}
/*
! * Check for violation of an exclusion constraint
*
* heap: the table containing the new tuple
* index: the index supporting the exclusion constraint
* indexInfo: info about the index, including the exclusion properties
! * tupleid: heap TID of the new tuple we have just inserted
* values, isnull: the *index* column values computed for the new tuple
* estate: an EState we can do evaluation in
* newIndex: if true, we are trying to build a new index (this affects
* only the wording of error messages)
* errorOK: if true, don't throw error for violation
*
* Returns true if OK, false if actual or potential violation
*
--- 1155,1294 ----
return result;
}
+ /* ----------------------------------------------------------------
+ * ExecCheckIndexConstraints
+ *
+ * This routine checks if a tuple violates any unique or
+ * exclusion constraints. If no conflict, returns true.
+ * Otherwise returns false, and the TID of the conflicting
+ * tuple is returned in *conflictTid
+ *
+ *
+ * Note that this doesn't lock the values in any way, so it's
+ * possible that a conflicting tuple is inserted immediately
+ * after this returns, and a later insert with the same values
+ * still conflicts. But this can be used for a pre-check before
+ * insertion.
+ * ----------------------------------------------------------------
+ */
+ bool
+ ExecCheckIndexConstraints(TupleTableSlot *slot,
+ EState *estate, ItemPointer conflictTid)
+ {
+ ResultRelInfo *resultRelInfo;
+ int i;
+ int numIndices;
+ RelationPtr relationDescs;
+ Relation heapRelation;
+ IndexInfo **indexInfoArray;
+ ExprContext *econtext;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData invalidItemPtr;
+
+ ItemPointerSetInvalid(&invalidItemPtr);
+
+
+ /*
+ * Get information from the result relation info structure.
+ */
+ resultRelInfo = estate->es_result_relation_info;
+ numIndices = resultRelInfo->ri_NumIndices;
+ relationDescs = resultRelInfo->ri_IndexRelationDescs;
+ indexInfoArray = resultRelInfo->ri_IndexRelationInfo;
+ heapRelation = resultRelInfo->ri_RelationDesc;
+
+ /*
+ * We will use the EState's per-tuple context for evaluating predicates
+ * and index expressions (creating it if it's not already there).
+ */
+ econtext = GetPerTupleExprContext(estate);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /*
+ * for each index, form and insert the index tuple
+ */
+ for (i = 0; i < numIndices; i++)
+ {
+ Relation indexRelation = relationDescs[i];
+ IndexInfo *indexInfo;
+ bool satisfiesConstraint;
+
+ if (indexRelation == NULL)
+ continue;
+
+ indexInfo = indexInfoArray[i];
+
+ if (!indexInfo->ii_Unique && !indexInfo->ii_ExclusionOps)
+ continue;
+
+ /* If the index is marked as read-only, ignore it */
+ if (!indexInfo->ii_ReadyForInserts)
+ continue;
+
+ /* Check for partial index */
+ if (indexInfo->ii_Predicate != NIL)
+ {
+ List *predicate;
+
+ /*
+ * If predicate state not set up yet, create it (in the estate's
+ * per-query context)
+ */
+ predicate = indexInfo->ii_PredicateState;
+ if (predicate == NIL)
+ {
+ predicate = (List *)
+ ExecPrepareExpr((Expr *) indexInfo->ii_Predicate,
+ estate);
+ indexInfo->ii_PredicateState = predicate;
+ }
+
+ /* Skip this index-update if the predicate isn't satisfied */
+ if (!ExecQual(predicate, econtext, false))
+ continue;
+ }
+
+ /*
+ * FormIndexDatum fills in its values and isnull parameters with the
+ * appropriate values for the column(s) of the index.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ satisfiesConstraint =
+ check_exclusion_or_unique_constraint(heapRelation,
+ indexRelation, indexInfo,
+ &invalidItemPtr, values, isnull,
+ estate, false, true, true,
+ conflictTid);
+ if (!satisfiesConstraint)
+ return false;
+ }
+
+ return true;
+ }
+
/*
! * Check for violation of an exclusion or unique constraint
*
* heap: the table containing the new tuple
* index: the index supporting the exclusion constraint
* indexInfo: info about the index, including the exclusion properties
! * tupleid: heap TID of the new tuple we have just inserted (invalid if we
! * haven't inserted a new tuple yet)
* values, isnull: the *index* column values computed for the new tuple
* estate: an EState we can do evaluation in
* newIndex: if true, we are trying to build a new index (this affects
* only the wording of error messages)
* errorOK: if true, don't throw error for violation
+ * wait: if true, wait for conflicting transaction to finish, even if !errorOK
+ * conflictTid: if not-NULL, the TID of conflicting tuple is returned here.
*
* Returns true if OK, false if actual or potential violation
*
***************
*** 1169,1182 **** ExecInsertIndexTuples(TupleTableSlot *slot,
*
* When errorOK is false, we'll throw error on violation, so a false result
* is impossible.
*/
bool
! check_exclusion_constraint(Relation heap, Relation index, IndexInfo *indexInfo,
ItemPointer tupleid, Datum *values, bool *isnull,
! EState *estate, bool newIndex, bool errorOK)
{
! Oid *constr_procs = indexInfo->ii_ExclusionProcs;
! uint16 *constr_strats = indexInfo->ii_ExclusionStrats;
Oid *index_collations = index->rd_indcollation;
int index_natts = index->rd_index->indnatts;
IndexScanDesc index_scan;
--- 1300,1319 ----
*
* When errorOK is false, we'll throw error on violation, so a false result
* is impossible.
+ *
+ * Note: The indexam is normally responsible for checking unique constraints,
+ * so this normally only needs to be used for exclusion constraints. But it
+ * is also used for the "pre-check" for conflicts with "INSERT ... ON DUPLICATE
+ * KEY", before the actual tuple is inserted.
*/
bool
! check_exclusion_or_unique_constraint(Relation heap, Relation index, IndexInfo *indexInfo,
ItemPointer tupleid, Datum *values, bool *isnull,
! EState *estate, bool newIndex,
! bool errorOK, bool wait, ItemPointer conflictTid)
{
! Oid *constr_procs;
! uint16 *constr_strats;
Oid *index_collations = index->rd_indcollation;
int index_natts = index->rd_index->indnatts;
IndexScanDesc index_scan;
***************
*** 1190,1195 **** check_exclusion_constraint(Relation heap, Relation index, IndexInfo *indexInfo,
--- 1327,1343 ----
TupleTableSlot *existing_slot;
TupleTableSlot *save_scantuple;
+ if (indexInfo->ii_ExclusionOps)
+ {
+ constr_procs = indexInfo->ii_ExclusionProcs;
+ constr_strats = indexInfo->ii_ExclusionStrats;
+ }
+ else
+ {
+ constr_procs = indexInfo->ii_UniqueProcs;
+ constr_strats = indexInfo->ii_UniqueStrats;
+ }
+
/*
* If any of the input values are NULL, the constraint check is assumed to
* pass (i.e., we assume the operators are strict).
***************
*** 1253,1259 **** retry:
/*
* Ignore the entry for the tuple we're trying to check.
*/
! if (ItemPointerEquals(tupleid, &tup->t_self))
{
if (found_self) /* should not happen */
elog(ERROR, "found self tuple multiple times in index \"%s\"",
--- 1401,1408 ----
/*
* Ignore the entry for the tuple we're trying to check.
*/
! if (ItemPointerIsValid(tupleid) &&
! ItemPointerEquals(tupleid, &tup->t_self))
{
if (found_self) /* should not happen */
elog(ERROR, "found self tuple multiple times in index \"%s\"",
***************
*** 1287,1295 **** retry:
* we're not supposed to raise error, just return the fact of the
* potential conflict without waiting to see if it's real.
*/
! if (errorOK)
{
conflict = true;
break;
}
--- 1436,1446 ----
* we're not supposed to raise error, just return the fact of the
* potential conflict without waiting to see if it's real.
*/
! if (errorOK && !wait)
{
conflict = true;
+ if (conflictTid)
+ *conflictTid = tup->t_self;
break;
}
***************
*** 1314,1319 **** retry:
--- 1465,1478 ----
/*
* We have a definite conflict. Report it.
*/
+ if (errorOK)
+ {
+ conflict = true;
+ if (conflictTid)
+ *conflictTid = tup->t_self;
+ break;
+ }
+
error_new = BuildIndexValueDescription(index, values, isnull);
error_existing = BuildIndexValueDescription(index, existing_values,
existing_isnull);
***************
*** 1345,1350 **** retry:
--- 1504,1512 ----
* However, it is possible to define exclusion constraints for which that
* wouldn't be true --- for instance, if the operator is <>. So we no
* longer complain if found_self is still false.
+ *
+ * It would also not be true in the pre-check mode, when we haven't
+ * inserted a tuple yet.
*/
econtext->ecxt_scantuple = save_scantuple;
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
***************
*** 39,44 ****
--- 39,45 ----
#include "access/htup_details.h"
#include "access/xact.h"
+ #include "catalog/catalog.h"
#include "commands/trigger.h"
#include "executor/executor.h"
#include "executor/nodeModifyTable.h"
***************
*** 152,157 **** ExecProcessReturning(ProjectionInfo *projectReturning,
--- 153,258 ----
}
/* ----------------------------------------------------------------
+ * ExecLockHeapTupleForUpdateSpec: Try to lock tuple for update as part of
+ * speculative insertion.
+ *
+ * Returns true if we're done with heap tuple locking, or false if
+ * another attempt at value locking is required.
+ * ----------------------------------------------------------------
+ */
+ static bool
+ ExecLockHeapTupleForUpdateSpec(EState *estate,
+ ResultRelInfo *relinfo,
+ ItemPointer tid)
+ {
+ Relation relation = relinfo->ri_RelationDesc;
+ HeapTupleData tuple;
+ Buffer buffer;
+
+ HTSU_Result test;
+ HeapUpdateFailureData hufd;
+
+ Assert(ItemPointerIsValid(tid));
+
+ /*
+ * Lock tuple for update.
+ *
+ * Wait for other transaction to complete.
+ */
+ tuple.t_self = *tid;
+ test = heap_lock_tuple(relation, &tuple,
+ estate->es_output_cid,
+ LockTupleExclusive, false,
+ true, &buffer, &hufd);
+ ReleaseBuffer(buffer);
+
+ switch (test)
+ {
+ case HeapTupleSelfUpdated:
+ /*
+ * The target tuple was already updated or deleted by the current
+ * command, or by a later command in the current transaction. We
+ * conclude that we're done in the former case, and throw an error
+ * in the latter case, for the same reasons enumerated in
+ * ExecUpdate and ExecDelete.
+ */
+ if (hufd.cmax != estate->es_output_cid)
+ ereport(ERROR,
+ (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
+ errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
+ errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
+
+ /*
+ * The fact that this command has already updated or deleted the
+ * tuple is grounds for concluding that we're done. Appropriate
+ * locks will already be held. It isn't the responsibility of the
+ * speculative insertion LOCK FOR UPDATE infrastructure to ensure
+ * an atomic INSERT-or-UPDATE in the event of a tuple being updated
+ * or deleted by the same xact in the interim.
+ */
+ return true;
+ case HeapTupleMayBeUpdated:
+ /*
+ * Success -- we're done, as tuple is locked and known to be
+ * visible to our snapshot under conventional MVCC rules if the
+ * current isolation level mandates that (in READ COMMITTED mode, a
+ * special exception to the conventional rules applies).
+ */
+ return true;
+ case HeapTupleUpdated:
+ if (IsolationUsesXactSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("could not serialize access due to concurrent update")));
+ /*
+ * Tell caller to try again from the very start. We don't use the
+ * usual EvalPlanQual looping pattern here, fundamentally because
+ * we don't have a useful qual to verify the next tuple with.
+ *
+ * We might devise a means of verifying, by way of binary equality
+ * in a similar manner to HOT codepaths, if any unique indexed
+ * columns changed, but this would only serve to ameliorate the
+ * fundamental problem. It might well not be good enough, because
+ * those columns could change too. It's not clear that doing any
+ * better here would be worth it.
+ *
+ * At this point, all bets are off -- it might actually turn out to
+ * be okay to proceed with insertion instead of locking now (the
+ * tuple we attempted to lock could have been deleted, for
+ * example). On the other hand, it might not be okay, but for an
+ * entirely different reason, with an entirely separate TID to
+ * blame and lock. This TID may not even be part of the same
+ * update chain.
+ */
+ return false;
+ default:
+ elog(ERROR, "unrecognized heap_lock_tuple status: %u", test);
+ }
+
+ return false;
+ }
+
+ /* ----------------------------------------------------------------
* ExecInsert
*
* For INSERT, we have to insert the tuple into the target relation
***************
*** 164,176 **** static TupleTableSlot *
ExecInsert(TupleTableSlot *slot,
TupleTableSlot *planSlot,
EState *estate,
! bool canSetTag)
{
HeapTuple tuple;
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
Oid newId;
List *recheckIndexes = NIL;
/*
* get the heap tuple out of the tuple table slot, making sure we have a
--- 265,281 ----
ExecInsert(TupleTableSlot *slot,
TupleTableSlot *planSlot,
EState *estate,
! bool canSetTag,
! SpecType spec)
{
HeapTuple tuple;
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
Oid newId;
List *recheckIndexes = NIL;
+ ProjectionInfo *returning;
+ bool rejects = (spec == SPEC_IGNORE_REJECTS ||
+ spec == SPEC_UPDATE_REJECTS);
/*
* get the heap tuple out of the tuple table slot, making sure we have a
***************
*** 183,188 **** ExecInsert(TupleTableSlot *slot,
--- 288,320 ----
*/
resultRelInfo = estate->es_result_relation_info;
resultRelationDesc = resultRelInfo->ri_RelationDesc;
+ returning = resultRelInfo->ri_projectReturning;
+
+ /*
+ * If speculative insertion is requested, take necessary precautions.
+ *
+ * Value locks are typically actually implemented by AMs as shared locks on
+ * buffers. This could be quite hazardous, because in the worst case those
+ * locks could be on catalog indexes, with the system then liable to
+ * deadlock due to innocent catalog access when inserting a heap tuple.
+ * However, we take a precaution against that here.
+ *
+ * Rather than forever committing to carefully managing these hazards
+ * during the extended (but still short) window after locking in which heap
+ * tuple insertion will potentially later take place (a window that ends,
+ * in the "insertion proceeds" case, when locks are released by the second
+ * phase of speculative insertion having completed for unique indexes), it
+ * is expedient to simply forbid speculative insertion into catalogs
+ * altogether. We also forbid speculative insertion into TOAST tables;
+ * allowing it would have no real consequence, but it doesn't seem
+ * terribly useful either.
+ */
+ if (spec != SPEC_NONE &&
+ IsSystemRelation(resultRelationDesc))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("speculative insertion into catalogs and TOAST tables not supported"),
+ errtable(resultRelationDesc)));
/*
* If the result relation has OIDs, force the tuple's OID to zero so that
***************
*** 246,251 **** ExecInsert(TupleTableSlot *slot,
--- 378,386 ----
}
else
{
+ bool conflicted = false;
+ ItemPointerData conflictTid;
+
/*
* Constraints might reference the tableoid column, so initialize
* t_tableOid before evaluating them.
***************
*** 258,278 **** ExecInsert(TupleTableSlot *slot,
if (resultRelationDesc->rd_att->constr)
ExecConstraints(resultRelInfo, slot, estate);
/*
! * insert the tuple
*
! * Note: heap_insert returns the tid (location) of the new tuple in
! * the t_self field.
*/
! newId = heap_insert(resultRelationDesc, tuple,
! estate->es_output_cid, 0, NULL);
! /*
! * insert index entries for tuple
! */
! if (resultRelInfo->ri_NumIndices > 0)
! recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate);
}
if (canSetTag)
--- 393,491 ----
if (resultRelationDesc->rd_att->constr)
ExecConstraints(resultRelInfo, slot, estate);
+ retry:
+ ItemPointerSetInvalid(&conflictTid);
+
/*
! * If we are expecting duplicates, do a non-conclusive first check.
! * We might still fail later, after inserting the heap tuple, if a
! * conflicting row was inserted concurrently. We'll handle that by
! * deleting the already-inserted tuple and retrying, but that's fairly
! * expensive, so we try to avoid it.
*
! * XXX: If we know or assume that there are few duplicates, it would
! * be better to skip this, and just optimistically proceed with the
! * insertion below. You would then leave behind some garbage when a
! * conflict happens, but if it's rare, it doesn't matter much. Some
! * kind of heuristic might be in order here, like stop doing these
! * pre-checks if the last 100 insertions have not been duplicates.
*/
! if (spec != SPEC_NONE && resultRelInfo->ri_NumIndices > 0)
! {
! if (!ExecCheckIndexConstraints(slot, estate, &conflictTid))
! conflicted = true;
! }
! if (!conflicted)
! {
! /*
! * insert the tuple
! *
! * Note: heap_insert returns the tid (location) of the new tuple in
! * the t_self field.
! */
! newId = heap_insert(resultRelationDesc, tuple,
! estate->es_output_cid, 0, NULL);
!
! /*
! * Insert index entries for tuple.
! *
! * Locks will be acquired if needed, or the locks acquired by
! * ExecLockIndexTuples() may be used instead.
! */
! if (resultRelInfo->ri_NumIndices > 0)
! recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate,
! spec != SPEC_NONE);
!
! if (spec != SPEC_NONE && recheckIndexes)
! {
! HeapUpdateFailureData hufd;
! heap_delete(resultRelationDesc, &(tuple->t_self),
! estate->es_output_cid, NULL, false, &hufd, true);
! conflicted = true;
! }
! }
!
! if (conflicted)
! {
! if (spec == SPEC_UPDATE || spec == SPEC_UPDATE_REJECTS)
! {
! /*
! * Try to lock row for update.
! *
! * XXX: We don't have the TID of the conflicting tuple if
! * the index insertion failed and we had to kill the already
! * inserted tuple. We'd need to modify the index AM to pass
! * the TID back here. So for now, we just retry, and hopefully
! * the new pre-check will fail on the same tuple (or it's
! * finished by now), and we'll get its TID that way.
! */
! if (!ItemPointerIsValid(&conflictTid))
! {
! elog(DEBUG1, "insertion conflicted after pre-check");
! goto retry;
! }
!
! if (!ExecLockHeapTupleForUpdateSpec(estate,
! resultRelInfo,
! &conflictTid))
! {
! /*
! * Couldn't lock row - restart from just before value
! * locking. It's subtly wrong to assume anything about
! * the row version that is under consideration for
! * locking if another transaction locked it first.
! */
! goto retry;
! }
! }
!
! if (rejects)
! return ExecProcessReturning(returning, slot, planSlot);
! else
! return NULL;
! }
}
if (canSetTag)
***************
*** 291,300 **** ExecInsert(TupleTableSlot *slot,
if (resultRelInfo->ri_WithCheckOptions != NIL)
ExecWithCheckOptions(resultRelInfo, slot, estate);
! /* Process RETURNING if present */
! if (resultRelInfo->ri_projectReturning)
! return ExecProcessReturning(resultRelInfo->ri_projectReturning,
! slot, planSlot);
return NULL;
}
--- 504,515 ----
if (resultRelInfo->ri_WithCheckOptions != NIL)
ExecWithCheckOptions(resultRelInfo, slot, estate);
! /*
! * Process RETURNING if present and not only returning speculative
! * insertion rejects
! */
! if (returning && !rejects)
! return ExecProcessReturning(returning, slot, planSlot);
return NULL;
}
***************
*** 403,409 **** ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
! &hufd);
switch (result)
{
case HeapTupleSelfUpdated:
--- 618,625 ----
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
! &hufd,
! false);
switch (result)
{
case HeapTupleSelfUpdated:
***************
*** 781,787 **** lreplace:;
*/
if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate);
}
if (canSetTag)
--- 997,1003 ----
*/
if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate, false);
}
if (canSetTag)
***************
*** 1011,1017 **** ExecModifyTable(ModifyTableState *node)
switch (operation)
{
case CMD_INSERT:
! slot = ExecInsert(slot, planSlot, estate, node->canSetTag);
break;
case CMD_UPDATE:
slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
--- 1227,1234 ----
switch (operation)
{
case CMD_INSERT:
! slot = ExecInsert(slot, planSlot, estate, node->canSetTag,
! node->spec);
break;
case CMD_UPDATE:
slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
***************
*** 1086,1091 **** ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
--- 1303,1309 ----
mtstate->resultRelInfo = estate->es_result_relations + node->resultRelIndex;
mtstate->mt_arowmarks = (List **) palloc0(sizeof(List *) * nplans);
mtstate->mt_nplans = nplans;
+ mtstate->spec = node->spec;
/* set up epqstate with dummy subplan data for the moment */
EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL, node->epqParam);
***************
*** 1296,1301 **** ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
--- 1514,1520 ----
break;
case CMD_UPDATE:
case CMD_DELETE:
+ Assert(node->spec == SPEC_NONE);
junk_filter_needed = true;
break;
default:
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 182,187 **** _copyModifyTable(const ModifyTable *from)
--- 182,188 ----
COPY_NODE_FIELD(returningLists);
COPY_NODE_FIELD(fdwPrivLists);
COPY_NODE_FIELD(rowMarks);
+ COPY_SCALAR_FIELD(spec);
COPY_SCALAR_FIELD(epqParam);
return newnode;
***************
*** 2076,2081 **** _copyWithClause(const WithClause *from)
--- 2077,2094 ----
return newnode;
}
+ static ReturningClause *
+ _copyReturningClause(const ReturningClause *from)
+ {
+ ReturningClause *newnode = makeNode(ReturningClause);
+
+ COPY_NODE_FIELD(returningList);
+ COPY_SCALAR_FIELD(rejects);
+ COPY_LOCATION_FIELD(location);
+
+ return newnode;
+ }
+
static CommonTableExpr *
_copyCommonTableExpr(const CommonTableExpr *from)
{
***************
*** 2464,2469 **** _copyQuery(const Query *from)
--- 2477,2483 ----
COPY_NODE_FIELD(jointree);
COPY_NODE_FIELD(targetList);
COPY_NODE_FIELD(withCheckOptions);
+ COPY_SCALAR_FIELD(specClause);
COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(groupClause);
COPY_NODE_FIELD(havingQual);
***************
*** 2487,2493 **** _copyInsertStmt(const InsertStmt *from)
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(cols);
COPY_NODE_FIELD(selectStmt);
! COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(withClause);
return newnode;
--- 2501,2508 ----
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(cols);
COPY_NODE_FIELD(selectStmt);
! COPY_SCALAR_FIELD(specClause);
! COPY_NODE_FIELD(rlist);
COPY_NODE_FIELD(withClause);
return newnode;
***************
*** 2501,2507 **** _copyDeleteStmt(const DeleteStmt *from)
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(usingClause);
COPY_NODE_FIELD(whereClause);
! COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(withClause);
return newnode;
--- 2516,2522 ----
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(usingClause);
COPY_NODE_FIELD(whereClause);
! COPY_NODE_FIELD(rlist);
COPY_NODE_FIELD(withClause);
return newnode;
***************
*** 2516,2522 **** _copyUpdateStmt(const UpdateStmt *from)
COPY_NODE_FIELD(targetList);
COPY_NODE_FIELD(whereClause);
COPY_NODE_FIELD(fromClause);
! COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(withClause);
return newnode;
--- 2531,2537 ----
COPY_NODE_FIELD(targetList);
COPY_NODE_FIELD(whereClause);
COPY_NODE_FIELD(fromClause);
! COPY_NODE_FIELD(rlist);
COPY_NODE_FIELD(withClause);
return newnode;
***************
*** 4565,4570 **** copyObject(const void *from)
--- 4580,4588 ----
case T_WithClause:
retval = _copyWithClause(from);
break;
+ case T_ReturningClause:
+ retval = _copyReturningClause(from);
+ break;
case T_CommonTableExpr:
retval = _copyCommonTableExpr(from);
break;
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
***************
*** 859,864 **** _equalQuery(const Query *a, const Query *b)
--- 859,865 ----
COMPARE_NODE_FIELD(jointree);
COMPARE_NODE_FIELD(targetList);
COMPARE_NODE_FIELD(withCheckOptions);
+ COMPARE_SCALAR_FIELD(specClause);
COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(groupClause);
COMPARE_NODE_FIELD(havingQual);
***************
*** 880,886 **** _equalInsertStmt(const InsertStmt *a, const InsertStmt *b)
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(cols);
COMPARE_NODE_FIELD(selectStmt);
! COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(withClause);
return true;
--- 881,888 ----
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(cols);
COMPARE_NODE_FIELD(selectStmt);
! COMPARE_SCALAR_FIELD(specClause);
! COMPARE_NODE_FIELD(rlist);
COMPARE_NODE_FIELD(withClause);
return true;
***************
*** 892,898 **** _equalDeleteStmt(const DeleteStmt *a, const DeleteStmt *b)
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(usingClause);
COMPARE_NODE_FIELD(whereClause);
! COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(withClause);
return true;
--- 894,900 ----
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(usingClause);
COMPARE_NODE_FIELD(whereClause);
! COMPARE_NODE_FIELD(rlist);
COMPARE_NODE_FIELD(withClause);
return true;
***************
*** 905,911 **** _equalUpdateStmt(const UpdateStmt *a, const UpdateStmt *b)
COMPARE_NODE_FIELD(targetList);
COMPARE_NODE_FIELD(whereClause);
COMPARE_NODE_FIELD(fromClause);
! COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(withClause);
return true;
--- 907,913 ----
COMPARE_NODE_FIELD(targetList);
COMPARE_NODE_FIELD(whereClause);
COMPARE_NODE_FIELD(fromClause);
! COMPARE_NODE_FIELD(rlist);
COMPARE_NODE_FIELD(withClause);
return true;
***************
*** 2331,2336 **** _equalWithClause(const WithClause *a, const WithClause *b)
--- 2333,2348 ----
}
static bool
+ _equalReturningClause(const ReturningClause *a, const ReturningClause *b)
+ {
+ COMPARE_NODE_FIELD(returningList);
+ COMPARE_SCALAR_FIELD(rejects);
+ COMPARE_LOCATION_FIELD(location);
+
+ return true;
+ }
+
+ static bool
_equalCommonTableExpr(const CommonTableExpr *a, const CommonTableExpr *b)
{
COMPARE_STRING_FIELD(ctename);
***************
*** 3033,3038 **** equal(const void *a, const void *b)
--- 3045,3053 ----
case T_WithClause:
retval = _equalWithClause(a, b);
break;
+ case T_ReturningClause:
+ retval = _equalReturningClause(a, b);
+ break;
case T_CommonTableExpr:
retval = _equalCommonTableExpr(a, b);
break;
*** a/src/backend/nodes/nodeFuncs.c
--- b/src/backend/nodes/nodeFuncs.c
***************
*** 1457,1462 **** exprLocation(const Node *expr)
--- 1457,1465 ----
case T_WithClause:
loc = ((const WithClause *) expr)->location;
break;
+ case T_ReturningClause:
+ loc = ((const ReturningClause *) expr)->location;
+ break;
case T_CommonTableExpr:
loc = ((const CommonTableExpr *) expr)->location;
break;
***************
*** 2930,2936 **** raw_expression_tree_walker(Node *node,
return true;
if (walker(stmt->selectStmt, context))
return true;
! if (walker(stmt->returningList, context))
return true;
if (walker(stmt->withClause, context))
return true;
--- 2933,2939 ----
return true;
if (walker(stmt->selectStmt, context))
return true;
! if (walker(stmt->rlist, context))
return true;
if (walker(stmt->withClause, context))
return true;
***************
*** 2946,2952 **** raw_expression_tree_walker(Node *node,
return true;
if (walker(stmt->whereClause, context))
return true;
! if (walker(stmt->returningList, context))
return true;
if (walker(stmt->withClause, context))
return true;
--- 2949,2955 ----
return true;
if (walker(stmt->whereClause, context))
return true;
! if (walker(stmt->rlist, context))
return true;
if (walker(stmt->withClause, context))
return true;
***************
*** 2964,2970 **** raw_expression_tree_walker(Node *node,
return true;
if (walker(stmt->fromClause, context))
return true;
! if (walker(stmt->returningList, context))
return true;
if (walker(stmt->withClause, context))
return true;
--- 2967,2973 ----
return true;
if (walker(stmt->fromClause, context))
return true;
! if (walker(stmt->rlist, context))
return true;
if (walker(stmt->withClause, context))
return true;
***************
*** 3157,3162 **** raw_expression_tree_walker(Node *node,
--- 3160,3167 ----
break;
case T_WithClause:
return walker(((WithClause *) node)->ctes, context);
+ case T_ReturningClause:
+ return walker(((ReturningClause *) node)->returningList, context);
case T_CommonTableExpr:
return walker(((CommonTableExpr *) node)->ctequery, context);
default:
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 336,341 **** _outModifyTable(StringInfo str, const ModifyTable *node)
--- 336,342 ----
WRITE_NODE_FIELD(returningLists);
WRITE_NODE_FIELD(fdwPrivLists);
WRITE_NODE_FIELD(rowMarks);
+ WRITE_ENUM_FIELD(spec, SpecType);
WRITE_INT_FIELD(epqParam);
}
***************
*** 2253,2258 **** _outQuery(StringInfo str, const Query *node)
--- 2254,2260 ----
WRITE_NODE_FIELD(jointree);
WRITE_NODE_FIELD(targetList);
WRITE_NODE_FIELD(withCheckOptions);
+ WRITE_ENUM_FIELD(specClause, SpecType);
WRITE_NODE_FIELD(returningList);
WRITE_NODE_FIELD(groupClause);
WRITE_NODE_FIELD(havingQual);
***************
*** 2326,2331 **** _outWithClause(StringInfo str, const WithClause *node)
--- 2328,2343 ----
}
static void
+ _outReturningClause(StringInfo str, const ReturningClause *node)
+ {
+ WRITE_NODE_TYPE("RETURNINGCLAUSE");
+
+ WRITE_NODE_FIELD(returningList);
+ WRITE_BOOL_FIELD(rejects);
+ WRITE_LOCATION_FIELD(location);
+ }
+
+ static void
_outCommonTableExpr(StringInfo str, const CommonTableExpr *node)
{
WRITE_NODE_TYPE("COMMONTABLEEXPR");
***************
*** 3147,3152 **** _outNode(StringInfo str, const void *obj)
--- 3159,3167 ----
case T_WithClause:
_outWithClause(str, obj);
break;
+ case T_ReturningClause:
+ _outReturningClause(str, obj);
+ break;
case T_CommonTableExpr:
_outCommonTableExpr(str, obj);
break;
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 211,216 **** _readQuery(void)
--- 211,217 ----
READ_NODE_FIELD(jointree);
READ_NODE_FIELD(targetList);
READ_NODE_FIELD(withCheckOptions);
+ READ_ENUM_FIELD(specClause, SpecType);
READ_NODE_FIELD(returningList);
READ_NODE_FIELD(groupClause);
READ_NODE_FIELD(havingQual);
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 4737,4743 **** make_modifytable(PlannerInfo *root,
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, int epqParam)
{
ModifyTable *node = makeNode(ModifyTable);
Plan *plan = &node->plan;
--- 4737,4743 ----
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, SpecType spec, int epqParam)
{
ModifyTable *node = makeNode(ModifyTable);
Plan *plan = &node->plan;
***************
*** 4789,4794 **** make_modifytable(PlannerInfo *root,
--- 4789,4795 ----
node->withCheckOptionLists = withCheckOptionLists;
node->returningLists = returningLists;
node->rowMarks = rowMarks;
+ node->spec = spec;
node->epqParam = epqParam;
/*
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 609,614 **** subquery_planner(PlannerGlobal *glob, Query *parse,
--- 609,615 ----
withCheckOptionLists,
returningLists,
rowMarks,
+ parse->specClause,
SS_assign_special_param(root));
}
}
***************
*** 1008,1013 **** inheritance_planner(PlannerInfo *root)
--- 1009,1015 ----
withCheckOptionLists,
returningLists,
rowMarks,
+ parse->specClause,
SS_assign_special_param(root));
}
*** a/src/backend/parser/analyze.c
--- b/src/backend/parser/analyze.c
***************
*** 61,67 **** static Node *transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
static void determineRecursiveColTypes(ParseState *pstate,
Node *larg, List *nrtargetlist);
static Query *transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt);
! static List *transformReturningList(ParseState *pstate, List *returningList);
static Query *transformDeclareCursorStmt(ParseState *pstate,
DeclareCursorStmt *stmt);
static Query *transformExplainStmt(ParseState *pstate,
--- 61,68 ----
static void determineRecursiveColTypes(ParseState *pstate,
Node *larg, List *nrtargetlist);
static Query *transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt);
! static List *transformReturningClause(ParseState *pstate, ReturningClause *returningList,
! bool *rejects);
static Query *transformDeclareCursorStmt(ParseState *pstate,
DeclareCursorStmt *stmt);
static Query *transformExplainStmt(ParseState *pstate,
***************
*** 343,348 **** transformDeleteStmt(ParseState *pstate, DeleteStmt *stmt)
--- 344,350 ----
{
Query *qry = makeNode(Query);
Node *qual;
+ bool rejects;
qry->commandType = CMD_DELETE;
***************
*** 373,384 **** transformDeleteStmt(ParseState *pstate, DeleteStmt *stmt)
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningList(pstate, stmt->returningList);
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
qry->hasSubLinks = pstate->p_hasSubLinks;
qry->hasWindowFuncs = pstate->p_hasWindowFuncs;
qry->hasAggs = pstate->p_hasAggs;
--- 375,394 ----
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningClause(pstate, stmt->rlist, &rejects);
!
! if (rejects)
! ereport(ERROR,
! (errcode(ERRCODE_SYNTAX_ERROR),
! errmsg("RETURNING clause does not accept REJECTS for DELETE statements"),
! parser_errposition(pstate,
! exprLocation((Node *) stmt->rlist))));
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
+ qry->specClause = SPEC_NONE;
qry->hasSubLinks = pstate->p_hasSubLinks;
qry->hasWindowFuncs = pstate->p_hasWindowFuncs;
qry->hasAggs = pstate->p_hasAggs;
***************
*** 399,404 **** transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
--- 409,415 ----
{
Query *qry = makeNode(Query);
SelectStmt *selectStmt = (SelectStmt *) stmt->selectStmt;
+ SpecType spec = stmt->specClause;
List *exprList = NIL;
bool isGeneralSelect;
List *sub_rtable;
***************
*** 410,415 **** transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
--- 421,427 ----
ListCell *icols;
ListCell *attnos;
ListCell *lc;
+ bool rejects = false;
/* There can't be any outer WITH to worry about */
Assert(pstate->p_ctenamespace == NIL);
***************
*** 737,755 **** transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
* RETURNING will work. Also, remove any namespace entries added in a
* sub-SELECT or VALUES list.
*/
! if (stmt->returningList)
{
pstate->p_namespace = NIL;
addRTEtoQuery(pstate, pstate->p_target_rangetblentry,
false, true, true);
! qry->returningList = transformReturningList(pstate,
! stmt->returningList);
}
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, NULL);
qry->hasSubLinks = pstate->p_hasSubLinks;
assign_query_collations(pstate, qry);
--- 749,782 ----
* RETURNING will work. Also, remove any namespace entries added in a
* sub-SELECT or VALUES list.
*/
! if (stmt->rlist)
{
pstate->p_namespace = NIL;
addRTEtoQuery(pstate, pstate->p_target_rangetblentry,
false, true, true);
! qry->returningList = transformReturningClause(pstate,
! stmt->rlist, &rejects);
}
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, NULL);
+ /* Normalize speculative insertion specification */
+ if (rejects)
+ {
+ if (spec == SPEC_IGNORE)
+ spec = SPEC_IGNORE_REJECTS;
+ else if (spec == SPEC_UPDATE)
+ spec = SPEC_UPDATE_REJECTS;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("RETURNING clause with REJECTS can only appear when ON DUPLICATE KEY was also specified"),
+ parser_errposition(pstate,
+ exprLocation((Node *) stmt->rlist))));
+ }
+ qry->specClause = spec;
qry->hasSubLinks = pstate->p_hasSubLinks;
assign_query_collations(pstate, qry);
***************
*** 997,1002 **** transformSelectStmt(ParseState *pstate, SelectStmt *stmt)
--- 1024,1030 ----
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
+ qry->specClause = SPEC_NONE;
qry->hasSubLinks = pstate->p_hasSubLinks;
qry->hasWindowFuncs = pstate->p_hasWindowFuncs;
***************
*** 1893,1898 **** transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
--- 1921,1927 ----
Node *qual;
ListCell *origTargetList;
ListCell *tl;
+ bool rejects;
qry->commandType = CMD_UPDATE;
pstate->p_is_update = true;
***************
*** 1922,1931 **** transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningList(pstate, stmt->returningList);
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
qry->hasSubLinks = pstate->p_hasSubLinks;
--- 1951,1969 ----
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningClause(pstate, stmt->rlist,
! &rejects);
!
! if (rejects)
! ereport(ERROR,
! (errcode(ERRCODE_SYNTAX_ERROR),
! errmsg("RETURNING clause does not accept REJECTS for UPDATE statements"),
! parser_errposition(pstate,
! exprLocation((Node *) stmt->rlist))));
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
+ qry->specClause = SPEC_NONE;
qry->hasSubLinks = pstate->p_hasSubLinks;
***************
*** 1995,2011 **** transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
}
/*
! * transformReturningList -
* handle a RETURNING clause in INSERT/UPDATE/DELETE
*/
static List *
! transformReturningList(ParseState *pstate, List *returningList)
{
! List *rlist;
int save_next_resno;
! if (returningList == NIL)
! return NIL; /* nothing to do */
/*
* We need to assign resnos starting at one in the RETURNING list. Save
--- 2033,2055 ----
}
/*
! * transformReturningClause -
* handle a RETURNING clause in INSERT/UPDATE/DELETE
*/
static List *
! transformReturningClause(ParseState *pstate, ReturningClause *clause,
! bool *rejects)
{
! List *tlist, *rlist;
int save_next_resno;
! if (clause == NULL)
! {
! *rejects = false;
! return NIL;
! }
!
! rlist = clause->returningList;
/*
* We need to assign resnos starting at one in the RETURNING list. Save
***************
*** 2016,2030 **** transformReturningList(ParseState *pstate, List *returningList)
pstate->p_next_resno = 1;
/* transform RETURNING identically to a SELECT targetlist */
! rlist = transformTargetList(pstate, returningList, EXPR_KIND_RETURNING);
/* mark column origins */
! markTargetListOrigins(pstate, rlist);
/* restore state */
pstate->p_next_resno = save_next_resno;
! return rlist;
}
--- 2060,2098 ----
pstate->p_next_resno = 1;
/* transform RETURNING identically to a SELECT targetlist */
! tlist = transformTargetList(pstate, rlist, EXPR_KIND_RETURNING);
!
! /* Cannot accept system column Vars when returning rejects */
! if (clause->rejects)
! {
! ListCell *l;
!
! foreach(l, tlist)
! {
! TargetEntry *tle = (TargetEntry *) lfirst(l);
! Var *var = (Var *) tle->expr;
!
! if (var->varattno <= 0)
! {
! ereport(ERROR,
! (errcode(ERRCODE_UNDEFINED_COLUMN),
! errmsg("RETURNING clause cannot return system columns when REJECTS is specified"),
! parser_errposition(pstate,
! exprLocation((Node *) var))));
! }
! }
! }
!
! /* pass on if we return rejects */
! *rejects = clause->rejects;
/* mark column origins */
! markTargetListOrigins(pstate, tlist);
/* restore state */
pstate->p_next_resno = save_next_resno;
! return tlist;
}
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
***************
*** 204,209 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 204,210 ----
RangeVar *range;
IntoClause *into;
WithClause *with;
+ ReturningClause *returnc;
A_Indices *aind;
ResTarget *target;
struct PrivTarget *privtarget;
***************
*** 342,351 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
opclass_purpose opt_opfamily transaction_mode_list_or_empty
OptTableFuncElementList TableFuncElementList opt_type_modifiers
prep_type_clause
! execute_param_clause using_clause returning_clause
! opt_enum_val_list enum_val_list table_func_column_list
! create_generic_options alter_generic_options
! relation_expr_list dostmt_opt_list
%type <list> opt_fdw_options fdw_options
%type <defelt> fdw_option
--- 343,351 ----
opclass_purpose opt_opfamily transaction_mode_list_or_empty
OptTableFuncElementList TableFuncElementList opt_type_modifiers
prep_type_clause
! execute_param_clause using_clause opt_enum_val_list
! enum_val_list table_func_column_list create_generic_options
! alter_generic_options relation_expr_list dostmt_opt_list
%type <list> opt_fdw_options fdw_options
%type <defelt> fdw_option
***************
*** 396,401 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 396,402 ----
%type <defelt> SeqOptElem
%type <istmt> insert_rest
+ %type <ival> opt_on_duplicate_key
%type <vsetstmt> set_rest set_rest_more SetResetClause FunctionSetResetClause
***************
*** 487,492 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 488,495 ----
%type <node> func_expr func_expr_windowless
%type <node> common_table_expr
%type <with> with_clause opt_with_clause
+ %type <boolean> opt_rejects
+ %type <returnc> returning_clause
%type <list> cte_list
%type <list> window_clause window_definition_list opt_partition_clause
***************
*** 536,541 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 539,545 ----
DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS
DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DESC
DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP
+ DUPLICATE
EACH ELSE ENABLE_P ENCODING ENCRYPTED END_P ENUM_P ESCAPE EVENT EXCEPT
EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
***************
*** 548,554 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
HANDLER HAVING HEADER_P HOLD HOUR_P
! IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IN_P
INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
--- 552,558 ----
HANDLER HAVING HEADER_P HOLD HOUR_P
! IDENTITY_P IF_P IGNORE ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IN_P
INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
***************
*** 577,583 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE
RANGE READ REAL REASSIGN RECHECK RECURSIVE REF REFERENCES REFRESH REINDEX
! RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK
ROW ROWS RULE
--- 581,587 ----
QUOTE
RANGE READ REAL REASSIGN RECHECK RECURSIVE REF REFERENCES REFRESH REINDEX
! REJECTS RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK
ROW ROWS RULE
***************
*** 8863,8873 **** DeallocateStmt: DEALLOCATE name
*****************************************************************************/
InsertStmt:
! opt_with_clause INSERT INTO qualified_name insert_rest returning_clause
{
$5->relation = $4;
! $5->returningList = $6;
$5->withClause = $1;
$$ = (Node *) $5;
}
;
--- 8867,8879 ----
*****************************************************************************/
InsertStmt:
! opt_with_clause INSERT INTO qualified_name insert_rest
! opt_on_duplicate_key returning_clause
{
$5->relation = $4;
! $5->rlist = $7;
$5->withClause = $1;
+ $5->specClause = $6;
$$ = (Node *) $5;
}
;
***************
*** 8911,8919 **** insert_column_item:
}
;
returning_clause:
! RETURNING target_list { $$ = $2; }
! | /* EMPTY */ { $$ = NIL; }
;
--- 8917,8961 ----
}
;
+ opt_on_duplicate_key:
+ ON DUPLICATE KEY LOCK_P FOR UPDATE
+ {
+ $$ = SPEC_UPDATE;
+ }
+ |
+ ON DUPLICATE KEY IGNORE
+ {
+ $$ = SPEC_IGNORE;
+ }
+ | /*EMPTY*/
+ {
+ $$ = SPEC_NONE;
+ }
+ ;
+
+ opt_rejects:
+ REJECTS
+ {
+ $$ = TRUE;
+ }
+ | /*EMPTY*/
+ {
+ $$ = FALSE;
+ }
+ ;
+
returning_clause:
! RETURNING opt_rejects target_list
! {
! $$ = makeNode(ReturningClause);
! $$->returningList = $3;
! $$->rejects = $2;
! $$->location = @1;
! }
! | /* EMPTY */
! {
! $$ = NULL;
! }
;
***************
*** 8931,8937 **** DeleteStmt: opt_with_clause DELETE_P FROM relation_expr_opt_alias
n->relation = $4;
n->usingClause = $5;
n->whereClause = $6;
! n->returningList = $7;
n->withClause = $1;
$$ = (Node *)n;
}
--- 8973,8979 ----
n->relation = $4;
n->usingClause = $5;
n->whereClause = $6;
! n->rlist = $7;
n->withClause = $1;
$$ = (Node *)n;
}
***************
*** 8998,9004 **** UpdateStmt: opt_with_clause UPDATE relation_expr_opt_alias
n->targetList = $5;
n->fromClause = $6;
n->whereClause = $7;
! n->returningList = $8;
n->withClause = $1;
$$ = (Node *)n;
}
--- 9040,9046 ----
n->targetList = $5;
n->fromClause = $6;
n->whereClause = $7;
! n->rlist = $8;
n->withClause = $1;
$$ = (Node *)n;
}
***************
*** 12559,12564 **** unreserved_keyword:
--- 12601,12607 ----
| DOMAIN_P
| DOUBLE_P
| DROP
+ | DUPLICATE
| EACH
| ENABLE_P
| ENCODING
***************
*** 12589,12594 **** unreserved_keyword:
--- 12632,12638 ----
| HOUR_P
| IDENTITY_P
| IF_P
+ | IGNORE
| IMMEDIATE
| IMMUTABLE
| IMPLICIT_P
***************
*** 12914,12919 **** reserved_keyword:
--- 12958,12964 ----
| PLACING
| PRIMARY
| REFERENCES
+ | REJECTS
| RETURNING
| SELECT
| SESSION_USER
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 4047,4053 **** RelationGetExclusionInfo(Relation indexRelation,
MemoryContextSwitchTo(oldcxt);
}
-
/*
* Routines to support ereport() reports of relation-related errors
*
--- 4047,4052 ----
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 837,842 **** HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
--- 837,843 ----
* Here, we consider the effects of:
* all transactions committed as of the time of the given snapshot
* previous commands of this transaction
+ * rows locked (but not updated) by this transaction, even if the inserting transaction is not visible to the snapshot
*
* Does _not_ include:
* transactions shown as in-progress by the snapshot
***************
*** 959,965 **** HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
--- 960,985 ----
* when...
*/
if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
+ {
+ /*
+ * Not visible to snapshot under conventional MVCC rules, but may still
+ * be exclusive locked by our xact and not updated, which will satisfy
+ * MVCC under a special rule. Importantly, this special rule will not
+ * be invoked if the row is updated, so only one version can be visible
+ * at once.
+ *
+ * Currently this is useful to exactly one case -- INSERT...ON
+ * DUPLICATE KEY LOCK FOR UPDATE, where it's possible and sometimes
+ * desirable to lock a row that would not otherwise be visible to the
+ * given MVCC snapshot. The locked row should on that basis alone
+ * become visible, for the benefit of READ COMMITTED mode.
+ */
+ if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+ TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+ return true;
+
return false; /* treat as still in progress */
+ }
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 138,144 **** extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
--- 138,144 ----
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd, bool kill);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
***************
*** 348,361 **** extern void ExecCloseScanRelation(Relation scanrel);
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
! EState *estate);
! extern bool check_exclusion_constraint(Relation heap, Relation index,
IndexInfo *indexInfo,
ItemPointer tupleid,
Datum *values, bool *isnull,
EState *estate,
! bool newIndex, bool errorOK);
extern void RegisterExprContextCallback(ExprContext *econtext,
ExprContextCallbackFunction function,
--- 348,366 ----
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
+ extern List *ExecLockIndexValues(TupleTableSlot *slot, EState *estate,
+ SpecType specReason);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
! EState *estate, bool noErrorOnDuplicate);
! extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
! ItemPointer conflictTid);
! extern bool check_exclusion_or_unique_constraint(Relation heap, Relation index,
IndexInfo *indexInfo,
ItemPointer tupleid,
Datum *values, bool *isnull,
EState *estate,
! bool newIndex, bool errorOK, bool wait,
! ItemPointer conflictTid);
extern void RegisterExprContextCallback(ExprContext *econtext,
ExprContextCallbackFunction function,
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 41,46 ****
--- 41,49 ----
* ExclusionOps Per-column exclusion operators, or NULL if none
* ExclusionProcs Underlying function OIDs for ExclusionOps
* ExclusionStrats Opclass strategy numbers for ExclusionOps
+ * UniqueOps These are like Exclusion*, but for unique indexes
+ * UniqueProcs
+ * UniqueStrats
* Unique is it a unique index?
* ReadyForInserts is it valid for inserts?
* Concurrent are we doing a concurrent index build?
***************
*** 62,67 **** typedef struct IndexInfo
--- 65,73 ----
Oid *ii_ExclusionOps; /* array with one entry per column */
Oid *ii_ExclusionProcs; /* array with one entry per column */
uint16 *ii_ExclusionStrats; /* array with one entry per column */
+ Oid *ii_UniqueOps; /* array with one entry per column */
+ Oid *ii_UniqueProcs; /* array with one entry per column */
+ uint16 *ii_UniqueStrats; /* array with one entry per column */
bool ii_Unique;
bool ii_ReadyForInserts;
bool ii_Concurrent;
***************
*** 1085,1090 **** typedef struct ModifyTableState
--- 1091,1097 ----
int mt_whichplan; /* which one is being executed (0..n-1) */
ResultRelInfo *resultRelInfo; /* per-subplan target relations */
List **mt_arowmarks; /* per-subplan ExecAuxRowMark lists */
+ SpecType spec; /* reason for speculative insertion */
EPQState mt_epqstate; /* for evaluating EvalPlanQual rechecks */
bool fireBSTriggers; /* do we need to fire stmt triggers? */
} ModifyTableState;
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
***************
*** 402,407 **** typedef enum NodeTag
--- 402,408 ----
T_RowMarkClause,
T_XmlSerialize,
T_WithClause,
+ T_ReturningClause,
T_CommonTableExpr,
/*
***************
*** 545,550 **** typedef enum CmdType
--- 546,564 ----
* with qual */
} CmdType;
+ /* SpecType -
+ * "Speculative insertion" clause
+ *
+ * This enum is shared across the parser, planner and executor
+ */
+ typedef enum
+ {
+ SPEC_NONE, /* No reason to insert speculatively */
+ SPEC_IGNORE, /* "ON DUPLICATE KEY IGNORE" */
+ SPEC_IGNORE_REJECTS, /* same as SPEC_IGNORE, plus RETURNING rejected */
+ SPEC_UPDATE, /* "ON DUPLICATE KEY LOCK FOR UPDATE" */
+ SPEC_UPDATE_REJECTS /* same as SPEC_UPDATE, plus RETURNING rejected */
+ } SpecType;
/*
* JoinType -
*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
***************
*** 130,135 **** typedef struct Query
--- 130,137 ----
List *withCheckOptions; /* a list of WithCheckOption's */
+ SpecType specClause; /* speculative insertion clause */
+
List *returningList; /* return-values list (of TargetEntry) */
List *groupClause; /* a list of SortGroupClause's */
***************
*** 949,954 **** typedef struct WithClause
--- 951,971 ----
} WithClause;
/*
+ * ReturningClause -
+ * representation of returninglist for parsing
+ *
+ * Note: ReturningClause does not propagate into the Query representation;
+ * returningList does, while rejects influences speculative insertion.
+ */
+ typedef struct ReturningClause
+ {
+ NodeTag type;
+ List *returningList; /* the RETURNING target list */
+ bool rejects; /* return rejected rows instead of inserted ones? */
+ int location; /* token location, or -1 if unknown */
+ } ReturningClause;
+
+ /*
* CommonTableExpr -
* representation of WITH list element
*
***************
*** 998,1004 **** typedef struct InsertStmt
RangeVar *relation; /* relation to insert into */
List *cols; /* optional: names of the target columns */
Node *selectStmt; /* the source SELECT/VALUES, or NULL */
! List *returningList; /* list of expressions to return */
WithClause *withClause; /* WITH clause */
} InsertStmt;
--- 1015,1022 ----
RangeVar *relation; /* relation to insert into */
List *cols; /* optional: names of the target columns */
Node *selectStmt; /* the source SELECT/VALUES, or NULL */
! SpecType specClause; /* ON DUPLICATE KEY specification */
! ReturningClause *rlist; /* expressions to return */
WithClause *withClause; /* WITH clause */
} InsertStmt;
***************
*** 1012,1018 **** typedef struct DeleteStmt
RangeVar *relation; /* relation to delete from */
List *usingClause; /* optional using clause for more tables */
Node *whereClause; /* qualifications */
! List *returningList; /* list of expressions to return */
WithClause *withClause; /* WITH clause */
} DeleteStmt;
--- 1030,1036 ----
RangeVar *relation; /* relation to delete from */
List *usingClause; /* optional using clause for more tables */
Node *whereClause; /* qualifications */
! ReturningClause *rlist; /* expressions to return */
WithClause *withClause; /* WITH clause */
} DeleteStmt;
***************
*** 1027,1033 **** typedef struct UpdateStmt
List *targetList; /* the target list (of ResTarget) */
Node *whereClause; /* qualifications */
List *fromClause; /* optional from clause for more tables */
! List *returningList; /* list of expressions to return */
WithClause *withClause; /* WITH clause */
} UpdateStmt;
--- 1045,1051 ----
List *targetList; /* the target list (of ResTarget) */
Node *whereClause; /* qualifications */
List *fromClause; /* optional from clause for more tables */
! ReturningClause *rlist; /* expressions to return */
WithClause *withClause; /* WITH clause */
} UpdateStmt;
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 176,181 **** typedef struct ModifyTable
--- 176,182 ----
List *returningLists; /* per-target-table RETURNING tlists */
List *fdwPrivLists; /* per-target-table FDW private data lists */
List *rowMarks; /* PlanRowMarks (non-locking only) */
+ SpecType spec; /* speculative insertion specification */
int epqParam; /* ID of Param for EvalPlanQual re-eval */
} ModifyTable;
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 84,90 **** extern ModifyTable *make_modifytable(PlannerInfo *root,
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, int epqParam);
extern bool is_projection_capable_plan(Plan *plan);
/*
--- 84,90 ----
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, SpecType spec, int epqParam);
extern bool is_projection_capable_plan(Plan *plan);
/*
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
***************
*** 133,138 **** PG_KEYWORD("document", DOCUMENT_P, UNRESERVED_KEYWORD)
--- 133,139 ----
PG_KEYWORD("domain", DOMAIN_P, UNRESERVED_KEYWORD)
PG_KEYWORD("double", DOUBLE_P, UNRESERVED_KEYWORD)
PG_KEYWORD("drop", DROP, UNRESERVED_KEYWORD)
+ PG_KEYWORD("duplicate", DUPLICATE, UNRESERVED_KEYWORD)
PG_KEYWORD("each", EACH, UNRESERVED_KEYWORD)
PG_KEYWORD("else", ELSE, RESERVED_KEYWORD)
PG_KEYWORD("enable", ENABLE_P, UNRESERVED_KEYWORD)
***************
*** 180,185 **** PG_KEYWORD("hold", HOLD, UNRESERVED_KEYWORD)
--- 181,187 ----
PG_KEYWORD("hour", HOUR_P, UNRESERVED_KEYWORD)
PG_KEYWORD("identity", IDENTITY_P, UNRESERVED_KEYWORD)
PG_KEYWORD("if", IF_P, UNRESERVED_KEYWORD)
+ PG_KEYWORD("ignore", IGNORE, UNRESERVED_KEYWORD)
PG_KEYWORD("ilike", ILIKE, TYPE_FUNC_NAME_KEYWORD)
PG_KEYWORD("immediate", IMMEDIATE, UNRESERVED_KEYWORD)
PG_KEYWORD("immutable", IMMUTABLE, UNRESERVED_KEYWORD)
***************
*** 307,312 **** PG_KEYWORD("ref", REF, UNRESERVED_KEYWORD)
--- 309,315 ----
PG_KEYWORD("references", REFERENCES, RESERVED_KEYWORD)
PG_KEYWORD("refresh", REFRESH, UNRESERVED_KEYWORD)
PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD)
+ PG_KEYWORD("rejects", REJECTS, RESERVED_KEYWORD)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD)
*** a/src/test/isolation/isolation_schedule
--- b/src/test/isolation/isolation_schedule
***************
*** 22,24 **** test: aborted-keyrevoke
--- 22,26 ----
test: multixact-no-deadlock
test: drop-index-concurrently-1
test: timeouts
+ test: insert-duplicate-key-ignore
+ test: insert-duplicate-key-lock-for-update
*** /dev/null
--- b/src/test/isolation/specs/insert-duplicate-key-ignore.spec
***************
*** 0 ****
--- 1,42 ----
+ # INSERT...ON DUPLICATE KEY IGNORE test
+ #
+ # This test tries to expose problems with the interaction between concurrent
+ # sessions during INSERT...ON DUPLICATE KEY IGNORE.
+ #
+ # The convention here is that session 1 always ends up inserting, and session 2
+ # always ends up ignoring.
+
+ setup
+ {
+ CREATE TABLE ints (key int primary key, val text);
+ }
+
+ teardown
+ {
+ DROP TABLE ints;
+ }
+
+ session "s1"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "ignore1" { INSERT INTO ints(key, val) VALUES(1, 'ignore1') ON DUPLICATE KEY IGNORE; }
+ step "select1" { SELECT * FROM ints; }
+ step "c1" { COMMIT; }
+ step "a1" { ABORT; }
+
+ session "s2"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "ignore2" { INSERT INTO ints(key, val) VALUES(1, 'ignore2') ON DUPLICATE KEY IGNORE; }
+ step "select2" { SELECT * FROM ints; }
+ step "c2" { COMMIT; }
+ step "a2" { ABORT; }
+
+ # Regular case where one session block-waits on another to determine if it
+ # should proceed with an insert or ignore.
+ permutation "ignore1" "ignore2" "c1" "select2" "c2"
+ permutation "ignore1" "ignore2" "a1" "select2" "c2"
*** /dev/null
--- b/src/test/isolation/specs/insert-duplicate-key-lock-for-update.spec
***************
*** 0 ****
--- 1,39 ----
+ # INSERT...ON DUPLICATE KEY LOCK FOR UPDATE test
+ #
+ # This test tries to expose problems with the interaction between concurrent
+ # sessions during INSERT...ON DUPLICATE KEY LOCK FOR UPDATE.
+
+ setup
+ {
+ CREATE TABLE ints (key int primary key, val text);
+ }
+
+ teardown
+ {
+ DROP TABLE ints;
+ }
+
+ session "s1"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "lock1" { INSERT INTO ints(key, val) VALUES(1, 'lock1') ON DUPLICATE KEY LOCK FOR UPDATE; }
+ step "select1" { SELECT * FROM ints; }
+ step "c1" { COMMIT; }
+ step "a1" { ABORT; }
+
+ session "s2"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "lock2" { INSERT INTO ints(key, val) VALUES(1, 'lock2') ON DUPLICATE KEY LOCK FOR UPDATE; }
+ step "select2" { SELECT * FROM ints; }
+ step "c2" { COMMIT; }
+ step "a2" { ABORT; }
+
+ # Regular case where one session block-waits on another to determine if it
+ # should proceed with an insert or lock.
+ permutation "lock1" "lock2" "c1" "select2" "c2"
+ permutation "lock1" "lock2" "a1" "select2" "c2"
On Tue, Nov 19, 2013 at 5:13 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Ok. Which use case are you targeting during this initial effort, batch
> updates or small OLTP transactions?

OLTP transactions are probably my primary concern. I just realized
that I wasn't actually very clear on that point in my most recent
e-mail -- my apologies. What we really need for batching, and what we
should work towards in the medium term is MERGE, where a single table
scan does everything.
However, I also care about facilitating conflict resolution in
multi-master replication systems, so I think we definitely ought to
consider that carefully if at all possible. Incidentally, Andres said
a few weeks back that he thinks that what I've proposed ought to be
only exposed to C code, owing to the fact that it necessitates the
visibility trick (actually, I think UPSERT does generally, but what
I've done has, I suppose, necessitated making it more explicit/general
- i.e. modifications are added to HeapTupleSatisfiesMVCC()). I don't
understand what difference it makes to only expose it at the C level
- what I've proposed in this area is either correct or incorrect
(Andres mentioned the Halloween problem). Furthermore, I presume that
it's broadly useful to have Bucardo-style custom conflict resolution
policies, without people having to get their hands dirty with C, and I
think having this at the SQL level helps there. Plus, as I've said
many times, the flexibility this syntax offers is likely to be broadly
useful for ordinary SQL clients - this is almost as good as SQL MERGE
for many cases.
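To make the conflict-resolution point a bit more concrete, here is a
minimal sketch of a "newest write wins" policy expressed entirely in SQL
using the proposed syntax (the table, columns and the policy itself are
invented purely for illustration):
with r as (
insert into tab(key, val, updated_at)
values (42, 'incoming', now())
on duplicate key lock for update
returning rejects *
)
update tab
set val = r.val, updated_at = r.updated_at
from r
where tab.key = r.key
and tab.updated_at < r.updated_at; -- keep the newer of the two versions
The resolution policy lives entirely in the DML, which is the sort of
thing that would otherwise require getting one's hands dirty with C.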
Seems like an awful lot of additional mechanism.
Not really. Once you have the code in place to do the kill-inserted-tuple
dance on a conflict, all you need is to do an extra index search before it.
And once you have that, it's not hard to add some kind of a heuristic to
either do the pre-check or skip it.
Perhaps.
I probably shouldn't have mentioned markpos/restrpos, you're right that it's
not a good idea to conflate that with index insertion. Nevertheless, some
kind of an API for doing a duplicate-key check prior to insertion, and
remembering the location for the actual insert later, seems sensible. It's
certainly no more of a modularity violation than the value-locking scheme
you're proposing.
I'm not so sure - in principle, any locking implementation can be used
by any conceivable amcanunique indexing method. The core system knows
that it isn't okay to sit on them all day long, but that doesn't seem
very onerous.
I'm certainly not opposed to making something like this work for
exclusion constraints. Certainly, I want this to be as general as
possible. But I don't think that it needs to be a blocker, and I don't
think we gain anything in code footprint by addressing that by being
as general as possible in our approach to the basic concurrency issue.
After all, we're going to have to repeat the basic pattern in multiple
modules.
Well, I don't know what to say. I *do* have a hunch that we'd gain much in
code footprint by making this general. I don't understand what pattern you'd
need to repeat in multiple modules.
Now that I see this rough patch, I better appreciate what you mean. I
withdraw this objection.
Here's a patch, implementing a rough version of the scheme I'm trying to
explain. It's not as polished as yours, but it ought to be enough to
evaluate the code footprint and performance. It doesn't make any changes to
the indexam API, and it works the same with exclusion constraints and unique
constraints. As it stands, it doesn't leave bloat behind, except when a
concurrent insertion with a conflicting key happens between the first
"pre-check" and the actual insertion. That should be rare in practice.What have you been using to performance test this?
I was just testing my patch against a custom pgbench workload,
involving running upserts against a table from a fixed range of PK
values. It's proven kind of difficult to benchmark this in the way
that pgbench has proved useful for in the past. Pretty soon the
table's PK range is "saturated", so they're all updates, but on the
other hand how do you balance the INSERT or UPDATE case?
Multiple unique indexes are the interesting case for comparing both
approaches. I didn't really worry about performance so much as
correctness, and for multiple unique constraints your approach clearly
falls down, as explained below.
Is it even useful to lock multiple rows if we can't really
update them, because they'll overlap each other when all updated with
the one value?
Hmm. I think what you're referring to is the case where you try to insert a
row so that it violates an exclusion constraint, and in a way that it
conflicts with a large number of existing tuples. For example, if you have a
calendar application with a constraint that two reservations must not
overlap, and you try to insert a new reservation that covers, say, a whole
decade.
Right.
That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the
conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we would
need to handle a large TID array. Or maybe we can arrange it so that the
tuples are locked as we scan them, without having to collect them all in a
large array.
(the attached patch only locks the first existing tuple that conflicts; that
needs to be fixed)
I'm having a hard time seeing how ON DUPLICATE KEY LOCK FOR UPDATE is
of very much use to exclusion constraints at all. Perhaps I lack
imagination here. However, ON DUPLICATE KEY IGNORE certainly *is*
useful with exclusion constraints, and I'm not dismissive of that.
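For the sake of concreteness, the kind of case in question might look
like this (schema invented for illustration; exactly how, or whether,
the proposed syntax would apply to exclusion constraints is the open
question here):
create extension if not exists btree_gist;
create table reservations
(
room int4,
during tsrange,
exclude using gist (room with =, during with &&)
);
-- a decade-long reservation can conflict with an arbitrary number of
-- existing rows; IGNORE can simply skip it, but LOCK FOR UPDATE would
-- have to lock (or at least remember) every conflicting row
insert into reservations values (1, '[2014-01-01,2024-01-01)')
on duplicate key ignore;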
I think we ought to at least be realistic about the concerns that
inform your approach here. I don't think that making this work for
exclusion constraints is all that compelling; I'll take it, I guess
(not that there is obviously a dichotomy between doing btree locking
and doing ECs too), but I doubt people put "overlaps" operators in the
predicates of DML very often *at all*, and doubt even more that there
is actual demand for upserting there. I think that the reason that you
prefer this design is almost entirely down to possible hazards with
btree locking around what I've done (or, indeed anything that
approximates what I've done); maybe that's so obvious that it didn't
even occur to you to mention it, but I think it should be
acknowledged. I don't think that using index locking of *some* form is
unreasonable. Certainly, I think that from reading the literature
(e.g. [1]) one can find evidence that btree page index locking as part
of value locking seems like a common technique in many popular RDBMSs,
and presumably forms an important part of their SQL MERGE
implementations. As it says in that paper:
"""
Thus, non-leaf pages do not require locks and are protected by latches
only. The remainder of this paper focuses on locks.
"""
They talk here about a very traditional System-R architecture -
"Assumptions about the database environment are designed to be very
traditional". Latches here are basically equivalent to our buffer
locks, and what they call locks we call heavyweight locks. So I'm
pretty sure many other *traditional* systems handle value locking by
escalating a "latch" to a leaf-page-level heavyweight lock (it's often
more granular too). I think that the advantages are fairly
fundamental.
I think that "4.1 Locks on keys and ranges" of this paper is interesting.
I've also found a gentler introduction to traditional btree key
locking [2]. In that paper, section "5 Protecting a B-tree’s logical
contents" it is said:
"""
Latches must be managed carefully in key range locking if lockable
resources are defined by keys that may be deleted if not protected.
Until the lock request is inserted into the lock manager’s data
structures, the latch on the data structure in the buffer pool is
required to ensure the existence of the key value. On the other hand,
if a lock cannot be granted immediately, the thread should not hold a
latch while the transaction waits. Thus, after waiting for a key value
lock, a transaction must repeat its root-to-leaf search for the key.
"""
So I strongly suspect that some other systems have found it useful to
escalate from a latch (buffer/page lock) to a lock (heavyweight lock).
I have some concerns about what you've done that may limit my
immediate ability to judge performance, and the relative merits of
both approaches generally. Now, I know you just wanted to sketch
something out, and that's fine, but I'm only sharing my thoughts. I am
particularly worried about the worst case (for either approach),
particularly with more than 1 unique index. I am also worried about
livelock hazards (again, in particular with more than 1 index) - I am
not asserting that they exist in your patch, but they are definitely
more difficult to reason about. Value locking works because once a
page lock is acquired, all unique indexes are inserted into. Could you
have two upserters livelock each other with two unique indexes with
1:1 correlated values in practice (i.e. 2 unique indexes that might
almost work as 1 composite index)? That is a reasonable usage of
upsert, I think.
We never wait on another transaction if there is a conflict when
inserting - we just do the usual UNIQUE_CHECK_PARTIAL thing (we don't
wait for other xact during btree insertion). This breaks the IGNORE
case (how does it determine the final outcome of the transaction that
inserted what may be a conflict, iff the conflict was only found
during insertion?), which would probably be fine for our purposes if
that were the only issue, but I have concerns about its effects on the
ON DUPLICATE KEY LOCK FOR UPDATE case too. I don't like that an
upserter's ExecInsertIndexTuples() won't wait on other xids generally,
I think. Why should the code btree-insert even though it knows it's
going to kill the heap tuple? It makes things very hard to reason
about.
If you are just mostly thinking about exclusion constraints here, then
I'm not sure that even at this point that it's okay that the IGNORE
case doesn't work there, because IGNORE is the only thing that makes
much sense for exclusion constraints.
The unacceptable-deadlocking-pattern generally occurs when we try to
lock two different row versions. Your patch is fairly easy to make
deadlock.
Regarding this:
/*
* At this point we have either a conflict or a potential conflict. If
* we're not supposed to raise error, just return the fact of the
* potential conflict without waiting to see if it's real.
*/
if (errorOK && !wait)
{
conflict = true;
if (conflictTid)
*conflictTid = tup->t_self;
break;
}
Don't we really just have only a potential conflict? Even if
conflictTid is committed?
I think it's odd that you insert btree index tuples without ever
worrying about waiting (which is what breaks the IGNORE case, you
might say). UNIQUE_CHECK_PARTIAL never gives an xid to wait on from
within _bt_check_unique(). Won't that itself make other sessions block
pending the outcome of our transaction (in non-upserting
ExecInsertIndexTuples(), or in ExecCheckIndexConstraints())? Could
that be why your patch deadlocks unreasonably (that is, in the way
you've already agreed, in your most recent mail, isn't okay)?
Isn't it only already okay that UNIQUE_CHECK_PARTIAL might do that for
deferred unique indexes because of re-checking, which may then abort
the xact?
How will this work?:
* XXX: If we know or assume that there are few duplicates, it would
* be better to skip this, and just optimistically proceed with the
* insertion below. You would then leave behind some garbage when a
* conflict happens, but if it's rare, it doesn't matter much. Some
* kind of heuristic might be in order here, like stop doing these
* pre-checks if the last 100 insertions have not been duplicates.
...when you consider that the only place a tid can come from is this pre-check?
Anyway, consider the following simple test-case of your patch.
postgres=# create unlogged table foo
(
a int4 primary key,
b int4 unique
);
CREATE TABLE
If I run the attached pgbench script like this:
pg@hamster:~/pgbench-tools/tests$ pgbench -f upsert.sql -n -c 50 -T 20
I can get it to deadlock (and especially to throw unique constraint
violations) like crazy. Single unique indexes seemed okay, though I
have my doubts that only allowing one unique index gets us far, or
that it will be acceptable to have the user specify a unique index in
DML or something. I discussed this with Robert in relation to his
design upthread. Multiple unique constraints were *always* the hard
case. I mean, my patch only really does something unconventional
*because* of that case, really. One unique index is easy.
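(The attached upsert.sql isn't reproduced in this archive, but a pgbench
script of roughly the following shape - two independently generated
values feeding the two unique constraints - produces the kind of
contention described; the exact script used may well have differed:)
\setrandom a 1 100
\setrandom b 1 100
with r as (
insert into foo(a, b)
values (:a, :b)
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
Because a and b are generated independently, the UPDATE of b can itself
collide with the unique constraint on b, which is one source of the
unique violations mentioned above.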
Leaving discussion of value locking aside, just how rough is this
revision of yours? What do you think of certain controversial aspects
of my design that remain unchanged, such as the visibility trick (as
actually implemented, and/or just in principle)? What about the syntax
itself? It is certainly valuable to have additional MERGE-like
functionality above and beyond the basic "upsert", not least for
multi-master conflict resolution with complex resolution policies, and
this syntax gets us much of that.
How would you feel about making it possible for the UPDATE to use a
tidscan, by projecting out the tid that caused a conflict, as a
semi-documented optimization? It might be unfortunate if someone tried
to UPDATE based on that ctid twice, but that is a less common
requirement. It is kind of an abuse of notation, because of course
you're not supposed to be projecting out the conflict-causer but the
rejects, but perhaps we can live with that, if we can live with the
basic idea.
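Concretely, I'm imagining something along these lines (not part of the
posted patch - the ctid projection here is hypothetical):
with r as (
insert into foo(a, b)
values (1, 2)
on duplicate key lock for update
returning rejects *, ctid -- hypothetically, the ctid of the conflicting (now locked) row
)
update foo set b = r.b from r where foo.ctid = r.ctid;
The join on ctid is what would let the UPDATE use a tidscan; the abuse
is that everything else in REJECTS refers to the tuple proposed for
insertion, while this ctid would refer to the existing row.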
I'm sorry if my thoughts here are not fully developed, but it's hard
to pin this stuff down. Especially since I'm guessing what is and
isn't essential to your design in this rough sketch.
Thanks
[1]: http://zfs.informatik.rwth-aachen.de/btw2007/paper/p18.pdf
[2]: http://www.hpl.hp.com/techreports/2010/HPL-2010-9.pdf
--
Peter Geoghegan
Attachments:
On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
pg@hamster:~/pgbench-tools/tests$ pgbench -f upsert.sql -n -c 50 -T 20
I can get it to deadlock (and especially to throw unique constraint
violations) like crazy.
I'm sorry, this test-case is an earlier one that is actually entirely
invalid for the purpose stated (though my concerns stated above remain
- I just didn't think the multi-unique-index case had been exercised
enough, and so did this at the last minute). Please omit it from your
consideration. I think I have been working too late...
--
Peter Geoghegan
On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
I have some concerns about what you've done that may limit my
immediate ability to judge performance, and the relative merits of
both approaches generally. Now, I know you just wanted to sketch
something out, and that's fine, but I'm only sharing my thoughts. I am
particularly worried about the worst case (for either approach),
particularly with more than 1 unique index. I am also worried about
livelock hazards (again, in particular with more than 1 index) - I am
not asserting that they exist in your patch, but they are definitely
more difficult to reason about. Value locking works because once a
page lock is acquired, all unique indexes are inserted into. Could you
have two upserters livelock each other with two unique indexes with
1:1 correlated values in practice (i.e. 2 unique indexes that might
almost work as 1 composite index)? That is a reasonable usage of
upsert, I think.
So I had it backwards: In fact, it isn't possible to get your patch to
deadlock when it should - it livelocks instead (where with my patch,
as far as I can tell, we predictably and correctly have detected
deadlocks). I see an infinite succession of "insertion conflicted
after pre-check" DEBUG1 elog messages, and no progress, which is an
obvious indication of livelock. My test does involve 2 unique indexes
- that's generally the hard case to get right. Dozens of backends are
tied-up in livelock.
Test case for this is attached. My patch is considerably slowed down
by the way this test-case tangles everything up, but does get through
each pgbench run/loop in the bash script predictably enough. And when
I kill the test-case, a bunch of backends are not left around, stuck
in perpetual livelock (with my patch it takes only a few seconds for
the deadlock detector to get around to killing every backend).
I'm also seeing this:
Client 45 aborted in state 2: ERROR: attempted to lock invisible tuple
Client 55 aborted in state 2: ERROR: attempted to lock invisible tuple
Client 41 aborted in state 2: ERROR: attempted to lock invisible tuple
To me this seems like a problem with the (potential) total lack of
locking that your approach takes (inserting btree unique index tuples
as in your patch is a form of value locking...sort of...it's a little
hard to reason about as presented). Do you think this might be an
inherent problem, or can you suggest a way to make your approach still
work?
So I probably should have previously listed as a requirement for our design:
* Doesn't just work with one unique index. Naming a unique index
directly in DML, or assuming that the PK is intended seems quite weak
to me.
This is something I discussed plenty with Robert, and I guess I just
forgot to repeat myself when asked.
Thanks
--
Peter Geoghegan
On 11/18/2013 06:44 AM, Heikki Linnakangas wrote:
I think it's important to recap the design goals of this. I don't think
these have been listed before, so let me try:
* It should be usable and perform well for both large batch updates and
small transactions.
* It should perform well both when there are no duplicates, and when
there are lots of duplicates.
And from that follows some finer requirements:
* Performance when there are no duplicates should be close to raw INSERT
performance.
* Performance when all rows are duplicates should be close to raw UPDATE
performance.
* We should not leave behind large numbers of dead tuples in either case.
I think this is setting the bar way too high for an initial feature.
Would we like to eventually have all of those things? Yes. Do we need
to have all of them for 9.4? No.
It's more useful to measure this feature against the current
alternatives used by our users, which are upsert functions and similar
patterns. If we can make things easier and more efficient than those
(which shouldn't be hard), then it's a worthwhile step forwards.
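(For comparison, the kind of thing meant by "upsert functions and similar
patterns" is the retry loop long shown in the PL/pgSQL documentation; a
minimal sketch, with names invented here, against the foo table used
elsewhere in this thread:)
create function upsert_foo(k int4, v int4) returns void as $$
begin
  loop
    update foo set b = v where a = k;
    if found then
      return;
    end if;
    begin
      insert into foo(a, b) values (k, v);
      return;
    exception when unique_violation then
      -- somebody else inserted concurrently; loop back and retry the UPDATE
    end;
  end loop;
end;
$$ language plpgsql;
Each trip through the exception block is a subtransaction, so every
conflict consumes an xid - which is relevant to the point about burning
through xids below.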
That being said, the other requirement I am concerned about is being
able to support the syntax of this feature in commonly used ORMs. That
is, can I write a fairly small Django or Rails extension which does
upsert using this patch? Fortunately, I think I can ...
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Nov 26, 2013 at 9:11 AM, Josh Berkus <josh@agliodbs.com> wrote:
* It should be usable and perform well for both large batch updates and
small transactions.
* It should perform well both when there are no duplicates, and when
there are lots of duplicates.
And from that follows some finer requirements:
* Performance when there are no duplicates should be close to raw INSERT
performance.
* Performance when all rows are duplicates should be close to raw UPDATE
performance.
* We should not leave behind large numbers of dead tuples in either case.
I think this is setting the bar way too high for an initial feature.
Would we like to eventually have all of those things? Yes. Do we need
to have all of them for 9.4? No.
The requirements around performance/bloat have a lot to do with making
the feature work reasonably well for multi-master conflict resolution.
They also have much more to do with the worst case than the average
case. If the worst case really is terribly bad, that ends up being a
major gotcha. I'm not concerned about bloat as such, but in any case
whether or not Heikki's design can mostly avoid bloat is, for now, of
secondary importance.
I feel the need to re-iterate something I've already said: I don't see
that I have a concession to make here with a view to pragmatically
getting something useful into 9.4. I am playing it as safe as I think
I can.
It's more useful to measure this feature against the current
alternatives used by our users, which are upsert functions and similar
patterns. If we can make things easier and more efficient than those
(which shouldn't be hard), then it's a worthwhile step forwards.
Actually, it's very hard: those patterns get away with burning a
subtransaction - and so an xid - per conflict, and I don't have license
to burn through xids like that.
--
Peter Geoghegan
On 11/26/13 01:59, Peter Geoghegan wrote:
On Sat, Nov 23, 2013 at 11:52 PM, Peter Geoghegan <pg@heroku.com> wrote:
I have some concerns about what you've done that may limit my
immediate ability to judge performance, and the relative merits of
both approaches generally. Now, I know you just wanted to sketch
something out, and that's fine, but I'm only sharing my thoughts. I am
particularly worried about the worst case (for either approach),
particularly with more than 1 unique index. I am also worried about
livelock hazards (again, in particular with more than 1 index) - I am
not asserting that they exist in your patch, but they are definitely
more difficult to reason about. Value locking works because once a
page lock is acquired, all unique indexes are inserted into. Could you
have two upserters livelock each other with two unique indexes with
1:1 correlated values in practice (i.e. 2 unique indexes that might
almost work as 1 composite index)? That is a reasonable usage of
upsert, I think.
So I had it backwards: In fact, it isn't possible to get your patch to
deadlock when it should - it livelocks instead (where with my patch,
as far as I can tell, we predictably and correctly have detected
deadlocks). I see an infinite succession of "insertion conflicted
after pre-check" DEBUG1 elog messages, and no progress, which is an
obvious indication of livelock. My test does involve 2 unique indexes
- that's generally the hard case to get right. Dozens of backends are
tied-up in livelock.
Test case for this is attached.
Great, thanks! I forgot to reset the "conflicted" variable when looping
to retry, so that once it got into the "insertion conflicted after
pre-check" situation, it never got out of it.
After fixing that bug, I'm getting a correctly-detected deadlock every
now and then with that test case.
I'm also seeing this:
Client 45 aborted in state 2: ERROR: attempted to lock invisible tuple
Client 55 aborted in state 2: ERROR: attempted to lock invisible tuple
Client 41 aborted in state 2: ERROR: attempted to lock invisible tuple
Hmm. That's because the trick I used to kill the just-inserted tuple
confuses a concurrent heap_lock_tuple call. It doesn't expect the tuple
it's locking to become invisible. Actually, doesn't your patch have the
same bug? If you're about to lock a tuple in ON DUPLICATE KEY LOCK FOR
UPDATE, and the transaction that inserted the duplicate row aborts just
before the heap_lock_tuple() call, I think you'd also see that error.
To me this seems like a problem with the (potential) total lack of
locking that your approach takes (inserting btree unique index tuples
as in your patch is a form of value locking...sort of...it's a little
hard to reason about as presented). Do you think this might be an
inherent problem, or can you suggest a way to make your approach still
work?
Just garden-variety bugs :-). Attached patch fixes both issues.
So I probably should have previously listed as a requirement for our design:
* Doesn't just work with one unique index. Naming a unique index
directly in DML, or assuming that the PK is intended seems quite weak
to me.
This is something I discussed plenty with Robert, and I guess I just
forgot to repeat myself when asked.
Totally agreed on that.
- Heikki
Attachments:
insert_on_dup-kill-on-conflict-2.patch (text/x-diff)
*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
***************
*** 1418,1423 **** JumbleQuery(pgssJumbleState *jstate, Query *query)
--- 1418,1424 ----
JumbleRangeTable(jstate, query->rtable);
JumbleExpr(jstate, (Node *) query->jointree);
JumbleExpr(jstate, (Node *) query->targetList);
+ APP_JUMB(query->specClause);
JumbleExpr(jstate, (Node *) query->returningList);
JumbleExpr(jstate, (Node *) query->groupClause);
JumbleExpr(jstate, query->havingQual);
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 2541,2551 **** compute_infobits(uint16 infomask, uint16 infomask2)
* (the last only for HeapTupleSelfUpdated, since we
* cannot obtain cmax from a combocid generated by another transaction).
* See comments for struct HeapUpdateFailureData for additional info.
*/
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2541,2555 ----
* (the last only for HeapTupleSelfUpdated, since we
* cannot obtain cmax from a combocid generated by another transaction).
* See comments for struct HeapUpdateFailureData for additional info.
+ *
+ * If 'kill' is true, we're killing a tuple we just inserted in the same
+ * command. Instead of the normal visibility checks, we check that the tuple
+ * was inserted by the current transaction and given command id.
*/
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd, bool kill)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***************
*** 2601,2607 **** heap_delete(Relation relation, ItemPointer tid,
tp.t_self = *tid;
l1:
! result = HeapTupleSatisfiesUpdate(&tp, cid, buffer);
if (result == HeapTupleInvisible)
{
--- 2605,2620 ----
tp.t_self = *tid;
l1:
! if (!kill)
! result = HeapTupleSatisfiesUpdate(&tp, cid, buffer);
! else
! {
! if (tp.t_data->t_choice.t_heap.t_xmin != xid ||
! tp.t_data->t_choice.t_heap.t_field3.t_cid != cid)
! elog(ERROR, "attempted to kill a tuple inserted by another transaction or command");
! result = HeapTupleMayBeUpdated;
! }
!
if (result == HeapTupleInvisible)
{
***************
*** 2870,2876 **** simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
! &hufd);
switch (result)
{
case HeapTupleSelfUpdated:
--- 2883,2889 ----
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
true /* wait for commit */ ,
! &hufd, false);
switch (result)
{
case HeapTupleSelfUpdated:
***************
*** 3950,3957 **** l3:
if (result == HeapTupleInvisible)
{
! UnlockReleaseBuffer(*buffer);
! elog(ERROR, "attempted to lock invisible tuple");
}
else if (result == HeapTupleBeingUpdated)
{
--- 3963,3975 ----
if (result == HeapTupleInvisible)
{
! LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
! /*
! * this is expected if we're locking a tuple in ON DUPLICATE KEY LOCK
! * FOR UPDATE mode, and the inserting transaction killed the tuple in
! * the same transaction.
! */
! return HeapTupleInvisible;
}
else if (result == HeapTupleBeingUpdated)
{
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 1644,1651 **** BuildIndexInfo(Relation index)
ii->ii_ExclusionStrats = NULL;
}
/* other info */
- ii->ii_Unique = indexStruct->indisunique;
ii->ii_ReadyForInserts = IndexIsReady(indexStruct);
/* initialize index-build state to default */
--- 1644,1692 ----
ii->ii_ExclusionStrats = NULL;
}
+ /*
+ * fetch info for checking unique constraints. (this is currently only
+ * used by ExecCheckIndexConstraints(), for INSERT ... ON DUPLICATE KEY.
+ * In regular insertions, the index AM handles the unique check itself.
+ * Might make sense to do this lazily, only when needed)
+ */
+ if (indexStruct->indisunique)
+ {
+ int ncols = index->rd_rel->relnatts;
+
+ if (index->rd_rel->relam != BTREE_AM_OID)
+ elog(ERROR, "only b-tree indexes are supported for foreign keys");
+
+ ii->ii_UniqueOps = (Oid *) palloc(sizeof(Oid) * ncols);
+ ii->ii_UniqueProcs = (Oid *) palloc(sizeof(Oid) * ncols);
+ ii->ii_UniqueStrats = (uint16 *) palloc(sizeof(uint16) * ncols);
+
+ /*
+ * We have to look up the operator's strategy number. This
+ * provides a cross-check that the operator does match the index.
+ */
+ /* We need the func OIDs and strategy numbers too */
+ for (i = 0; i < ncols; i++)
+ {
+ ii->ii_UniqueStrats[i] = BTEqualStrategyNumber;
+ ii->ii_UniqueOps[i] =
+ get_opfamily_member(index->rd_opfamily[i],
+ index->rd_opcintype[i],
+ index->rd_opcintype[i],
+ ii->ii_UniqueStrats[i]);
+ ii->ii_UniqueProcs[i] = get_opcode(ii->ii_UniqueOps[i]);
+ }
+ ii->ii_Unique = true;
+ }
+ else
+ {
+ ii->ii_UniqueOps = NULL;
+ ii->ii_UniqueProcs = NULL;
+ ii->ii_UniqueStrats = NULL;
+ ii->ii_Unique = false;
+ }
+
/* other info */
ii->ii_ReadyForInserts = IndexIsReady(indexStruct);
/* initialize index-build state to default */
***************
*** 2566,2575 **** IndexCheckExclusion(Relation heapRelation,
/*
* Check that this tuple has no conflicts.
*/
! check_exclusion_constraint(heapRelation,
indexRelation, indexInfo,
&(heapTuple->t_self), values, isnull,
! estate, true, false);
}
heap_endscan(scan);
--- 2607,2616 ----
/*
* Check that this tuple has no conflicts.
*/
! check_exclusion_or_unique_constraint(heapRelation,
indexRelation, indexInfo,
&(heapTuple->t_self), values, isnull,
! estate, true, false, true, NULL);
}
heap_endscan(scan);
*** a/src/backend/commands/constraint.c
--- b/src/backend/commands/constraint.c
***************
*** 170,178 **** unique_key_recheck(PG_FUNCTION_ARGS)
* For exclusion constraints we just do the normal check, but now it's
* okay to throw error.
*/
! check_exclusion_constraint(trigdata->tg_relation, indexRel, indexInfo,
&(new_row->t_self), values, isnull,
! estate, false, false);
}
/*
--- 170,178 ----
* For exclusion constraints we just do the normal check, but now it's
* okay to throw error.
*/
! check_exclusion_or_unique_constraint(trigdata->tg_relation, indexRel, indexInfo,
&(new_row->t_self), values, isnull,
! estate, false, false, true, NULL);
}
/*
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 2284,2290 **** CopyFrom(CopyState cstate)
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate);
/* AFTER ROW INSERT Triggers */
ExecARInsertTriggers(estate, resultRelInfo, tuple,
--- 2284,2290 ----
if (resultRelInfo->ri_NumIndices > 0)
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate, false);
/* AFTER ROW INSERT Triggers */
ExecARInsertTriggers(estate, resultRelInfo, tuple,
***************
*** 2391,2397 **** CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
! estate);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
recheckIndexes);
--- 2391,2397 ----
ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
recheckIndexes =
ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
! estate, false);
ExecARInsertTriggers(estate, resultRelInfo,
bufferedTuples[i],
recheckIndexes);
*** a/src/backend/executor/execUtils.c
--- b/src/backend/executor/execUtils.c
***************
*** 990,996 **** ExecCloseIndices(ResultRelInfo *resultRelInfo)
*
* This returns a list of index OIDs for any unique or exclusion
* constraints that are deferred and that had
! * potential (unconfirmed) conflicts.
*
* CAUTION: this must not be called for a HOT update.
* We can't defend against that here for lack of info.
--- 990,997 ----
*
* This returns a list of index OIDs for any unique or exclusion
* constraints that are deferred and that had
! * potential (unconfirmed) conflicts. (if noErrorOnDuplicate == true,
! * the same is done for non-deferred constraints)
*
* CAUTION: this must not be called for a HOT update.
* We can't defend against that here for lack of info.
***************
*** 1000,1006 **** ExecCloseIndices(ResultRelInfo *resultRelInfo)
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
! EState *estate)
{
List *result = NIL;
ResultRelInfo *resultRelInfo;
--- 1001,1008 ----
List *
ExecInsertIndexTuples(TupleTableSlot *slot,
ItemPointer tupleid,
! EState *estate,
! bool noErrorOnDuplicate)
{
List *result = NIL;
ResultRelInfo *resultRelInfo;
***************
*** 1092,1100 **** ExecInsertIndexTuples(TupleTableSlot *slot,
--- 1094,1107 ----
* For a deferrable unique index, we tell the index AM to just detect
* possible non-uniqueness, and we add the index OID to the result
* list if further checking is needed.
+ *
+ * For a IGNORE/REJECT DUPLICATES insertion, just detect possible
+ * non-uniqueness, and tell the caller if it failed.
*/
if (!indexRelation->rd_index->indisunique)
checkUnique = UNIQUE_CHECK_NO;
+ else if (noErrorOnDuplicate)
+ checkUnique = UNIQUE_CHECK_PARTIAL;
else if (indexRelation->rd_index->indimmediate)
checkUnique = UNIQUE_CHECK_YES;
else
***************
*** 1121,1133 **** ExecInsertIndexTuples(TupleTableSlot *slot,
*/
if (indexInfo->ii_ExclusionOps != NULL)
{
! bool errorOK = !indexRelation->rd_index->indimmediate;
satisfiesConstraint =
! check_exclusion_constraint(heapRelation,
indexRelation, indexInfo,
tupleid, values, isnull,
! estate, false, errorOK);
}
if ((checkUnique == UNIQUE_CHECK_PARTIAL ||
--- 1128,1142 ----
*/
if (indexInfo->ii_ExclusionOps != NULL)
{
! bool errorOK = (!indexRelation->rd_index->indimmediate &&
! !noErrorOnDuplicate);
satisfiesConstraint =
! check_exclusion_or_unique_constraint(heapRelation,
indexRelation, indexInfo,
tupleid, values, isnull,
! estate, false, errorOK, false,
! NULL);
}
if ((checkUnique == UNIQUE_CHECK_PARTIAL ||
***************
*** 1146,1163 **** ExecInsertIndexTuples(TupleTableSlot *slot,
return result;
}
/*
! * Check for violation of an exclusion constraint
*
* heap: the table containing the new tuple
* index: the index supporting the exclusion constraint
* indexInfo: info about the index, including the exclusion properties
! * tupleid: heap TID of the new tuple we have just inserted
* values, isnull: the *index* column values computed for the new tuple
* estate: an EState we can do evaluation in
* newIndex: if true, we are trying to build a new index (this affects
* only the wording of error messages)
* errorOK: if true, don't throw error for violation
*
* Returns true if OK, false if actual or potential violation
*
--- 1155,1294 ----
return result;
}
+ /* ----------------------------------------------------------------
+ * ExecCheckIndexConstraints
+ *
+ * This routine checks if a tuple violates any unique or
+ * exclusion constraints. If no conflict, returns true.
+ * Otherwise returns false, and the TID of the conflicting
+ * tuple is returned in *conflictTid
+ *
+ *
+ * Note that this doesn't lock the values in any way, so it's
+ * possible that a conflicting tuple is inserted immediately
+ * after this returns, and a later insert with the same values
+ * still conflicts. But this can be used for a pre-check before
+ * insertion.
+ * ----------------------------------------------------------------
+ */
+ bool
+ ExecCheckIndexConstraints(TupleTableSlot *slot,
+ EState *estate, ItemPointer conflictTid)
+ {
+ ResultRelInfo *resultRelInfo;
+ int i;
+ int numIndices;
+ RelationPtr relationDescs;
+ Relation heapRelation;
+ IndexInfo **indexInfoArray;
+ ExprContext *econtext;
+ Datum values[INDEX_MAX_KEYS];
+ bool isnull[INDEX_MAX_KEYS];
+ ItemPointerData invalidItemPtr;
+
+ ItemPointerSetInvalid(&invalidItemPtr);
+
+
+ /*
+ * Get information from the result relation info structure.
+ */
+ resultRelInfo = estate->es_result_relation_info;
+ numIndices = resultRelInfo->ri_NumIndices;
+ relationDescs = resultRelInfo->ri_IndexRelationDescs;
+ indexInfoArray = resultRelInfo->ri_IndexRelationInfo;
+ heapRelation = resultRelInfo->ri_RelationDesc;
+
+ /*
+ * We will use the EState's per-tuple context for evaluating predicates
+ * and index expressions (creating it if it's not already there).
+ */
+ econtext = GetPerTupleExprContext(estate);
+
+ /* Arrange for econtext's scan tuple to be the tuple under test */
+ econtext->ecxt_scantuple = slot;
+
+ /*
+ * for each index, form and insert the index tuple
+ */
+ for (i = 0; i < numIndices; i++)
+ {
+ Relation indexRelation = relationDescs[i];
+ IndexInfo *indexInfo;
+ bool satisfiesConstraint;
+
+ if (indexRelation == NULL)
+ continue;
+
+ indexInfo = indexInfoArray[i];
+
+ if (!indexInfo->ii_Unique && !indexInfo->ii_ExclusionOps)
+ continue;
+
+ /* If the index is marked as read-only, ignore it */
+ if (!indexInfo->ii_ReadyForInserts)
+ continue;
+
+ /* Check for partial index */
+ if (indexInfo->ii_Predicate != NIL)
+ {
+ List *predicate;
+
+ /*
+ * If predicate state not set up yet, create it (in the estate's
+ * per-query context)
+ */
+ predicate = indexInfo->ii_PredicateState;
+ if (predicate == NIL)
+ {
+ predicate = (List *)
+ ExecPrepareExpr((Expr *) indexInfo->ii_Predicate,
+ estate);
+ indexInfo->ii_PredicateState = predicate;
+ }
+
+ /* Skip this index-update if the predicate isn't satisfied */
+ if (!ExecQual(predicate, econtext, false))
+ continue;
+ }
+
+ /*
+ * FormIndexDatum fills in its values and isnull parameters with the
+ * appropriate values for the column(s) of the index.
+ */
+ FormIndexDatum(indexInfo,
+ slot,
+ estate,
+ values,
+ isnull);
+
+ satisfiesConstraint =
+ check_exclusion_or_unique_constraint(heapRelation,
+ indexRelation, indexInfo,
+ &invalidItemPtr, values, isnull,
+ estate, false, true, true,
+ conflictTid);
+ if (!satisfiesConstraint)
+ return false;
+ }
+
+ return true;
+ }
+
/*
! * Check for violation of an exclusion or unique constraint
*
* heap: the table containing the new tuple
* index: the index supporting the exclusion constraint
* indexInfo: info about the index, including the exclusion properties
! * tupleid: heap TID of the new tuple we have just inserted (invalid if we
! * haven't inserted a new tuple yet)
* values, isnull: the *index* column values computed for the new tuple
* estate: an EState we can do evaluation in
* newIndex: if true, we are trying to build a new index (this affects
* only the wording of error messages)
* errorOK: if true, don't throw error for violation
+ * wait: if true, wait for conflicting transaction to finish, even if !errorOK
+ * conflictTid: if not-NULL, the TID of conflicting tuple is returned here.
*
* Returns true if OK, false if actual or potential violation
*
***************
*** 1169,1182 **** ExecInsertIndexTuples(TupleTableSlot *slot,
*
* When errorOK is false, we'll throw error on violation, so a false result
* is impossible.
*/
bool
! check_exclusion_constraint(Relation heap, Relation index, IndexInfo *indexInfo,
ItemPointer tupleid, Datum *values, bool *isnull,
! EState *estate, bool newIndex, bool errorOK)
{
! Oid *constr_procs = indexInfo->ii_ExclusionProcs;
! uint16 *constr_strats = indexInfo->ii_ExclusionStrats;
Oid *index_collations = index->rd_indcollation;
int index_natts = index->rd_index->indnatts;
IndexScanDesc index_scan;
--- 1300,1319 ----
*
* When errorOK is false, we'll throw error on violation, so a false result
* is impossible.
+ *
+ * Note: The indexam is normally responsible for checking unique constraints,
+ * so this normally only needs to be used for exclusion constraints. But this
+ * is done when doing a "pre-check" for conflicts with "INSERT ... ON DUPLICATE
+ * KEY", before inserting the actual tuple.
*/
bool
! check_exclusion_or_unique_constraint(Relation heap, Relation index, IndexInfo *indexInfo,
ItemPointer tupleid, Datum *values, bool *isnull,
! EState *estate, bool newIndex,
! bool errorOK, bool wait, ItemPointer conflictTid)
{
! Oid *constr_procs;
! uint16 *constr_strats;
Oid *index_collations = index->rd_indcollation;
int index_natts = index->rd_index->indnatts;
IndexScanDesc index_scan;
***************
*** 1190,1195 **** check_exclusion_constraint(Relation heap, Relation index, IndexInfo *indexInfo,
--- 1327,1343 ----
TupleTableSlot *existing_slot;
TupleTableSlot *save_scantuple;
+ if (indexInfo->ii_ExclusionOps)
+ {
+ constr_procs = indexInfo->ii_ExclusionProcs;
+ constr_strats = indexInfo->ii_ExclusionStrats;
+ }
+ else
+ {
+ constr_procs = indexInfo->ii_UniqueProcs;
+ constr_strats = indexInfo->ii_UniqueStrats;
+ }
+
/*
* If any of the input values are NULL, the constraint check is assumed to
* pass (i.e., we assume the operators are strict).
***************
*** 1253,1259 **** retry:
/*
* Ignore the entry for the tuple we're trying to check.
*/
! if (ItemPointerEquals(tupleid, &tup->t_self))
{
if (found_self) /* should not happen */
elog(ERROR, "found self tuple multiple times in index \"%s\"",
--- 1401,1408 ----
/*
* Ignore the entry for the tuple we're trying to check.
*/
! if (ItemPointerIsValid(tupleid) &&
! ItemPointerEquals(tupleid, &tup->t_self))
{
if (found_self) /* should not happen */
elog(ERROR, "found self tuple multiple times in index \"%s\"",
***************
*** 1287,1295 **** retry:
* we're not supposed to raise error, just return the fact of the
* potential conflict without waiting to see if it's real.
*/
! if (errorOK)
{
conflict = true;
break;
}
--- 1436,1446 ----
* we're not supposed to raise error, just return the fact of the
* potential conflict without waiting to see if it's real.
*/
! if (errorOK && !wait)
{
conflict = true;
+ if (conflictTid)
+ *conflictTid = tup->t_self;
break;
}
***************
*** 1314,1319 **** retry:
--- 1465,1478 ----
/*
* We have a definite conflict. Report it.
*/
+ if (errorOK)
+ {
+ conflict = true;
+ if (conflictTid)
+ *conflictTid = tup->t_self;
+ break;
+ }
+
error_new = BuildIndexValueDescription(index, values, isnull);
error_existing = BuildIndexValueDescription(index, existing_values,
existing_isnull);
***************
*** 1345,1350 **** retry:
--- 1504,1512 ----
* However, it is possible to define exclusion constraints for which that
* wouldn't be true --- for instance, if the operator is <>. So we no
* longer complain if found_self is still false.
+ *
+ * It would also not be true in the pre-check mode, when we haven't
+ * inserted a tuple yet.
*/
econtext->ecxt_scantuple = save_scantuple;
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
***************
*** 39,44 ****
--- 39,45 ----
#include "access/htup_details.h"
#include "access/xact.h"
+ #include "catalog/catalog.h"
#include "commands/trigger.h"
#include "executor/executor.h"
#include "executor/nodeModifyTable.h"
***************
*** 152,157 **** ExecProcessReturning(ProjectionInfo *projectReturning,
--- 153,264 ----
}
/* ----------------------------------------------------------------
+ * ExecLockHeapTupleForUpdateSpec: Try to lock tuple for update as part of
+ * speculative insertion.
+ *
+ * Returns value indicating if we're done with heap tuple locking, or if
+ * another attempt at value locking is required.
+ * ----------------------------------------------------------------
+ */
+ static bool
+ ExecLockHeapTupleForUpdateSpec(EState *estate,
+ ResultRelInfo *relinfo,
+ ItemPointer tid)
+ {
+ Relation relation = relinfo->ri_RelationDesc;
+ HeapTupleData tuple;
+ Buffer buffer;
+
+ HTSU_Result test;
+ HeapUpdateFailureData hufd;
+
+ Assert(ItemPointerIsValid(tid));
+
+ /*
+ * Lock tuple for update.
+ *
+ * Wait for other transaction to complete.
+ */
+ tuple.t_self = *tid;
+ test = heap_lock_tuple(relation, &tuple,
+ estate->es_output_cid,
+ LockTupleExclusive, false,
+ true, &buffer, &hufd);
+ ReleaseBuffer(buffer);
+
+ switch (test)
+ {
+ case HeapTupleInvisible:
+ /*
+ * This can happen if the inserting transaction aborted. Try again.
+ */
+ return false;
+
+ case HeapTupleSelfUpdated:
+ /*
+ * The target tuple was already updated or deleted by the current
+ * command, or by a later command in the current transaction. We
+ * conclude that we're done in the former case, and throw an error
+ * in the latter case, for the same reasons enumerated in
+ * ExecUpdate and ExecDelete.
+ */
+ if (hufd.cmax != estate->es_output_cid)
+ ereport(ERROR,
+ (errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
+ errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
+ errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
+
+ /*
+ * The fact that this command has already updated or deleted the
+ * tuple is grounds for concluding that we're done. Appropriate
+ * locks will already be held. It isn't the responsibility of the
+ * speculative insertion LOCK FOR UPDATE infrastructure to ensure
+ * an atomic INSERT-or-UPDATE in the event of a tuple being updated
+ * or deleted by the same xact in the interim.
+ */
+ return true;
+ case HeapTupleMayBeUpdated:
+ /*
+ * Success -- we're done, as tuple is locked and known to be
+ * visible to our snapshot under conventional MVCC rules if the
+ * current isolation level mandates that (in READ COMMITTED mode, a
+ * special exception to the conventional rules applies).
+ */
+ return true;
+ case HeapTupleUpdated:
+ if (IsolationUsesXactSnapshot())
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("could not serialize access due to concurrent update")));
+ /*
+ * Tell caller to try again from the very start. We don't use the
+ * usual EvalPlanQual looping pattern here, fundamentally because
+ * we don't have a useful qual to verify the next tuple with.
+ *
+ * We might devise a means of verifying, by way of binary equality
+ * in a similar manner to HOT codepaths, if any unique indexed
+ * columns changed, but this would only serve to ameliorate the
+ * fundamental problem. It might well not be good enough, because
+ * those columns could change too. It's not clear that doing any
+ * better here would be worth it.
+ *
+ * At this point, all bets are off -- it might actually turn out to
+ * be okay to proceed with insertion instead of locking now (the
+ * tuple we attempted to lock could have been deleted, for
+ * example). On the other hand, it might not be okay, but for an
+ * entirely different reason, with an entirely separate TID to
+ * blame and lock. This TID may not even be part of the same
+ * update chain.
+ */
+ return false;
+ default:
+ elog(ERROR, "unrecognized heap_lock_tuple status: %u", test);
+ }
+
+ return false;
+ }
+
+ /* ----------------------------------------------------------------
* ExecInsert
*
* For INSERT, we have to insert the tuple into the target relation
***************
*** 164,176 **** static TupleTableSlot *
ExecInsert(TupleTableSlot *slot,
TupleTableSlot *planSlot,
EState *estate,
! bool canSetTag)
{
HeapTuple tuple;
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
Oid newId;
List *recheckIndexes = NIL;
/*
* get the heap tuple out of the tuple table slot, making sure we have a
--- 271,287 ----
ExecInsert(TupleTableSlot *slot,
TupleTableSlot *planSlot,
EState *estate,
! bool canSetTag,
! SpecType spec)
{
HeapTuple tuple;
ResultRelInfo *resultRelInfo;
Relation resultRelationDesc;
Oid newId;
List *recheckIndexes = NIL;
+ ProjectionInfo *returning;
+ bool rejects = (spec == SPEC_IGNORE_REJECTS ||
+ spec == SPEC_UPDATE_REJECTS);
/*
* get the heap tuple out of the tuple table slot, making sure we have a
***************
*** 183,188 **** ExecInsert(TupleTableSlot *slot,
--- 294,326 ----
*/
resultRelInfo = estate->es_result_relation_info;
resultRelationDesc = resultRelInfo->ri_RelationDesc;
+ returning = resultRelInfo->ri_projectReturning;
+
+ /*
+ * If speculative insertion is requested, take necessary precautions.
+ *
+ * Value locks are typically actually implemented by AMs as shared locks on
+ * buffers. This could be quite hazardous, because in the worst case those
+ * locks could be on catalog indexes, with the system then liable to
+ * deadlock due to innocent catalog access when inserting a heap tuple.
+ * However, we take a precaution against that here.
+ *
+ * Rather than forever committing to carefully managing these hazards
+ * during the extended (but still short) window after locking in which heap
+ * tuple insertion will potentially later take place (a window that ends,
+ * in the "insertion proceeds" case, when locks are released by the second
+ * phase of speculative insertion having completed for unique indexes), it
+ * is expedient to simply forbid speculative insertion into catalogs
+ * altogether. There is no consequence to allowing speculative insertion
+ * into TOAST tables, which we also forbid, but that doesn't seem terribly
+ * useful.
+ */
+ if (spec != SPEC_NONE &&
+ IsSystemRelation(resultRelationDesc))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("speculative insertion into catalogs and TOAST tables not supported"),
+ errtable(resultRelationDesc)));
/*
* If the result relation has OIDs, force the tuple's OID to zero so that
***************
*** 246,251 **** ExecInsert(TupleTableSlot *slot,
--- 384,392 ----
}
else
{
+ bool conflicted;
+ ItemPointerData conflictTid;
+
/*
* Constraints might reference the tableoid column, so initialize
* t_tableOid before evaluating them.
***************
*** 258,278 **** ExecInsert(TupleTableSlot *slot,
if (resultRelationDesc->rd_att->constr)
ExecConstraints(resultRelInfo, slot, estate);
/*
! * insert the tuple
*
! * Note: heap_insert returns the tid (location) of the new tuple in
! * the t_self field.
*/
! newId = heap_insert(resultRelationDesc, tuple,
! estate->es_output_cid, 0, NULL);
! /*
! * insert index entries for tuple
! */
! if (resultRelInfo->ri_NumIndices > 0)
! recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate);
}
if (canSetTag)
--- 399,498 ----
if (resultRelationDesc->rd_att->constr)
ExecConstraints(resultRelInfo, slot, estate);
+ retry:
+ conflicted = false;
+ ItemPointerSetInvalid(&conflictTid);
+
/*
! * If we are expecting duplicates, do a non-conclusive first check.
! * We might still fail later, after inserting the heap tuple, if a
! * conflicting row was inserted concurrently. We'll handle that by
! * deleting the already-inserted tuple and retrying, but that's fairly
! * expensive, so we try to avoid it.
*
! * XXX: If we know or assume that there are few duplicates, it would
! * be better to skip this, and just optimistically proceed with the
! * insertion below. You would then leave behind some garbage when a
! * conflict happens, but if it's rare, it doesn't matter much. Some
! * kind of heuristic might be in order here, like stop doing these
! * pre-checks if the last 100 insertions have not been duplicates.
*/
! if (spec != SPEC_NONE && resultRelInfo->ri_NumIndices > 0)
! {
! if (!ExecCheckIndexConstraints(slot, estate, &conflictTid))
! conflicted = true;
! }
! if (!conflicted)
! {
! /*
! * insert the tuple
! *
! * Note: heap_insert returns the tid (location) of the new tuple in
! * the t_self field.
! */
! newId = heap_insert(resultRelationDesc, tuple,
! estate->es_output_cid, 0, NULL);
!
! /*
! * Insert index entries for tuple.
! *
! * Locks will be acquired if needed, or the locks acquired by
! * ExecLockIndexTuples() may be used instead.
! */
! if (resultRelInfo->ri_NumIndices > 0)
! recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate,
! spec != SPEC_NONE);
!
! if (spec != SPEC_NONE && recheckIndexes)
! {
! HeapUpdateFailureData hufd;
! heap_delete(resultRelationDesc, &(tuple->t_self),
! estate->es_output_cid, NULL, false, &hufd, true);
! conflicted = true;
! }
! }
!
! if (conflicted)
! {
! if (spec == SPEC_UPDATE || spec == SPEC_UPDATE_REJECTS)
! {
! /*
! * Try to lock row for update.
! *
! * XXX: We don't have the TID of the conflicting tuple if
! * the index insertion failed and we had to kill the already
! * inserted tuple. We'd need to modify the index AM to pass
! * through the TID back here. So for now, we just retry, and
! * hopefully the new pre-check will fail on the same tuple
! * (or it's finished by now), and we'll get its TID that way
! */
! if (!ItemPointerIsValid(&conflictTid))
! {
! elog(DEBUG1, "insertion conflicted after pre-check");
! goto retry;
! }
!
! if (!ExecLockHeapTupleForUpdateSpec(estate,
! resultRelInfo,
! &conflictTid))
! {
! /*
! * Couldn't lock row - restart from just before value
! * locking. It's subtly wrong to assume anything about
! * the row version that is under consideration for
! * locking if another transaction locked it first.
! */
! goto retry;
! }
! }
!
! if (rejects)
! return ExecProcessReturning(returning, slot, planSlot);
! else
! return NULL;
! }
}
if (canSetTag)
***************
*** 291,300 **** ExecInsert(TupleTableSlot *slot,
if (resultRelInfo->ri_WithCheckOptions != NIL)
ExecWithCheckOptions(resultRelInfo, slot, estate);
! /* Process RETURNING if present */
! if (resultRelInfo->ri_projectReturning)
! return ExecProcessReturning(resultRelInfo->ri_projectReturning,
! slot, planSlot);
return NULL;
}
--- 511,522 ----
if (resultRelInfo->ri_WithCheckOptions != NIL)
ExecWithCheckOptions(resultRelInfo, slot, estate);
! /*
! * Process RETURNING if present and not only returning speculative
! * insertion rejects
! */
! if (returning && !rejects)
! return ExecProcessReturning(returning, slot, planSlot);
return NULL;
}
***************
*** 403,409 **** ldelete:;
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
! &hufd);
switch (result)
{
case HeapTupleSelfUpdated:
--- 625,632 ----
estate->es_output_cid,
estate->es_crosscheck_snapshot,
true /* wait for commit */ ,
! &hufd,
! false);
switch (result)
{
case HeapTupleSelfUpdated:
***************
*** 781,787 **** lreplace:;
*/
if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate);
}
if (canSetTag)
--- 1004,1010 ----
*/
if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! estate, false);
}
if (canSetTag)
***************
*** 1011,1017 **** ExecModifyTable(ModifyTableState *node)
switch (operation)
{
case CMD_INSERT:
! slot = ExecInsert(slot, planSlot, estate, node->canSetTag);
break;
case CMD_UPDATE:
slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
--- 1234,1241 ----
switch (operation)
{
case CMD_INSERT:
! slot = ExecInsert(slot, planSlot, estate, node->canSetTag,
! node->spec);
break;
case CMD_UPDATE:
slot = ExecUpdate(tupleid, oldtuple, slot, planSlot,
***************
*** 1086,1091 **** ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
--- 1310,1316 ----
mtstate->resultRelInfo = estate->es_result_relations + node->resultRelIndex;
mtstate->mt_arowmarks = (List **) palloc0(sizeof(List *) * nplans);
mtstate->mt_nplans = nplans;
+ mtstate->spec = node->spec;
/* set up epqstate with dummy subplan data for the moment */
EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL, node->epqParam);
***************
*** 1296,1301 **** ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
--- 1521,1527 ----
break;
case CMD_UPDATE:
case CMD_DELETE:
+ Assert(node->spec == SPEC_NONE);
junk_filter_needed = true;
break;
default:
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 182,187 **** _copyModifyTable(const ModifyTable *from)
--- 182,188 ----
COPY_NODE_FIELD(returningLists);
COPY_NODE_FIELD(fdwPrivLists);
COPY_NODE_FIELD(rowMarks);
+ COPY_SCALAR_FIELD(spec);
COPY_SCALAR_FIELD(epqParam);
return newnode;
***************
*** 2085,2090 **** _copyWithClause(const WithClause *from)
--- 2086,2103 ----
return newnode;
}
+ static ReturningClause *
+ _copyReturningClause(const ReturningClause *from)
+ {
+ ReturningClause *newnode = makeNode(ReturningClause);
+
+ COPY_NODE_FIELD(returningList);
+ COPY_SCALAR_FIELD(rejects);
+ COPY_LOCATION_FIELD(location);
+
+ return newnode;
+ }
+
static CommonTableExpr *
_copyCommonTableExpr(const CommonTableExpr *from)
{
***************
*** 2475,2480 **** _copyQuery(const Query *from)
--- 2488,2494 ----
COPY_NODE_FIELD(jointree);
COPY_NODE_FIELD(targetList);
COPY_NODE_FIELD(withCheckOptions);
+ COPY_SCALAR_FIELD(specClause);
COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(groupClause);
COPY_NODE_FIELD(havingQual);
***************
*** 2498,2504 **** _copyInsertStmt(const InsertStmt *from)
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(cols);
COPY_NODE_FIELD(selectStmt);
! COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(withClause);
return newnode;
--- 2512,2519 ----
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(cols);
COPY_NODE_FIELD(selectStmt);
! COPY_SCALAR_FIELD(specClause);
! COPY_NODE_FIELD(rlist);
COPY_NODE_FIELD(withClause);
return newnode;
***************
*** 2512,2518 **** _copyDeleteStmt(const DeleteStmt *from)
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(usingClause);
COPY_NODE_FIELD(whereClause);
! COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(withClause);
return newnode;
--- 2527,2533 ----
COPY_NODE_FIELD(relation);
COPY_NODE_FIELD(usingClause);
COPY_NODE_FIELD(whereClause);
! COPY_NODE_FIELD(rlist);
COPY_NODE_FIELD(withClause);
return newnode;
***************
*** 2527,2533 **** _copyUpdateStmt(const UpdateStmt *from)
COPY_NODE_FIELD(targetList);
COPY_NODE_FIELD(whereClause);
COPY_NODE_FIELD(fromClause);
! COPY_NODE_FIELD(returningList);
COPY_NODE_FIELD(withClause);
return newnode;
--- 2542,2548 ----
COPY_NODE_FIELD(targetList);
COPY_NODE_FIELD(whereClause);
COPY_NODE_FIELD(fromClause);
! COPY_NODE_FIELD(rlist);
COPY_NODE_FIELD(withClause);
return newnode;
***************
*** 4579,4584 **** copyObject(const void *from)
--- 4594,4602 ----
case T_WithClause:
retval = _copyWithClause(from);
break;
+ case T_ReturningClause:
+ retval = _copyReturningClause(from);
+ break;
case T_CommonTableExpr:
retval = _copyCommonTableExpr(from);
break;
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
***************
*** 859,864 **** _equalQuery(const Query *a, const Query *b)
--- 859,865 ----
COMPARE_NODE_FIELD(jointree);
COMPARE_NODE_FIELD(targetList);
COMPARE_NODE_FIELD(withCheckOptions);
+ COMPARE_SCALAR_FIELD(specClause);
COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(groupClause);
COMPARE_NODE_FIELD(havingQual);
***************
*** 880,886 **** _equalInsertStmt(const InsertStmt *a, const InsertStmt *b)
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(cols);
COMPARE_NODE_FIELD(selectStmt);
! COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(withClause);
return true;
--- 881,888 ----
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(cols);
COMPARE_NODE_FIELD(selectStmt);
! COMPARE_SCALAR_FIELD(specClause);
! COMPARE_NODE_FIELD(rlist);
COMPARE_NODE_FIELD(withClause);
return true;
***************
*** 892,898 **** _equalDeleteStmt(const DeleteStmt *a, const DeleteStmt *b)
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(usingClause);
COMPARE_NODE_FIELD(whereClause);
! COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(withClause);
return true;
--- 894,900 ----
COMPARE_NODE_FIELD(relation);
COMPARE_NODE_FIELD(usingClause);
COMPARE_NODE_FIELD(whereClause);
! COMPARE_NODE_FIELD(rlist);
COMPARE_NODE_FIELD(withClause);
return true;
***************
*** 905,911 **** _equalUpdateStmt(const UpdateStmt *a, const UpdateStmt *b)
COMPARE_NODE_FIELD(targetList);
COMPARE_NODE_FIELD(whereClause);
COMPARE_NODE_FIELD(fromClause);
! COMPARE_NODE_FIELD(returningList);
COMPARE_NODE_FIELD(withClause);
return true;
--- 907,913 ----
COMPARE_NODE_FIELD(targetList);
COMPARE_NODE_FIELD(whereClause);
COMPARE_NODE_FIELD(fromClause);
! COMPARE_NODE_FIELD(rlist);
COMPARE_NODE_FIELD(withClause);
return true;
***************
*** 2344,2349 **** _equalWithClause(const WithClause *a, const WithClause *b)
--- 2346,2361 ----
}
static bool
+ _equalReturningClause(const ReturningClause *a, const ReturningClause *b)
+ {
+ COMPARE_NODE_FIELD(returningList);
+ COMPARE_SCALAR_FIELD(rejects);
+ COMPARE_LOCATION_FIELD(location);
+
+ return true;
+ }
+
+ static bool
_equalCommonTableExpr(const CommonTableExpr *a, const CommonTableExpr *b)
{
COMPARE_STRING_FIELD(ctename);
***************
*** 3049,3054 **** equal(const void *a, const void *b)
--- 3061,3069 ----
case T_WithClause:
retval = _equalWithClause(a, b);
break;
+ case T_ReturningClause:
+ retval = _equalReturningClause(a, b);
+ break;
case T_CommonTableExpr:
retval = _equalCommonTableExpr(a, b);
break;
*** a/src/backend/nodes/nodeFuncs.c
--- b/src/backend/nodes/nodeFuncs.c
***************
*** 1460,1465 **** exprLocation(const Node *expr)
--- 1460,1468 ----
case T_WithClause:
loc = ((const WithClause *) expr)->location;
break;
+ case T_ReturningClause:
+ loc = ((const ReturningClause *) expr)->location;
+ break;
case T_CommonTableExpr:
loc = ((const CommonTableExpr *) expr)->location;
break;
***************
*** 2946,2952 **** raw_expression_tree_walker(Node *node,
return true;
if (walker(stmt->selectStmt, context))
return true;
! if (walker(stmt->returningList, context))
return true;
if (walker(stmt->withClause, context))
return true;
--- 2949,2955 ----
return true;
if (walker(stmt->selectStmt, context))
return true;
! if (walker(stmt->rlist, context))
return true;
if (walker(stmt->withClause, context))
return true;
***************
*** 2962,2968 **** raw_expression_tree_walker(Node *node,
return true;
if (walker(stmt->whereClause, context))
return true;
! if (walker(stmt->returningList, context))
return true;
if (walker(stmt->withClause, context))
return true;
--- 2965,2971 ----
return true;
if (walker(stmt->whereClause, context))
return true;
! if (walker(stmt->rlist, context))
return true;
if (walker(stmt->withClause, context))
return true;
***************
*** 2980,2986 **** raw_expression_tree_walker(Node *node,
return true;
if (walker(stmt->fromClause, context))
return true;
! if (walker(stmt->returningList, context))
return true;
if (walker(stmt->withClause, context))
return true;
--- 2983,2989 ----
return true;
if (walker(stmt->fromClause, context))
return true;
! if (walker(stmt->rlist, context))
return true;
if (walker(stmt->withClause, context))
return true;
***************
*** 3175,3180 **** raw_expression_tree_walker(Node *node,
--- 3178,3185 ----
break;
case T_WithClause:
return walker(((WithClause *) node)->ctes, context);
+ case T_ReturningClause:
+ return walker(((ReturningClause *) node)->returningList, context);
case T_CommonTableExpr:
return walker(((CommonTableExpr *) node)->ctequery, context);
default:
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 336,341 **** _outModifyTable(StringInfo str, const ModifyTable *node)
--- 336,342 ----
WRITE_NODE_FIELD(returningLists);
WRITE_NODE_FIELD(fdwPrivLists);
WRITE_NODE_FIELD(rowMarks);
+ WRITE_ENUM_FIELD(spec, SpecType);
WRITE_INT_FIELD(epqParam);
}
***************
*** 2250,2255 **** _outQuery(StringInfo str, const Query *node)
--- 2251,2257 ----
WRITE_NODE_FIELD(jointree);
WRITE_NODE_FIELD(targetList);
WRITE_NODE_FIELD(withCheckOptions);
+ WRITE_ENUM_FIELD(specClause, SpecType);
WRITE_NODE_FIELD(returningList);
WRITE_NODE_FIELD(groupClause);
WRITE_NODE_FIELD(havingQual);
***************
*** 2323,2328 **** _outWithClause(StringInfo str, const WithClause *node)
--- 2325,2340 ----
}
static void
+ _outReturningClause(StringInfo str, const ReturningClause *node)
+ {
+ WRITE_NODE_TYPE("RETURNINGCLAUSE");
+
+ WRITE_NODE_FIELD(returningList);
+ WRITE_BOOL_FIELD(rejects);
+ WRITE_LOCATION_FIELD(location);
+ }
+
+ static void
_outCommonTableExpr(StringInfo str, const CommonTableExpr *node)
{
WRITE_NODE_TYPE("COMMONTABLEEXPR");
***************
*** 3156,3161 **** _outNode(StringInfo str, const void *obj)
--- 3168,3176 ----
case T_WithClause:
_outWithClause(str, obj);
break;
+ case T_ReturningClause:
+ _outReturningClause(str, obj);
+ break;
case T_CommonTableExpr:
_outCommonTableExpr(str, obj);
break;
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 211,216 **** _readQuery(void)
--- 211,217 ----
READ_NODE_FIELD(jointree);
READ_NODE_FIELD(targetList);
READ_NODE_FIELD(withCheckOptions);
+ READ_ENUM_FIELD(specClause, SpecType);
READ_NODE_FIELD(returningList);
READ_NODE_FIELD(groupClause);
READ_NODE_FIELD(havingQual);
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 4722,4728 **** make_modifytable(PlannerInfo *root,
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, int epqParam)
{
ModifyTable *node = makeNode(ModifyTable);
Plan *plan = &node->plan;
--- 4722,4728 ----
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, SpecType spec, int epqParam)
{
ModifyTable *node = makeNode(ModifyTable);
Plan *plan = &node->plan;
***************
*** 4774,4779 **** make_modifytable(PlannerInfo *root,
--- 4774,4780 ----
node->withCheckOptionLists = withCheckOptionLists;
node->returningLists = returningLists;
node->rowMarks = rowMarks;
+ node->spec = spec;
node->epqParam = epqParam;
/*
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 609,614 **** subquery_planner(PlannerGlobal *glob, Query *parse,
--- 609,615 ----
withCheckOptionLists,
returningLists,
rowMarks,
+ parse->specClause,
SS_assign_special_param(root));
}
}
***************
*** 1008,1013 **** inheritance_planner(PlannerInfo *root)
--- 1009,1015 ----
withCheckOptionLists,
returningLists,
rowMarks,
+ parse->specClause,
SS_assign_special_param(root));
}
*** a/src/backend/parser/analyze.c
--- b/src/backend/parser/analyze.c
***************
*** 61,67 **** static Node *transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
static void determineRecursiveColTypes(ParseState *pstate,
Node *larg, List *nrtargetlist);
static Query *transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt);
! static List *transformReturningList(ParseState *pstate, List *returningList);
static Query *transformDeclareCursorStmt(ParseState *pstate,
DeclareCursorStmt *stmt);
static Query *transformExplainStmt(ParseState *pstate,
--- 61,68 ----
static void determineRecursiveColTypes(ParseState *pstate,
Node *larg, List *nrtargetlist);
static Query *transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt);
! static List *transformReturningClause(ParseState *pstate, ReturningClause *returningList,
! bool *rejects);
static Query *transformDeclareCursorStmt(ParseState *pstate,
DeclareCursorStmt *stmt);
static Query *transformExplainStmt(ParseState *pstate,
***************
*** 343,348 **** transformDeleteStmt(ParseState *pstate, DeleteStmt *stmt)
--- 344,350 ----
{
Query *qry = makeNode(Query);
Node *qual;
+ bool rejects;
qry->commandType = CMD_DELETE;
***************
*** 373,384 **** transformDeleteStmt(ParseState *pstate, DeleteStmt *stmt)
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningList(pstate, stmt->returningList);
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
qry->hasSubLinks = pstate->p_hasSubLinks;
qry->hasWindowFuncs = pstate->p_hasWindowFuncs;
qry->hasAggs = pstate->p_hasAggs;
--- 375,394 ----
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningClause(pstate, stmt->rlist, &rejects);
!
! if (rejects)
! ereport(ERROR,
! (errcode(ERRCODE_SYNTAX_ERROR),
! errmsg("RETURNING clause does not accept REJECTS for DELETE statements"),
! parser_errposition(pstate,
! exprLocation((Node *) stmt->rlist))));
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
+ qry->specClause = SPEC_NONE;
qry->hasSubLinks = pstate->p_hasSubLinks;
qry->hasWindowFuncs = pstate->p_hasWindowFuncs;
qry->hasAggs = pstate->p_hasAggs;
***************
*** 399,404 **** transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
--- 409,415 ----
{
Query *qry = makeNode(Query);
SelectStmt *selectStmt = (SelectStmt *) stmt->selectStmt;
+ SpecType spec = stmt->specClause;
List *exprList = NIL;
bool isGeneralSelect;
List *sub_rtable;
***************
*** 410,415 **** transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
--- 421,427 ----
ListCell *icols;
ListCell *attnos;
ListCell *lc;
+ bool rejects = false;
/* There can't be any outer WITH to worry about */
Assert(pstate->p_ctenamespace == NIL);
***************
*** 737,755 **** transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
* RETURNING will work. Also, remove any namespace entries added in a
* sub-SELECT or VALUES list.
*/
! if (stmt->returningList)
{
pstate->p_namespace = NIL;
addRTEtoQuery(pstate, pstate->p_target_rangetblentry,
false, true, true);
! qry->returningList = transformReturningList(pstate,
! stmt->returningList);
}
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, NULL);
qry->hasSubLinks = pstate->p_hasSubLinks;
assign_query_collations(pstate, qry);
--- 749,782 ----
* RETURNING will work. Also, remove any namespace entries added in a
* sub-SELECT or VALUES list.
*/
! if (stmt->rlist)
{
pstate->p_namespace = NIL;
addRTEtoQuery(pstate, pstate->p_target_rangetblentry,
false, true, true);
! qry->returningList = transformReturningClause(pstate,
! stmt->rlist, &rejects);
}
/* done building the range table and jointree */
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, NULL);
+ /* Normalize speculative insertion specification */
+ if (rejects)
+ {
+ if (spec == SPEC_IGNORE)
+ spec = SPEC_IGNORE_REJECTS;
+ else if (spec == SPEC_UPDATE)
+ spec = SPEC_UPDATE_REJECTS;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("RETURNING clause with REJECTS can only appear when ON DUPLICATE KEY was also specified"),
+ parser_errposition(pstate,
+ exprLocation((Node *) stmt->rlist))));
+ }
+ qry->specClause = spec;
qry->hasSubLinks = pstate->p_hasSubLinks;
assign_query_collations(pstate, qry);
***************
*** 997,1002 **** transformSelectStmt(ParseState *pstate, SelectStmt *stmt)
--- 1024,1030 ----
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
+ qry->specClause = SPEC_NONE;
qry->hasSubLinks = pstate->p_hasSubLinks;
qry->hasWindowFuncs = pstate->p_hasWindowFuncs;
***************
*** 1893,1898 **** transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
--- 1921,1927 ----
Node *qual;
ListCell *origTargetList;
ListCell *tl;
+ bool rejects;
qry->commandType = CMD_UPDATE;
pstate->p_is_update = true;
***************
*** 1922,1931 **** transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningList(pstate, stmt->returningList);
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
qry->hasSubLinks = pstate->p_hasSubLinks;
--- 1951,1969 ----
qual = transformWhereClause(pstate, stmt->whereClause,
EXPR_KIND_WHERE, "WHERE");
! qry->returningList = transformReturningClause(pstate, stmt->rlist,
! &rejects);
!
! if (rejects)
! ereport(ERROR,
! (errcode(ERRCODE_SYNTAX_ERROR),
! errmsg("RETURNING clause does not accept REJECTS for UPDATE statements"),
! parser_errposition(pstate,
! exprLocation((Node *) stmt->rlist))));
qry->rtable = pstate->p_rtable;
qry->jointree = makeFromExpr(pstate->p_joinlist, qual);
+ qry->specClause = SPEC_NONE;
qry->hasSubLinks = pstate->p_hasSubLinks;
***************
*** 1995,2011 **** transformUpdateStmt(ParseState *pstate, UpdateStmt *stmt)
}
/*
! * transformReturningList -
* handle a RETURNING clause in INSERT/UPDATE/DELETE
*/
static List *
! transformReturningList(ParseState *pstate, List *returningList)
{
! List *rlist;
int save_next_resno;
! if (returningList == NIL)
! return NIL; /* nothing to do */
/*
* We need to assign resnos starting at one in the RETURNING list. Save
--- 2033,2055 ----
}
/*
! * transformReturningClause -
* handle a RETURNING clause in INSERT/UPDATE/DELETE
*/
static List *
! transformReturningClause(ParseState *pstate, ReturningClause *clause,
! bool *rejects)
{
! List *tlist, *rlist;
int save_next_resno;
! if (clause == NULL)
! {
! *rejects = false;
! return NIL;
! }
!
! rlist = clause->returningList;
/*
* We need to assign resnos starting at one in the RETURNING list. Save
***************
*** 2016,2030 **** transformReturningList(ParseState *pstate, List *returningList)
pstate->p_next_resno = 1;
/* transform RETURNING identically to a SELECT targetlist */
! rlist = transformTargetList(pstate, returningList, EXPR_KIND_RETURNING);
/* mark column origins */
! markTargetListOrigins(pstate, rlist);
/* restore state */
pstate->p_next_resno = save_next_resno;
! return rlist;
}
--- 2060,2098 ----
pstate->p_next_resno = 1;
/* transform RETURNING identically to a SELECT targetlist */
! tlist = transformTargetList(pstate, rlist, EXPR_KIND_RETURNING);
!
! /* Cannot accept system column Vars when returning rejects */
! if (clause->rejects)
! {
! ListCell *l;
!
! foreach(l, tlist)
! {
! TargetEntry *tle = (TargetEntry *) lfirst(l);
! Var *var = (Var *) tle->expr;
!
! if (var->varattno <= 0)
! {
! ereport(ERROR,
! (errcode(ERRCODE_UNDEFINED_COLUMN),
! errmsg("RETURNING clause cannot return system columns when REJECTS is specified"),
! parser_errposition(pstate,
! exprLocation((Node *) var))));
! }
! }
! }
!
! /* pass on if we return rejects */
! *rejects = clause->rejects;
/* mark column origins */
! markTargetListOrigins(pstate, tlist);
/* restore state */
pstate->p_next_resno = save_next_resno;
! return tlist;
}
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
***************
*** 204,209 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 204,210 ----
RangeVar *range;
IntoClause *into;
WithClause *with;
+ ReturningClause *returnc;
A_Indices *aind;
ResTarget *target;
struct PrivTarget *privtarget;
***************
*** 342,351 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
opclass_purpose opt_opfamily transaction_mode_list_or_empty
OptTableFuncElementList TableFuncElementList opt_type_modifiers
prep_type_clause
! execute_param_clause using_clause returning_clause
! opt_enum_val_list enum_val_list table_func_column_list
! create_generic_options alter_generic_options
! relation_expr_list dostmt_opt_list
%type <list> opt_fdw_options fdw_options
%type <defelt> fdw_option
--- 343,351 ----
opclass_purpose opt_opfamily transaction_mode_list_or_empty
OptTableFuncElementList TableFuncElementList opt_type_modifiers
prep_type_clause
! execute_param_clause using_clause opt_enum_val_list
! enum_val_list table_func_column_list create_generic_options
! alter_generic_options relation_expr_list dostmt_opt_list
%type <list> opt_fdw_options fdw_options
%type <defelt> fdw_option
***************
*** 396,401 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 396,402 ----
%type <defelt> SeqOptElem
%type <istmt> insert_rest
+ %type <ival> opt_on_duplicate_key
%type <vsetstmt> set_rest set_rest_more SetResetClause FunctionSetResetClause
***************
*** 489,494 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 490,497 ----
%type <node> func_expr func_expr_windowless
%type <node> common_table_expr
%type <with> with_clause opt_with_clause
+ %type <boolean> opt_rejects
+ %type <returnc> returning_clause
%type <list> cte_list
%type <list> window_clause window_definition_list opt_partition_clause
***************
*** 538,543 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
--- 541,547 ----
DATA_P DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS
DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS DESC
DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP
+ DUPLICATE
EACH ELSE ENABLE_P ENCODING ENCRYPTED END_P ENUM_P ESCAPE EVENT EXCEPT
EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
***************
*** 550,556 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
HANDLER HAVING HEADER_P HOLD HOUR_P
! IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IN_P
INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
--- 554,560 ----
HANDLER HAVING HEADER_P HOLD HOUR_P
! IDENTITY_P IF_P IGNORE ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IN_P
INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
***************
*** 579,585 **** static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
QUOTE
RANGE READ REAL REASSIGN RECHECK RECURSIVE REF REFERENCES REFRESH REINDEX
! RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK
ROW ROWS RULE
--- 583,589 ----
QUOTE
RANGE READ REAL REASSIGN RECHECK RECURSIVE REF REFERENCES REFRESH REINDEX
! REJECTS RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
RESET RESTART RESTRICT RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK
ROW ROWS RULE
***************
*** 8870,8880 **** DeallocateStmt: DEALLOCATE name
*****************************************************************************/
InsertStmt:
! opt_with_clause INSERT INTO qualified_name insert_rest returning_clause
{
$5->relation = $4;
! $5->returningList = $6;
$5->withClause = $1;
$$ = (Node *) $5;
}
;
--- 8874,8886 ----
*****************************************************************************/
InsertStmt:
! opt_with_clause INSERT INTO qualified_name insert_rest
! opt_on_duplicate_key returning_clause
{
$5->relation = $4;
! $5->rlist = $7;
$5->withClause = $1;
+ $5->specClause = $6;
$$ = (Node *) $5;
}
;
***************
*** 8918,8926 **** insert_column_item:
}
;
returning_clause:
! RETURNING target_list { $$ = $2; }
! | /* EMPTY */ { $$ = NIL; }
;
--- 8924,8968 ----
}
;
+ opt_on_duplicate_key:
+ ON DUPLICATE KEY LOCK_P FOR UPDATE
+ {
+ $$ = SPEC_UPDATE;
+ }
+ |
+ ON DUPLICATE KEY IGNORE
+ {
+ $$ = SPEC_IGNORE;
+ }
+ | /*EMPTY*/
+ {
+ $$ = SPEC_NONE;
+ }
+ ;
+
+ opt_rejects:
+ REJECTS
+ {
+ $$ = TRUE;
+ }
+ | /*EMPTY*/
+ {
+ $$ = FALSE;
+ }
+ ;
+
returning_clause:
! RETURNING opt_rejects target_list
! {
! $$ = makeNode(ReturningClause);
! $$->returningList = $3;
! $$->rejects = $2;
! $$->location = @1;
! }
! | /* EMPTY */
! {
! $$ = NULL;
! }
;
***************
*** 8938,8944 **** DeleteStmt: opt_with_clause DELETE_P FROM relation_expr_opt_alias
n->relation = $4;
n->usingClause = $5;
n->whereClause = $6;
! n->returningList = $7;
n->withClause = $1;
$$ = (Node *)n;
}
--- 8980,8986 ----
n->relation = $4;
n->usingClause = $5;
n->whereClause = $6;
! n->rlist = $7;
n->withClause = $1;
$$ = (Node *)n;
}
***************
*** 9005,9011 **** UpdateStmt: opt_with_clause UPDATE relation_expr_opt_alias
n->targetList = $5;
n->fromClause = $6;
n->whereClause = $7;
! n->returningList = $8;
n->withClause = $1;
$$ = (Node *)n;
}
--- 9047,9053 ----
n->targetList = $5;
n->fromClause = $6;
n->whereClause = $7;
! n->rlist = $8;
n->withClause = $1;
$$ = (Node *)n;
}
***************
*** 12589,12594 **** unreserved_keyword:
--- 12631,12637 ----
| DOMAIN_P
| DOUBLE_P
| DROP
+ | DUPLICATE
| EACH
| ENABLE_P
| ENCODING
***************
*** 12619,12624 **** unreserved_keyword:
--- 12662,12668 ----
| HOUR_P
| IDENTITY_P
| IF_P
+ | IGNORE
| IMMEDIATE
| IMMUTABLE
| IMPLICIT_P
***************
*** 12944,12949 **** reserved_keyword:
--- 12988,12994 ----
| PLACING
| PRIMARY
| REFERENCES
+ | REJECTS
| RETURNING
| SELECT
| SESSION_USER
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 4047,4053 **** RelationGetExclusionInfo(Relation indexRelation,
MemoryContextSwitchTo(oldcxt);
}
-
/*
* Routines to support ereport() reports of relation-related errors
*
--- 4047,4052 ----
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 837,842 **** HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
--- 837,843 ----
* Here, we consider the effects of:
* all transactions committed as of the time of the given snapshot
* previous commands of this transaction
+ * all rows only locked (not updated) by this transaction, committed by another
*
* Does _not_ include:
* transactions shown as in-progress by the snapshot
***************
*** 959,965 **** HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
--- 960,985 ----
* when...
*/
if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
+ {
+ /*
+ * Not visible to snapshot under conventional MVCC rules, but may still
+ * be exclusive locked by our xact and not updated, which will satisfy
+ * MVCC under a special rule. Importantly, this special rule will not
+ * be invoked if the row is updated, so only one version can be visible
+ * at once.
+ *
+ * Currently this is useful to exactly one case -- INSERT...ON
+ * DUPLICATE KEY LOCK FOR UPDATE, where it's possible and sometimes
+ * desirable to lock a row that would not otherwise be visible to the
+ * given MVCC snapshot. The locked row should on that basis alone
+ * become visible, for the benefit of READ COMMITTED mode.
+ */
+ if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+ TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+ return true;
+
return false; /* treat as still in progress */
+ }
if (tuple->t_infomask & HEAP_XMAX_INVALID) /* xid invalid or aborted */
return true;
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 138,144 **** extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
--- 138,144 ----
CommandId cid, int options, BulkInsertState bistate);
extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd, bool kill);
extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
***************
*** 348,361 **** extern void ExecCloseScanRelation(Relation scanrel);
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
! EState *estate);
! extern bool check_exclusion_constraint(Relation heap, Relation index,
IndexInfo *indexInfo,
ItemPointer tupleid,
Datum *values, bool *isnull,
EState *estate,
! bool newIndex, bool errorOK);
extern void RegisterExprContextCallback(ExprContext *econtext,
ExprContextCallbackFunction function,
--- 348,366 ----
extern void ExecOpenIndices(ResultRelInfo *resultRelInfo);
extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
+ extern List *ExecLockIndexValues(TupleTableSlot *slot, EState *estate,
+ SpecType specReason);
extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
! EState *estate, bool noErrorOnDuplicate);
! extern bool ExecCheckIndexConstraints(TupleTableSlot *slot, EState *estate,
! ItemPointer conflictTid);
! extern bool check_exclusion_or_unique_constraint(Relation heap, Relation index,
IndexInfo *indexInfo,
ItemPointer tupleid,
Datum *values, bool *isnull,
EState *estate,
! bool newIndex, bool errorOK, bool wait,
! ItemPointer conflictTid);
extern void RegisterExprContextCallback(ExprContext *econtext,
ExprContextCallbackFunction function,
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 41,46 ****
--- 41,49 ----
* ExclusionOps Per-column exclusion operators, or NULL if none
* ExclusionProcs Underlying function OIDs for ExclusionOps
* ExclusionStrats Opclass strategy numbers for ExclusionOps
+ * UniqueOps These are like Exclusion*, but for unique indexes
+ * UniqueProcs
+ * UniqueStrats
* Unique is it a unique index?
* ReadyForInserts is it valid for inserts?
* Concurrent are we doing a concurrent index build?
***************
*** 62,67 **** typedef struct IndexInfo
--- 65,73 ----
Oid *ii_ExclusionOps; /* array with one entry per column */
Oid *ii_ExclusionProcs; /* array with one entry per column */
uint16 *ii_ExclusionStrats; /* array with one entry per column */
+ Oid *ii_UniqueOps; /* array with one entry per column */
+ Oid *ii_UniqueProcs; /* array with one entry per column */
+ uint16 *ii_UniqueStrats; /* array with one entry per column */
bool ii_Unique;
bool ii_ReadyForInserts;
bool ii_Concurrent;
***************
*** 1085,1090 **** typedef struct ModifyTableState
--- 1091,1097 ----
int mt_whichplan; /* which one is being executed (0..n-1) */
ResultRelInfo *resultRelInfo; /* per-subplan target relations */
List **mt_arowmarks; /* per-subplan ExecAuxRowMark lists */
+ SpecType spec; /* reason for speculative insertion */
EPQState mt_epqstate; /* for evaluating EvalPlanQual rechecks */
bool fireBSTriggers; /* do we need to fire stmt triggers? */
} ModifyTableState;
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
***************
*** 403,408 **** typedef enum NodeTag
--- 403,409 ----
T_RowMarkClause,
T_XmlSerialize,
T_WithClause,
+ T_ReturningClause,
T_CommonTableExpr,
/*
***************
*** 546,551 **** typedef enum CmdType
--- 547,565 ----
* with qual */
} CmdType;
+ /* SpecType -
+ * "Speculative insertion" clause
+ *
+ * This also appears across various subsystems
+ */
+ typedef enum
+ {
+ SPEC_NONE, /* No reason to insert speculatively */
+ SPEC_IGNORE, /* "ON DUPLICATE KEY IGNORE" */
+ SPEC_IGNORE_REJECTS, /* same as SPEC_IGNORE, plus RETURNING rejected */
+ SPEC_UPDATE, /* "ON DUPLICATE KEY LOCK FOR UPDATE" */
+ SPEC_UPDATE_REJECTS /* same as SPEC_UPDATE, plus RETURNING rejected */
+ } SpecType;
/*
* JoinType -
*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
***************
*** 130,135 **** typedef struct Query
--- 130,137 ----
List *withCheckOptions; /* a list of WithCheckOption's */
+ SpecType specClause; /* speculative insertion clause */
+
List *returningList; /* return-values list (of TargetEntry) */
List *groupClause; /* a list of SortGroupClause's */
***************
*** 978,983 **** typedef struct WithClause
--- 980,1000 ----
} WithClause;
/*
+ * ReturningClause -
+ * representation of returninglist for parsing
+ *
+ * Note: ReturningClause does not propagate into the Query representation;
+ * returningList does, while rejects influences speculative insertion.
+ */
+ typedef struct ReturningClause
+ {
+ NodeTag type;
+ List *returningList; /* List proper */
+ bool rejects; /* A list of rejects? */
+ int location; /* token location, or -1 if unknown */
+ } ReturningClause;
+
+ /*
* CommonTableExpr -
* representation of WITH list element
*
***************
*** 1027,1033 **** typedef struct InsertStmt
RangeVar *relation; /* relation to insert into */
List *cols; /* optional: names of the target columns */
Node *selectStmt; /* the source SELECT/VALUES, or NULL */
! List *returningList; /* list of expressions to return */
WithClause *withClause; /* WITH clause */
} InsertStmt;
--- 1044,1051 ----
RangeVar *relation; /* relation to insert into */
List *cols; /* optional: names of the target columns */
Node *selectStmt; /* the source SELECT/VALUES, or NULL */
! SpecType specClause; /* ON DUPLICATE KEY specification */
! ReturningClause *rlist; /* expressions to return */
WithClause *withClause; /* WITH clause */
} InsertStmt;
***************
*** 1041,1047 **** typedef struct DeleteStmt
RangeVar *relation; /* relation to delete from */
List *usingClause; /* optional using clause for more tables */
Node *whereClause; /* qualifications */
! List *returningList; /* list of expressions to return */
WithClause *withClause; /* WITH clause */
} DeleteStmt;
--- 1059,1065 ----
RangeVar *relation; /* relation to delete from */
List *usingClause; /* optional using clause for more tables */
Node *whereClause; /* qualifications */
! ReturningClause *rlist; /* expressions to return */
WithClause *withClause; /* WITH clause */
} DeleteStmt;
***************
*** 1056,1062 **** typedef struct UpdateStmt
List *targetList; /* the target list (of ResTarget) */
Node *whereClause; /* qualifications */
List *fromClause; /* optional from clause for more tables */
! List *returningList; /* list of expressions to return */
WithClause *withClause; /* WITH clause */
} UpdateStmt;
--- 1074,1080 ----
List *targetList; /* the target list (of ResTarget) */
Node *whereClause; /* qualifications */
List *fromClause; /* optional from clause for more tables */
! ReturningClause *rlist; /* expressions to return */
WithClause *withClause; /* WITH clause */
} UpdateStmt;
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 176,181 **** typedef struct ModifyTable
--- 176,182 ----
List *returningLists; /* per-target-table RETURNING tlists */
List *fdwPrivLists; /* per-target-table FDW private data lists */
List *rowMarks; /* PlanRowMarks (non-locking only) */
+ SpecType spec; /* speculative insertion specification */
int epqParam; /* ID of Param for EvalPlanQual re-eval */
} ModifyTable;
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 84,90 **** extern ModifyTable *make_modifytable(PlannerInfo *root,
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, int epqParam);
extern bool is_projection_capable_plan(Plan *plan);
/*
--- 84,90 ----
CmdType operation, bool canSetTag,
List *resultRelations, List *subplans,
List *withCheckOptionLists, List *returningLists,
! List *rowMarks, SpecType spec, int epqParam);
extern bool is_projection_capable_plan(Plan *plan);
/*
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
***************
*** 133,138 **** PG_KEYWORD("document", DOCUMENT_P, UNRESERVED_KEYWORD)
--- 133,139 ----
PG_KEYWORD("domain", DOMAIN_P, UNRESERVED_KEYWORD)
PG_KEYWORD("double", DOUBLE_P, UNRESERVED_KEYWORD)
PG_KEYWORD("drop", DROP, UNRESERVED_KEYWORD)
+ PG_KEYWORD("duplicate", DUPLICATE, UNRESERVED_KEYWORD)
PG_KEYWORD("each", EACH, UNRESERVED_KEYWORD)
PG_KEYWORD("else", ELSE, RESERVED_KEYWORD)
PG_KEYWORD("enable", ENABLE_P, UNRESERVED_KEYWORD)
***************
*** 180,185 **** PG_KEYWORD("hold", HOLD, UNRESERVED_KEYWORD)
--- 181,187 ----
PG_KEYWORD("hour", HOUR_P, UNRESERVED_KEYWORD)
PG_KEYWORD("identity", IDENTITY_P, UNRESERVED_KEYWORD)
PG_KEYWORD("if", IF_P, UNRESERVED_KEYWORD)
+ PG_KEYWORD("ignore", IGNORE, UNRESERVED_KEYWORD)
PG_KEYWORD("ilike", ILIKE, TYPE_FUNC_NAME_KEYWORD)
PG_KEYWORD("immediate", IMMEDIATE, UNRESERVED_KEYWORD)
PG_KEYWORD("immutable", IMMUTABLE, UNRESERVED_KEYWORD)
***************
*** 307,312 **** PG_KEYWORD("ref", REF, UNRESERVED_KEYWORD)
--- 309,315 ----
PG_KEYWORD("references", REFERENCES, RESERVED_KEYWORD)
PG_KEYWORD("refresh", REFRESH, UNRESERVED_KEYWORD)
PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD)
+ PG_KEYWORD("rejects", REJECTS, RESERVED_KEYWORD)
PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD)
PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD)
PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD)
*** a/src/test/isolation/isolation_schedule
--- b/src/test/isolation/isolation_schedule
***************
*** 22,24 **** test: aborted-keyrevoke
--- 22,26 ----
test: multixact-no-deadlock
test: drop-index-concurrently-1
test: timeouts
+ test: insert-duplicate-key-ignore
+ test: insert-duplicate-key-lock-for-update
*** /dev/null
--- b/src/test/isolation/specs/insert-duplicate-key-ignore.spec
***************
*** 0 ****
--- 1,42 ----
+ # INSERT...ON DUPLICATE KEY IGNORE test
+ #
+ # This test tries to expose problems with the interaction between concurrent
+ # sessions during INSERT...ON DUPLICATE KEY IGNORE.
+ #
+ # The convention here is that session 1 always ends up inserting, and session 2
+ # always ends up ignoring.
+
+ setup
+ {
+ CREATE TABLE ints (key int primary key, val text);
+ }
+
+ teardown
+ {
+ DROP TABLE ints;
+ }
+
+ session "s1"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "ignore1" { INSERT INTO ints(key, val) VALUES(1, 'ignore1') ON DUPLICATE KEY IGNORE; }
+ step "select1" { SELECT * FROM ints; }
+ step "c1" { COMMIT; }
+ step "a1" { ABORT; }
+
+ session "s2"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "ignore2" { INSERT INTO ints(key, val) VALUES(1, 'ignore2') ON DUPLICATE KEY IGNORE; }
+ step "select2" { SELECT * FROM ints; }
+ step "c2" { COMMIT; }
+ step "a2" { ABORT; }
+
+ # Regular case where one session block-waits on another to determine if it
+ # should proceed with an insert or ignore.
+ permutation "ignore1" "ignore2" "c1" "select2" "c2"
+ permutation "ignore1" "ignore2" "a1" "select2" "c2"
*** /dev/null
--- b/src/test/isolation/specs/insert-duplicate-key-lock-for-update.spec
***************
*** 0 ****
--- 1,39 ----
+ # INSERT...ON DUPLICATE KEY LOCK FOR UPDATE test
+ #
+ # This test tries to expose problems with the interaction between concurrent
+ # sessions during INSERT...ON DUPLICATE KEY LOCK FOR UPDATE.
+
+ setup
+ {
+ CREATE TABLE ints (key int primary key, val text);
+ }
+
+ teardown
+ {
+ DROP TABLE ints;
+ }
+
+ session "s1"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "lock1" { INSERT INTO ints(key, val) VALUES(1, 'lock1') ON DUPLICATE KEY LOCK FOR UPDATE; }
+ step "select1" { SELECT * FROM ints; }
+ step "c1" { COMMIT; }
+ step "a1" { ABORT; }
+
+ session "s2"
+ setup
+ {
+ BEGIN ISOLATION LEVEL READ COMMITTED;
+ }
+ step "lock2" { INSERT INTO ints(key, val) VALUES(1, 'lock2') ON DUPLICATE KEY LOCK FOR UPDATE; }
+ step "select2" { SELECT * FROM ints; }
+ step "c2" { COMMIT; }
+ step "a2" { ABORT; }
+
+ # Regular case where one session block-waits on another to determine if it
+ # should proceed with an insert or lock.
+ permutation "lock1" "lock2" "c1" "select2" "c2"
+ permutation "lock1" "lock2" "a1" "select2" "c2"
On Tue, Nov 26, 2013 at 11:32 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
After fixing that bug, I'm getting a correctly-detected deadlock every now
and then with that test case.
We'll probably want to carefully consider how
predictably/deterministically this occurs.
Hmm. That's because the trick I used to kill the just-inserted tuple
confuses a concurrent heap_lock_tuple call. It doesn't expect the tuple it's
locking to become invisible. Actually, doesn't your patch have the same bug?
If you're about to lock a tuple in ON DUPLICATE KEY LOCK FOR UPDATE, and the
transaction that inserted the duplicate row aborts just before the
heap_lock_tuple() call, I think you'd also see that error.
Yes, that's true. It will occur much more frequently with your
previous revision, but the V4 patch is also affected.
To me this seems like a problem with the (potential) total lack of
locking that your approach takes (inserting btree unique index tuples
as in your patch is a form of value locking...sort of...it's a little
hard to reason about as presented). Do you think this might be an
inherent problem, or can you suggest a way to make your approach still
work?
Just garden-variety bugs :-). Attached patch fixes both issues.
Great. I'll let you know what I think.
* Doesn't just work with one unique index. Naming a unique index
directly in DML, or assuming that the PK is intended seems quite weak
to me.
Totally agreed on that.
Good.
BTW, you keep forgetting to add "expected" output of the new isolation tests.
--
Peter Geoghegan
On Tue, Nov 26, 2013 at 1:41 PM, Peter Geoghegan <pg@heroku.com> wrote:
Great. I'll let you know what I think.
So having taken a look at what you've done here, some concerns remain.
I'm coming up with a good explanation/test case, which might be easier
than trying to explain it any other way.
There are some visibility-related race conditions even still, with the
same test case as before. It takes a good while to recreate, but can
be done after several hours on an 8 core server under my control:
pg@gerbil:~/pgdata$ ls -l -h -a hack_log.log
-rw-rw-r-- 1 pg pg 1.6G Nov 27 05:10 hack_log.log
pg@gerbil:~/pgdata$ cat hack_log.log | grep visible
ERROR: attempted to update invisible tuple
ERROR: attempted to update invisible tuple
ERROR: attempted to update invisible tuple
FWIW I'm pretty sure that my original patch has the same bug, but it
hardly matters now.
--
Peter Geoghegan
On Tue, Nov 26, 2013 at 8:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
There are some visibility-related race conditions even still
I also see this, sandwiched between the very many "deadlock detected"
errors recorded over 6 or so hours (this is in chronological order,
with no ERRORs omitted within the range shown):
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: deadlock detected
This, along with the already-discussed "attempted to update invisible
tuple" forms a full account of unexpected ERRORs seen during the
extended run of the test case, so far.
Since it took me a relatively long time to recreate this, it may not
be trivial to do so. Unless you don't think it's useful to do so, I'm
going to give this test a full 24 hours, just in case it shows up
anything else like this.
--
Peter Geoghegan
On 2013-11-27 01:09:49 -0800, Peter Geoghegan wrote:
On Tue, Nov 26, 2013 at 8:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
There are some visibility-related race conditions even still
I also see this, sandwiched between the very many "deadlock detected"
errors recorded over 6 or so hours (this is in chronological order,
with no ERRORs omitted within the range shown):
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: unable to fetch updated version of tuple
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: deadlock detected
ERROR: deadlock detected
This, along with the already-discussed "attempted to update invisible
tuple" forms a full account of unexpected ERRORs seen during the
extended run of the test case, so far.
I think at least the "unable to fetch updated version of tuple" ERRORs
are likely to be an unrelated 9.3+ BUG that I've recently
reported. Alvaro has a patch. C.f. 20131124000203.GA4403@alap2.anarazel.de
Even the "deadlock detected" errors might be a fkey-locking issue. Bug
#8434, but that's really hard to know without more details.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 27, 2013 at 1:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Even the "deadlock detected" errors might be a fkey-locking issue. Bug
#8434, but that's really hard to know without more details.
Thanks, I was aware of that but didn't make the connection.
I've written a test-case that is designed to exercise one case that
deadlocks like crazy - deadlocking is the expected, correct behavior.
The deadlock errors are not in themselves suspicious. Actually, if
anything I find it suspicious that there aren't more deadlocks.
--
Peter Geoghegan
On Wed, Nov 27, 2013 at 1:09 AM, Peter Geoghegan <pg@heroku.com> wrote:
Since it took me a relatively long time to recreate this, it may not
be trivial to do so. Unless you don't think it's useful to do so, I'm
going to give this test a full 24 hours, just in case it shows up
anything else like this.
I see a further, distinct error message this morning:
"ERROR: unrecognized heap_lock_tuple status: 1"
This is a would-be "attempted to lock invisible tuple" error, but with
the error raised by some heap_lock_tuple() call site, unlike the
previous situation where heap_lock_tuple() raised the error directly.
Since with the most recent revision, we handle this (newly possible)
return code in the new ExecLockHeapTupleForUpdateSpec() function, that
just leaves EvalPlanQualFetch() as a plausible place to see it, given
the codepaths exercised in the test case.
--
Peter Geoghegan
On Wed, Nov 27, 2013 at 1:09 AM, Peter Geoghegan <pg@heroku.com> wrote:
This, along with the already-discussed "attempted to update invisible
tuple" forms a full account of unexpected ERRORs seen during the
extended run of the test case, so far.
Actually, it was slightly misleading of me to say it's the same
test-case; in fact, this time I ran each pgbench run with a variable,
random number of seconds between 2 and 20 inclusive (as opposed to
always 2 seconds). If you happen to need help recreating this, I am
happy to give it.
--
Peter Geoghegan
What's the status of this patch? I posted my version using a quite
different approach than your original patch. You did some testing of
that, and ran into unrelated bugs. Have they been fixed now?
Where do we go from here? Are you planning to continue based on my
proof-of-concept patch, fixing the known issues with that? Or do you
need more convincing?
- Heikki
On Thu, Dec 12, 2013 at 1:23 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
What's the status of this patch? I posted my version using a quite different
approach than your original patch. You did some testing of that, and ran
into unrelated bugs. Have they been fixed now?
Sorry, I dropped the ball on this. I'm doing a bit more testing of an
approach to fixing the new bugs. I'll let you know how I get on
tomorrow (later today for you).
--
Peter Geoghegan
On Thu, Dec 12, 2013 at 1:47 AM, Peter Geoghegan <pg@heroku.com> wrote:
Sorry, I dropped the ball on this.
Thank you for your patience, Heikki.
I attached two revisions - one of my patch (btreelock_insert_on_dup)
and one of your alternative design (exclusion_insert_on_dup). In both
cases I've added a new visibility rule to HeapTupleSatisfiesUpdate(),
and enabled projecting on duplicate-causing-tid by means of the ctid
system column when RETURNING REJECTS. I'm not in an immediate position
to satisfy myself that the former revision is correct (I'm travelling
tomorrow morning and running a bit short on time) and I'm not
proposing the latter for inclusion as part of the feature (that's a
discussion we may have in time, but it serves a useful purpose during
testing).
Both of these revisions have identical ad-hoc test cases included as
new files - see testcase.sh and upsert.sql. My patch doesn't have any
unique constraint violations, and has pretty consistent performance,
while yours has many unique constraint violations. I'd like to hear
your thoughts on the testcase, and the design implications.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
I attached two revisions - one of my patch (btreelock_insert_on_dup)
and one of your alternative design (exclusion_insert_on_dup).
I spent a little bit of time looking at btreelock_insert_on_dup. AFAICT
it executes FormIndexDatum() for later indexes while holding assorted
buffer locks in earlier indexes. That really ain't gonna do, because in
the case of an expression index, FormIndexDatum will execute nearly
arbitrary user-defined code, which might well result in accesses to those
indexes or others. What we'd have to do is refactor so that all the index
tuple values get computed before we start to insert any of them. That
doesn't seem impossible, but it implies a good deal more refactoring than
has been done here.
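(For illustration only, the reordering described would amount to a two-phase
loop along these lines; the variable names are placeholders and this is not
the actual ExecInsertIndexTuples() code:
	int	i;
	/*
	 * allValues[i]/allIsnull[i] are assumed to be per-index scratch arrays
	 * (Datum[INDEX_MAX_KEYS] / bool[INDEX_MAX_KEYS]) allocated up front.
	 */
	/* Phase 1: run FormIndexDatum() -- and therefore any user-defined
	 * expression code -- for every index before any buffer locks are taken */
	for (i = 0; i < numIndices; i++)
		FormIndexDatum(indexInfoArray[i], slot, estate,
					   allValues[i], allIsnull[i]);
	/* Phase 2: do the actual index insertions, knowing that no further
	 * user code has to run while value locks may be held */
	for (i = 0; i < numIndices; i++)
		index_insert(relationDescs[i], allValues[i], allIsnull[i],
					 tupleid, heapRelation,
					 indexInfoArray[i]->ii_Unique ?
					 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
)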
Once we do that, I wonder if we couldn't get rid of the LWLockWeaken/
Strengthen stuff. That scares the heck out of me; I think it's deadlock
problems waiting to happen.
Another issue is that the number of buffer locks being held doesn't seem
to be bounded by anything much. The current LWLock infrastructure has a
hard limit on how many lwlocks can be held per backend.
Also, the lack of any doc updates makes it hard to review this. I can
see that you don't want to touch the user-facing docs until the syntax
is agreed on, but at the very least you ought to produce real updates
for the indexam API spec, since you're changing that significantly.
BTW, so far as the syntax goes, I'm quite distressed by having to make
REJECTS into a fully-reserved word. It's not reserved according to the
standard, and it seems pretty likely to be something that apps might be
using as a table or column name.
regards, tom lane
On Fri, Dec 13, 2013 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I spent a little bit of time looking at btreelock_insert_on_dup. AFAICT
it executes FormIndexDatum() for later indexes while holding assorted
buffer locks in earlier indexes. That really ain't gonna do, because in
the case of an expression index, FormIndexDatum will execute nearly
arbitrary user-defined code, which might well result in accesses to those
indexes or others. What we'd have to do is refactor so that all the index
tuple values get computed before we start to insert any of them. That
doesn't seem impossible, but it implies a good deal more refactoring than
has been done here.
We were proceeding on the basis that what I'd done, if deemed
acceptable in principle, could eventually be replaced by an
alternative value locking implementation that more or less similarly
extends the limited way in which value locking already occurs (i.e.
unique index enforcement's buffer locking), but without the downsides.
While I certainly appreciate your input, I still think that there is a
controversy about what implementation gets us the most useful
semantics, and I think we should now focus on resolving it. I am not
sure that Heikki's approach is functionally equivalent to mine. At the
very least, I think the trade-off of doing one or the other should be
well understood.
Once we do that, I wonder if we couldn't get rid of the LWLockWeaken/
Strengthen stuff. That scares the heck out of me; I think it's deadlock
problems waiting to happen.
There are specific caveats around using those. I think that they could
be useful elsewhere, but are likely to only ever have a few clients.
As previously mentioned, the same semantics appear in other similar
locking primitives in other domains, so fwiw it really doesn't strike
me as all that controversial. I agree that their *usage* is not
acceptable as-is. I've only left the usage in the patch to give us
some basis for reasoning about the performance on mixed workloads for
comparative purposes. Perhaps I shouldn't have even done that, to
better focus reviewer attention on the semantics implied by each
implementation.
Also, the lack of any doc updates makes it hard to review this. I can
see that you don't want to touch the user-facing docs until the syntax
is agreed on, but at the very least you ought to produce real updates
for the indexam API spec, since you're changing that significantly.
I'll certainly do that in any future revision.
--
Peter Geoghegan
On Thu, Dec 12, 2013 at 4:18 PM, Peter Geoghegan <pg@heroku.com> wrote:
Both of these revisions have identical ad-hoc test cases included as
new files - see testcase.sh and upsert.sql. My patch doesn't have any
unique constraint violations, and has pretty consistent performance,
while yours has many unique constraint violations. I'd like to hear
your thoughts on the testcase, and the design implications.
I withdraw the test-case. Both approaches behave similarly if you look
for long enough, and that's okay.
I also think that changes to HeapTupleSatisfiesUpdate() are made
unnecessary by recent bug fixes to that function. The test case
previously described [1] that broke that is no longer recreatable, at
least so far.
Do you think that we need to throw a serialization failure within
ExecLockHeapTupleForUpdateSpec() iff heap_lock_tuple() returns
HeapTupleInvisible and IsolationUsesXactSnapshot()? Also, I'm having a
hard time figuring out a good choke point to catch MVCC snapshots
availing of our special visibility rule where they should not due to
IsolationUsesXactSnapshot(). It seems sufficient to continue to assume
that Postgres won't attempt to lock any tid invisible under
conventional MVCC rules in the first place, except within
ExecLockHeapTupleForUpdateSpec(), but what do we actually do within
ExecLockHeapTupleForUpdateSpec()? I'm thinking of a new tqual.c
routine concerning the tuple being in the future that we re-check when
IsolationUsesXactSnapshot(). That's not very modular, though. Maybe
we'd go through heapam.c.
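For concreteness, the check being contemplated might look roughly like the
following fragment inside ExecLockHeapTupleForUpdateSpec(); the lock mode,
wait behaviour and the "retry" label are assumptions for illustration, not
the patch as posted:
	HTSU_Result	test;
	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
						   LockTupleExclusive, false /* nowait */,
						   true /* follow_updates */, &buffer, &hufd);
	switch (test)
	{
		case HeapTupleInvisible:
			/*
			 * The conflicting tuple is not visible to us under conventional
			 * MVCC rules.  At higher isolation levels, report a
			 * serialization failure rather than reaching into the future;
			 * at READ COMMITTED, restart the insert-or-lock attempt.
			 */
			if (IsolationUsesXactSnapshot())
				ereport(ERROR,
						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
						 errmsg("could not serialize access due to concurrent update")));
			goto retry;
		default:
			/* other HTSU_Result codes handled as elsewhere */
			break;
	}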
I think it doesn't matter that what now constitute MVCC snapshots
(with the new, special "reach into the future" rule) have that new
rule, for the purposes of higher isolation levels, because we'll have
a serialization failure within ExecLockHeapTupleForUpdateSpec() before
this is allowed to become a problem. In order for the new rule to be
relevant, we'd have to be the Xact to lock in the first place, and as
an xact in non-read-committed mode, we'd be sure to call the new
tqual.c "in the future" routine or whatever. Only upserters can lock a
row in the future, so it is the job of upserters to care about this
special case.
Incidentally, I tried to rebase recently, and saw some shift/reduce
conflicts due to 1b4f7f93b4693858cb983af3cd557f6097dab67b, "Allow
empty target list in SELECT". The fix for that is not immediately
obvious.
So I think we should proceed with the non-conclusive-check-first
approach (if only on pragmatic grounds), but even now I'm not really
sure. I think there might be unprincipled deadlocking should
ExecInsertIndexTuples() fail to be completely consistent about its
ordering of insertion - the use of dirty snapshots (including as part
of conventional !UNIQUE_CHECK_PARTIAL unique index enforcement) plays
a part in this risk. Roughly speaking, heap_delete() doesn't render
the tuple immediately invisible to some-other-xact's dirty snapshot
[2] is also beneficial in some ways. Our old, dead tuples from previous
attempts stick around, and function as "value locks" to everyone else,
since for example _bt_check_unique() cares about visibility having
merely been affected, which is grounds for blocking. More
counter-intuitive still, we go ahead with "value locking" (i.e. btree
UNIQUE_CHECK_PARTIAL tuple insertion originating from the main
speculative ExecInsertIndexTuples() call) even though we already know
that we will delete the corresponding heap row (which, as noted, still
satisfies HeapTupleSatisfiesDirty() and so is value-lock-like).
Empirically, retrying because ExecInsertIndexTuples() returns some
recheckIndexes occurs infrequently, so maybe that makes all of this
okay. Or maybe it happens infrequently *because* we don't give up on
insertion when it looks like the current iteration is futile. Maybe
just inserting into every unique index, and then blocking on an xid
within ExecCheckIndexConstraints(), works out fairly and performs
reasonably in all common cases. It's pretty damn subtle, though, and I
worry about the worst case performance, and basic correctness issues
for these reasons. The fact that deferred unique indexes also use
UNIQUE_CHECK_PARTIAL is cold comfort -- that case only ever has to throw
an error on conflict, and only once. We haven't "earned the right" to
lock *all* values in all unique indexes, but kind of do so anyway in
the event of an "insertion conflicted after pre-check".
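To illustrate the "earned the right" point, consider a hypothetical
table with several unique constraints (my own example, not from any
posted test case):

create table multi_u (
  a int4 unique,
  b int4 unique,
  c int4 unique
);

-- Suppose only a row with a = 1 already exists. An upsert proposing
-- (1, 2, 3) conflicts only on a, but by the time that conflict is
-- detected the speculative ExecInsertIndexTuples() call has already
-- inserted UNIQUE_CHECK_PARTIAL index tuples for b = 2 and c = 3 as
-- well. Even after our heap row is deleted, those entries go on
-- satisfying other sessions' dirty snapshots for a while, so
-- concurrent inserters of b = 2 or c = 3 may block on us despite
-- there never having been any genuine conflict on those values.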
Another concern that bears reiterating is: I think making the
lock-for-update case work for exclusion constraints is a lot of
additional complexity for a very small return.
Do you think it's worth optimizing ExecInsertIndexTuples() to avoid
futile non-unique/exclusion constrained index tuple insertion?
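For example, with a hypothetical schema like this one (again mine, not
from the patches), each futile speculative attempt pays for index
tuple insertion that can never detect a conflict:

create table t_mixed (
  k int4 primary key,
  payload text
);
create index t_mixed_payload_idx on t_mixed (payload);   -- not unique

-- A speculative insertion attempt for a duplicate k may also insert
-- an index tuple into t_mixed_payload_idx before the attempt is
-- abandoned, even though a non-unique index can never be the source
-- of a conflict. Skipping non-unique (and merely exclusion
-- constrained) indexes until we know the attempt will succeed is the
-- optimization I'm asking about.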
[1]: /messages/by-id/CAM3SWZS2--GOvUmYA2ks_aNyfesb0_H6T95_k8+wyx7Pi=CQvw@mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/94b899b829657332bda856ac3f06153d09077bd1/src/backend/utils/time/tqual.c#L798
--
Peter Geoghegan
On Wed, Dec 18, 2013 at 8:39 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Empirically, retrying because ExecInsertIndexTuples() returns some
> recheckIndexes occurs infrequently, so maybe that makes all of this
> okay. Or maybe it happens infrequently *because* we don't give up on
> insertion when it looks like the current iteration is futile. Maybe
> just inserting into every unique index, and then blocking on an xid
> within ExecCheckIndexConstraints(), works out fairly and performs
> reasonably in all common cases. It's pretty damn subtle, though, and I
> worry about the worst case performance, and basic correctness issues
> for these reasons.
I realized that it's possible to create the problem that I'd
previously predicted with "promise tuples" [1] some time ago, which are
similar in some regards to what Heikki has here. At the time, Robert
seemed to agree that this was a concern [2].
I have a very simple testcase attached, much simpler than previous
testcases, that reproduces deadlock for the patch
exclusion_insert_on_dup.2013_12_12.patch.gz at scale=1 frequently, and
occasionally when scale=10 (for tiny, single-statement transactions).
With scale=100, I can't get it to deadlock on my laptop (60 clients in
all cases), at least in a reasonable time period. With the patch
btreelock_insert_on_dup.2013_12_12.patch.gz, it will never deadlock,
even with scale=1, simply because value locks are not held on to
across row locking. This is why I characterized the locking as
"opportunistic" on several occasions in months past.
The test-case is actually much simpler than the one I describe in [1],
and much simpler than all previous test-cases, as there is only one
unique index, though the problem is essentially the same. It is down
to old "value locks" held across retries - with "exclusion_...", we
can't *stop* locking things from previous locking attempts (where a
locking attempt is btree insertion with the UNIQUE_CHECK_PARTIAL
flag), because dirty snapshots still see
inserted-then-deleted-in-other-xact tuples. This deadlocking seems
unprincipled and unjustified, which is a concern that I had all along,
and a concern that Heikki seemed to share more recently [3]. This is
why I felt strongly all along that value locks ought to be cheap to
both acquire and _release_, and it's unfortunate that so much time was
wasted on tangential issues, though I do accept some responsibility
for that.
So, I'd like to request as much scrutiny as possible, from as wide a
cross section of contributors as possible, of this test case
specifically. This feature's development is deadlocked on resolving
this unprincipled deadlocking controversy. This is a relatively easy
thing to have an opinion on, and I'd like to hear as many as possible.
Is this deadlocking something we can live with? What is a reasonable
path forward? Perhaps I am being pedantic in considering unnecessary
deadlocking as ipso facto unacceptable (after all, MySQL lived with
this kind of problem for long enough, even if it has gotten better for
them recently), but there is a very real danger of painting ourselves
into a corner with these concurrency issues. I aim to have the
community understand ahead of time the exact set of
trade-offs/semantics implied by our chosen implementation, whatever
the outcome. That seems very important. I myself lean towards this
being a blocker for the "exclusion_" approach, at least as presented.
Now, you might say to yourself "why should I assume that this isn't
just attributable to btree page buffer locks being coarser than other
approaches to value locking?". That's a reasonable point, and indeed
it's why I avoided lower scale values in prior, more complicated
test-cases, but that doesn't actually account for the problem
highlighted: In this test-case we do not hold buffer locks across
other buffer locks within a single backend (at least in any new way),
nor do we lock rows while holding buffer locks within a single
backend. Quite simply, the conventional btree value locking approach
doesn't attempt to lock 2 things within a backend at the same time,
and you need to do that to get a deadlock, so there are no deadlocks.
Importantly, the "btree_..." implementation can release value locks.
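For reference, the minimal shape any deadlock has to take is two
sessions that each hold one thing while waiting for another (a generic
example, nothing specific to either patch):

create table t (k int4 primary key, v text);
insert into t values (1, 'a'), (2, 'b');

-- session 1:
begin;
update t set v = 'x' where k = 1;   -- locks row 1
-- session 2:
begin;
update t set v = 'y' where k = 2;   -- locks row 2
-- session 1:
update t set v = 'x' where k = 2;   -- waits for session 2
-- session 2:
update t set v = 'y' where k = 1;   -- waits for session 1: deadlock

With the "exclusion_" approach, one of the two things held can be a
leftover would-be value lock from an earlier attempt, which the
backend has no way to release; with the "btree_..." approach that
situation doesn't arise.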
Thanks
P.S. In the interest of reproducibility, I attach new revisions of
each patch, even though there is no reason to believe that any changes
since the last revision posted are significant to the test-case. There
was some diff minimization, plus I incorporated some (but not all)
unrelated feedback from Tom. It wasn't immediately obvious, at least
to me, that "rejects" can be made an unreserved keyword, due to
shift/reduce conflicts, but I did document the AM changes. Hopefully this
gives some indication of the essential nature or intent of my design
that we may work towards refining (depending on the outcome of
discussion here, of course).
P.P.S. Be careful not to fall afoul of the shift/reduce conflicts when
applying either patch on top of commit
1b4f7f93b4693858cb983af3cd557f6097dab67b. I'm working on a fix that
allows a clean rebase.
[1]: /messages/by-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9M7_g@mail.gmail.com
[2]: /messages/by-id/CA+TgmobwDZSVcKWTmVNBxeHSe4LCnW6zon2soH6L7VoO+7tAzw@mail.gmail.com
[3]: /messages/by-id/528B640F.50601@vmware.com
--
Peter Geoghegan
Attachments:
btreelock_insert_on_dup.2013_12_19.patch.gz (application/x-gzip)