INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
The attached patch implements INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE. This is similar to INSERT...ON DUPLICATE KEY IGNORE (which is
also proposed as part of this new revision of the patch), but
additionally acquires an exclusive row lock on each existing row whose
presence prevents insertion of some tuple proposed for insertion.
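For instance, the IGNORE variant (a sketch of the proposed syntax,
using the foo table created in the example below) simply skips any
rows whose keys are already present, rather than raising an error:
insert into foo(a, b) values (5, '!'), (6, '@')
on duplicate key ignore;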
This feature offers something that I believe could be reasonably
described as upsert. Consider:
postgres=# create table foo(a int4 primary key, b text);
CREATE TABLE
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 0
Here there are 0 rows affected by the update, because all work was
done in the insert. If I do it again, 2 rows are affected by the
update:
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 2
Obviously, rejects were now projected into the wCTE, and the
underlying rows were locked. The idea is that we can update the rows,
confident that each rejection-causing row will be updated in a
race-free fashion. I personally prefer this to something like MySQL's
INSERT...ON DUPLICATE KEY UPDATE, because it's more flexible. For
example, we could have deleted the locked rows instead, if that
happened to make sense. Making this kind of usage idiomatic feels to
me like the Postgres way to do upsert. Others may differ here. I will
however concede that it'll be unfortunate to not have some MySQL
compatibility, for the benefit of people porting widely used web
frameworks.
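To make the flexibility concrete, here is a sketch of the delete
variant mentioned above, reusing the proposed syntax and the foo table
from the example:
with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
delete from foo using r where foo.a = r.a;
-- The locked rows can be deleted race-free for the same reason that
-- they can be updated race-free above.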
I'm not really sure if I should have done something smarter here than
lock the first duplicate found, or if it's okay that that's all I do.
That's another discussion entirely. Though previously Andres and I did
cover the question of prioritizing unique indexes, so that the most
sensible duplicate for the particular situation was returned,
according to some criteria.
As previously covered, I felt that including a row locking component
was essential to reasoning about our requirements for what I've termed
"speculative insertion" -- the basic implementation of value locking
that is needed to make all this work. As I said in that earlier
thread, there are many opinions about this, and it isn't obvious which
one is right. Any approach needs to have its interactions with row
locking considered right up front. Those who consider this a new
patch with new functionality, or even a premature expansion of what
I've already posted, should carefully consider that. Do we really want
to assume that these two things are orthogonal? I think that they're
probably not, but even if that turns out not to be the case, it's an
unnecessary risk to take.
Row locking
==========
Row locking is implemented with calls to a new function above
ExecInsert. We don't bother with the usual EvalPlanQual looping
pattern for now, preferring to just re-check from scratch if there is
a concurrent update from another session (see comments in
ExecLockHeapTupleForUpdateSpec() for details). We might do better
here. I haven't considered the row locking functionality in too much
detail since the last revision, preferring to focus on value locking.
Buffer locking/value locking
======================
Andres raised concerns about the previous patch's use of exclusive
buffer locks for extended periods (i.e. during a single heap tuple
insertion). These locks served as extended value locks. With this
revision, we don't hold exclusive buffer locks for the duration of
heap insertion - we hold shared buffer locks instead. I believe that
Andres' principal concern was the impact on concurrent index scans by
readers, so I think that all of this will go some way towards
alleviating his concerns generally.
This necessitated inventing entirely new LWLock semantics around
"weakening" (from exclusive to shared) and "strengthening" (from
shared to exclusive) of locks already held. Of course, as you'd
expect, there are some tricky race hazards surrounding these new
functions that clients need to be mindful of. These have been
documented within lwlock.c.
I looked for a precedent for these semantics, and found a few. Perhaps
the most prominent was Boost, a highly regarded, peer-reviewed C++
library. Boost implements exactly these semantics for some of its
thread synchronization/mutex primitives:
They have a concept of upgradable ownership, which is just like shared
ownership, except, I gather, that the owner reserves the exclusive
right to upgrade to an exclusive lock (for them it's not quite an
exclusive lock; it's an upgradeable/downgradable exclusive lock). My
solution is to push that responsibility onto the client - I admonish
something along the lines of "don't let more than one shared locker do
this at a time per LWLock". I am of course mindful of this caveat in
my modifications to the btree code, where I "weaken" and then later
"strengthen" an exclusive lock - the trick here is that before I
weaken I get a regular exclusive lock, and I only actually weaken
after that when going ahead with insertion.
I suspect that this may not be the only place where this trick is helpful.
This intended usage is described in the relevant comments added to lwlock.c.
Testing
======
This time around, in order to build confidence in the new LWLock
infrastructure for buffer locking, on debug builds we re-verify that
the value proposed for insertion on the locked page is in fact not on
that page as expected during the second phase, and that our previous
insertion point calculation is still considered correct. This is kind
of like the way we re-verify the wait-queue-is-in-LSN-order
invariant in syncrep.c on debug builds. It's really a fancier
assertion - it doesn't just test the state of scalar variables.
This was invaluable during development of the new LWLock infrastructure.
Just as before, but this time with just shared buffer locks held
during heap tuple insertion, the patch has resisted considerable
brute-force efforts to break it (e.g. using pgbench to get many
sessions speculatively inserting values into a table, with many
different INSERT... ON DUPLICATE KEY LOCK FOR UPDATE statements
interspersed with UPDATE, DELETE and SELECT statements, while checking
for spurious duplicate tuple insertions, deadlocks and assertion
failures).
As always, isolation tests are included.
Bugs
====
I fixed the bug that Andres reported in relation to multiple unique
indexes' interaction with waits for another transaction's end during
speculative insertion.
I did not get around to fixing the broken ecpg regression tests, as
reported by Peter Eisentraut. I was a little puzzled by the problem
there. I'll return to it in a while, or perhaps someone else can
propose a solution.
Thoughts?
--
Peter Geoghegan
On Sun, Sep 8, 2013 at 10:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
This necessitated inventing entirely new LWLock semantics around
"weakening" (from exclusive to shared) and "strengthening" (from
shared to exclusive) of locks already held. Of course, as you'd
expect, there are some tricky race hazards surrounding these new
functions that clients need to be mindful of. These have been
documented within lwlock.c.
I've since found that I can fairly reliably get this to deadlock at
high client counts (say, 95, which will do it on my 4 core laptop with
a little patience). To get this to happen, I used pgbench with a
single INSERT...ON DUPLICATE KEY IGNORE transaction script. The more
varied workload that I tested this most recent revision (v2) with the
most, with a transaction consisting of a mixture of different
statements (UPDATEs, DELETEs, INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE), did not show the problem.
What I've been doing to recreate this is pgbench runs in an infinite
loop from a bash script, with a new table created for each iteration.
Each iteration has 95 clients "speculatively insert" a total of 1500
possible tuples for 15 seconds. After this period, the table has
exactly 1500 tuples, with primary key values 1 - 1500. Usually, after
about 5 - 20 minutes, deadlock occurs.
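The per-transaction pgbench script is roughly the following (a sketch;
the table and column names here are hypothetical stand-ins, but the
key range matches the description above):
\setrandom k 1 1500
insert into foo(a, b) values (:k, 'x') on duplicate key ignore;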
This was never a problem with the exclusive lock coding (v1),
unsurprisingly - after all, as far as buffer locks are concerned, it
did much the same thing as the existing code.
I've made some adjustments to LWLockWeaken, LWLockStrengthen and
LWLockRelease that made the deadlocks go away. Or at least, no
deadlocks or other problems manifested themselves using the same test
case for over two hours. Attached revision includes these changes, as
well as a few minor comment tweaks here and there.
I am working on an analysis of the broader deadlock hazards - the
implications of simultaneously holding multiple shared buffer locks
(that is, one for every unique index btree leaf page participating in
value locking) for the duration of each heap tuple insertion (each
heap_insert() call). I'm particularly looking for unexpected ways in
which this locking could interact with other parts of the code that
also acquire buffer locks, for example vacuumlazy.c. I'll also try and
estimate how much of a maintainability burden unexpected locking
interactions with these other subsystems might be.
In case it isn't obvious, the deadlocking issue addressed by this
revision is not inherent to my design or anything like that - the bugs
fixed by this revision are entirely confined to lwlock.c.
--
Peter Geoghegan
Hi Peter,
Nice to see the next version, won't have time to look in any details in
the next few days tho.
On 2013-09-10 22:25:34 -0700, Peter Geoghegan wrote:
I am working on an analysis of the broader deadlock hazards - the
implications of simultaneously holding multiple shared buffer locks
(that is, one for every unique index btree leaf page participating in
value locking) for the duration of each heap tuple insertion (each
heap_insert() call). I'm particularly looking for unexpected ways in
which this locking could interact with other parts of the code that
also acquire buffer locks, for example vacuumlazy.c. I'll also try and
estimate how much of a maintainability burden unexpected locking
interactions with these other subsystems might be.
I think for this approach to be workable you also need to explain how we
can deal with stuff like toast insertion that may need to write hundreds
of megabytes all the while leaving an entire value-range of the unique
key share locked.
I still think that even doing a plain heap insertion takes longer than
is acceptable to hold even a share lock over a btree page - but
compared to stuff like toast insertions happening while doing so,
that's peanuts.
The easiest answer is doing the toasting before doing the index locking,
but that will result in bloat, the avoidance of which seems to be the
primary advantage of your approach.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 11, 2013 at 2:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Nice to see the next version, won't have time to look in any details in
the next few days tho.
Thanks Andres!
I think for this approach to be workable you also need to explain how we
can deal with stuff like toast insertion that may need to write hundreds
of megabytes all the while leaving an entire value-range of the unique
key share locked.
Right. That is a question that needs to be addressed in a future revision.
I still think that even doing a plain heap insertion takes longer than
is acceptable to hold even a share lock over a btree page
Well, there is really only one way of judging something like that, and
that's to do a benchmark. I still haven't taken the time to "pick the
low hanging fruit" here that I'd mentioned - there are some fairly
obvious ways to shorten the window in which value locks are held.
Furthermore, I'm sort of at a loss as to what a fair benchmark would
look like - what is actually representative here? Also, what's the
baseline? It's not as if someone has an alternative, competing patch.
We can only hypothesize what additional costs those other approaches
introduce, unless someone has a suggestion as to how they can be
simulated without writing the full patch, which is something I'd
entertain.
As I've already pointed out, all page splits occur with the same
buffer exclusive lock held. Only, in our case, we're weakening that
lock to a shared lock. So I don't think that the heap insertion is
going to be that big of a deal, particularly in the average case.
Having said that, it's a question that surely must be closely examined
before proceeding much further. And yes, the worst case could be
pretty bad, and that surely matters too.
The easiest answer is doing the toasting before doing the index locking,
but that will result in bloat, the avoidance of which seems to be the
primary advantage of your approach.
I would say that the primary advantage of my approach is that it's
much simpler than any other approach that has been considered by
others in the past. The approach is easier to reason about because
it's really just an extension of how btrees already do value locking.
Granted, I haven't adequately demonstrated that things really are so
rosy, but I think I'll be able to. The key point is that with trivial
exception, all other parts of the code, like VACUUM, don't consider
themselves to directly have license to acquire locks on btree buffers
- they go through the AM interface instead. What do they know about
what makes sense for a particular AM? The surface area actually turns
out to be fairly manageable.
With the promise tuple approach, it's more the maintainability
overhead of new *classes* of bloat that I'm concerned about than the
bloat itself, and all the edge cases that are likely to be introduced.
But yes, the overhead of doing all that extra writing (including
WAL-logging twice), and the fact that it all has to happen with an
exclusive lock on the leaf page buffer is also a concern of mine. With
v3 of my patch, we still only have to do all the preliminary work like
finding the right page and verifying that there are no duplicates
once. So with recent revisions, the amount of time spent exclusive
locking with my proposed approach is now approximately half the time
of alternative proposals (assuming no page split is necessary). In the
worst case, the number of values locked on the leaf page is quite
localized and manageable, as a natural consequence of the fact that
it's a btree leaf page. I haven't run any numbers, but for an int4
btree (which really is the worst case here), 200 or so read-locked
values would be quite close to as bad as things got. Plus, if there
isn't a second phase of locking, which is on average a strong
possibility, those locks would be hardly held at all - contrast that
with having to do lots of exclusive locking for all that clean-up.
I might experiment with weakening the exclusive lock even earlier in
my next revision, and/or strengthening later. Off hand, I can't see a
reason for not weakening after we find the first leaf page that the
key might be on (granted, I haven't thought about it that much) -
_bt_check_unique() does not have license to alter the buffer already
proposed for insertion. Come to think of it, all of this new buffer
lock weakening/strengthening stuff might independently justify itself
as an optimization to regular btree index tuple insertion. That's a
whole other patch, though -- it's a big ambition to have as a sort of
incidental adjunct to what is already a big, complex patch.
In practice the vast majority of insertions don't involve TOASTing.
That's not an excuse for allowing the worst case to be really bad in
terms of its impact on query response time, but it may well justify
having whatever ameliorating measures we take result in bloat. It's at
least the kind of bloat we're more or less used to dealing with, and
have already invested a lot in controlling. Plus bloat-wise it can't
be any worse than just inserting the tuple and having the transaction
abort on a duplicate, since that already happens after toasting has
done its work with regular insertion.
--
Peter Geoghegan
On Wed, Sep 11, 2013 at 8:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
In practice the vast majority of insertions don't involve TOASTing.
That's not an excuse for allowing the worst case to be really bad in
terms of its impact on query response time, but it may well justify
having whatever ameliorating measures we take result in bloat. It's at
least the kind of bloat we're more or less used to dealing with, and
have already invested a lot in controlling. Plus bloat-wise it can't
be any worse than just inserting the tuple and having the transaction
abort on a duplicate, since that already happens after toasting has
done its work with regular insertion.
Andres is being very polite here, but the reality is that this
approach has zero chance of being accepted. You can't hold buffer
locks for a long period of time across complex operations. Full stop.
It's a violation of the rules that are clearly documented in
src/backend/storage/buffer/README, which have been in place for a very
long time, and this patch is nowhere near important enough to warrant
a revision of those rules. We are not going to risk breaking every
bit of code anywhere in the backend or in third-party code that takes
a buffer lock. You are never going to convince me, or Tom, that the
benefit of doing that is worth the risk; in fact, I have a hard time
believing that you'll find ANY committer who thinks this approach is
worth considering.
Even if you get the code to run without apparent deadlocks, that
doesn't mean there aren't any; it just means that you haven't found
them all yet. And even if you managed to squash every such hazard
that exists today, so what? Fundamentally, locking protocols that
don't include deadlock detection don't scale. You can use such locks
in limited contexts where proofs of correctness are straightforward,
but trying to stretch them beyond that point results not only in bugs,
but also in bad performance and unmaintainable code. With a much more
complex locking regimen, even if your code is absolutely bug-free,
you've put a burden on the next guy who wants to change anything; how
will he avoid breaking things? Our buffer locking regimen suffers
from painful complexity and serious maintenance difficulties as is.
Moreover, we've already got performance and scalability problems that
are attributable to every backend in the system piling up waiting on a
single lwlock, or a group of simultaneously-held lwlocks.
Dramatically broadening the scope of where lwlocks are used and for
how long they're held is going to make that a whole lot worse. What's
worse, the problems will be subtle, restricted to the people using
this feature, and very difficult to measure on production systems, and
I have no confidence they'd ever get fixed.
A further problem is that a backend which holds even one lwlock can't
be interrupted. We've had this argument before and it seems that you
don't think that non-interruptibility is a problem, but it is project
policy to allow for timely interrupts in all parts of the backend and
we're not going to change that policy for this patch. Heavyweight
locks are heavy weight precisely because they provide services - like
deadlock detection, satisfactory interrupt handling, and, also
importantly, FIFO queuing behavior - that are *important* for locks
that are held over an extended period of time. We're not going to go
put those services into the lightweight lock mechanism because then it
would no longer be light weight, and we're not going to ignore the
importance of them, either.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Andres is being very polite here, but the reality is that this
approach has zero chance of being accepted.
I quite like Andres, but I have yet to see him behave as you describe
in a situation where someone proposed what was fundamentally a bad
idea. Maybe you should let him speak for himself?
You can't hold buffer
locks for a long period of time across complex operations. Full stop.
It's a violation of the rules that are clearly documented in
src/backend/storage/buffer/README, which have been in place for a very
long time, and this patch is nowhere near important enough to warrant
a revision of those rules.
The importance of this patch is a value judgement. Our users have been
screaming for this for over ten years, so to my mind it has a fairly
high importance. Also, every other database system of every stripe
worth mentioning has something approximately equivalent to this,
including ones with much less functionality generally. The fact that
we don't is a really unfortunate omission.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, not once I've better
managed the amount of work that a heap_insert() can do. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
We are not going to risk breaking every
bit of code anywhere in the backend or in third-party code that takes
a buffer lock. You are never going to convince me, or Tom, that the
benefit of doing that is worth the risk; in fact, I have a hard time
believing that you'll find ANY committer who thinks this approach is
worth considering.
I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.
Even if you get the code to run without apparent deadlocks, that
doesn't mean there aren't any;
Of course it doesn't. Who said otherwise?
Our buffer locking regimen suffers
from painful complexity and serious maintenance difficulties as is.
That's true to a point, but it has more to do with things like how
VACUUM interacts with hio.c. Things like this:
/*
* Release the file-extension lock; it's now OK for someone else to extend
* the relation some more. Note that we cannot release this lock before
* we have buffer lock on the new page, or we risk a race condition
* against vacuumlazy.c --- see comments therein.
*/
if (needLock)
UnlockRelationForExtension(relation, ExclusiveLock);
The btree code is different, though: It implements a well-defined
interface, with much clearer separation of concerns. As I've said
already, with trivial exception (think contrib), no external code
considers itself to have license to obtain locks of any sort on btree
buffers. No external code of ours - without exception - does anything
with multiple locks, or exclusive locks on btree buffers. I'll remind
you that I'm only holding shared locks when control is outside of the
btree code.
Even within the btree code, the number of access method functions that
could conflict with what I do here (that acquire exclusive locks) is
very small when you exclude things that only exclusive lock the
meta-page (there are also very few of those). So the surface area is
quite small.
I'm not denying that there is a cost, or that I haven't expanded
things in a direction I'd prefer not to. I just think that it may well
be worth it, particularly when you consider the alternatives - this
may well be the least worst thing. I mean, if we do the promise tuple
thing, and there are multiple unique indexes, what happens when an
inserter needs to block pending the outcome of another transaction?
They had better go clean up the promise tuples from the other unique
indexes that they're trying to insert into, because they cannot afford
to hold value locks for a long time, no matter how they're
implemented. That could take much longer than just releasing a shared
buffer lock, since for each unique index the promise tuple must be
re-found from scratch. There are huge issues with additional
complexity and bloat. Oh, and now your lightweight locks aren't so
lightweight any more.
If the value locks were made interruptible through some method, such
as the promise tuples approach, does that really make deadlocking
acceptable? So at least your system didn't seize up. But on the other
hand, the user randomly had a deadlock error through no fault of their
own. The former may be worse, but the latter is also inexcusable. In
general, the best solution is just to not have deadlock hazards. I
wouldn't be surprised if reasoning about deadlocking was harder with
that alternative approach to value locking, not easier.
Moreover, we've already got performance and scalability problems that
are attributable to every backend in the system piling up waiting on a
single lwlock, or a group of simultaneously-held lwlocks.
Dramatically broadening the scope of where lwlocks are used and for
how long they're held is going to make that a whole lot worse.
You can hardly compare a buffer's LWLock with a system one that
protects critical shared memory structures. We're talking about a
shared lock on a single btree leaf page per unique index involved in
upserting.
A further problem is that a backend which holds even one lwlock can't
be interrupted. We've had this argument before and it seems that you
don't think that non-interruptibility is a problem, but it is project
policy to allow for timely interrupts in all parts of the backend and
we're not going to change that policy for this patch.
I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering trade-off.
Of course I value interruptibility.
If you're concerned about non-interruptibility, consider XLogFlush().
That does rather a lot of work with WALWriteLock exclusive locked. On
a busy system, some backend is very frequently going to experience a
non-interruptible wait for the duration of however long it takes to
write and flush perhaps a whole segment. All other flushing backends
are stuck in non-interruptible waits waiting for that backend to
finish. I think that the group commit stuff might have regressed
worst-case interruptibility for flushers by quite a bit; should we
have never committed that, or do you agree with my view that it's
worth it?
In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked. And only those inserting into a narrow
range of values around the btree leaf page. Much of the work that even
those writers need to do will be unimpeded anyway; they'll just block
on attempting to acquire an exclusive lock on the first btree leaf
page that the value they're inserting could be on. And the additional
non-interruptible wait of those inserters won't be terribly much more
than the wait of the backend where heap tuple insertion takes a long
time anyway - that guy already has to do close to 100% of that work
with a non-interruptible wait today (once we eliminate
heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is
right at the end of heap_insert, and the buffer is pinned and locked
very close to the start.
--
Peter Geoghegan
* Peter Geoghegan (pg@heroku.com) wrote:
I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.
It was my first concern regarding this patch.
Thanks,
Stephen
On Fri, Sep 13, 2013 at 12:14 PM, Stephen Frost <sfrost@snowman.net> wrote:
It was my first concern regarding this patch.
It was my first concern too.
--
Peter Geoghegan
On 2013-09-13 11:59:54 -0700, Peter Geoghegan wrote:
On Fri, Sep 13, 2013 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Andres is being very polite here, but the reality is that this
approach has zero chance of being accepted.
I quite like Andres, but I have yet to see him behave as you describe
in a situation where someone proposed what was fundamentally a bad
idea. Maybe you should let him speak for himself?
Unfortunately I have to agree with Robert here; I think it's a complete
no-go to do what you propose so far, and I've several times now presented
arguments why I think so.
The reason I wasn't saying "this will never get accepted" are twofold:
a) I don't want to stiffle alternative ideas to the "promises" idea,
just because I think it's the way to go. That might stop a better idea
from being articulated. b) I am not actually in the position to say it's
not going to be accepted.
*I* think that unless you make some fundamental and very, very clever
modifications to your algorithm that end up *not holding a lock over
other operations at all*, it's not going to get committed. And I'll chip
in with my -1.
And clever modification doesn't mean slightly restructuring heapam.c's
operations.
The importance of this patch is a value judgement. Our users have been
screaming for this for over ten years, so to my mind it has a fairly
high importance. Also, every other database system of every stripe
worth mentioning has something approximately equivalent to this,
including ones with much less functionality generally. The fact that
we don't is a really unfortunate omission.
I agree it's quite important, but that doesn't mean we have to do stuff
that we think is unacceptable, especially as there *are* other ways to
do it.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, not once I've better
managed the amount of work that a heap_insert() can do. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
I personally think - and have stated so before - that doing a
heap_insert() while holding the btree lock is unacceptable.
The btree code is different, though: It implements a well-defined
interface, with much clearer separation of concerns.
Which you're completely violating by linking the btree buffer locking
with the heap locking. It's not about the btree code alone.
At this point I am a bit confused why you are asking for review.
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a long
time, no matter how they're implemented.
Why? We're using normal transaction visibility rules here. We don't stop
*other* values on the same index getting updated or similar.
And anyway. It doesn't matter which problem the "promises" idea
has. We're discussing your proposal here.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The reason I wasn't saying "this will never get accepted" are twofold:
a) I don't want to stiffle alternative ideas to the "promises" idea,
just because I think it's the way to go. That might stop a better idea
from being articulated. b) I am not actually in the position to say it's
not going to be accepted.
Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it. After all, it was the first thing that
I considered, and I'm on the record talking about it in the 2012 dev
meeting. I didn't take that approach for many good reasons.
The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, not once I've better
managed the amount of work that a heap_insert() can do. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
I personally think - and have stated so before - that doing a
heap_insert() while holding the btree lock is unacceptable.
Presumably your reason is essentially that we exclusive lock a heap
buffer (exactly one heap buffer) while holding shared locks on btree
index buffers. Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.
Now, I'll grant you that there is one appreciable difference, which is
that multiple unique indexes may be involved. But limiting ourselves
to the primary key or something like that remains an option. And I'm
not sure that it's really any worse anyway.
The btree code is different, though: It implements a well-defined
interface, with much clearer separation of concerns.
Which you're completely violating by linking the btree buffer locking
with the heap locking. It's not about the btree code alone.
You're right that it isn't about just the btree code.
In order for a deadlock to occur, there must be a mutual dependency.
What code could feel entitled to hold buffer locks on btree buffers
and heap buffers at the same time except the btree code itself? It
already does so. But no one else does the same thing. If anyone did
anything with a heap buffer lock held that could result in a call into
one of the btree access method functions (I'm not contemplating the
possibility of this other code locking the btree buffer *directly*),
I'm quite sure that that would be rejected outright today, because
that causes deadlocks. Certainly, vacuumlazy.c doesn't do it, for
example. Why would anyone ever want to do that anyway? I cannot think
of any reason. I suppose that that does still leave "transitive
dependencies", but now you're stretching things. After all, you're not
supposed to hold buffer locks for long! The dependency would have to
transit through, say, one of the system LWLocks used for WAL logging.
Seems pretty close to impossible that it'd be an issue - index stuff
is only WAL-logged as index tuples are inserted (that is, as the locks
are finally released). Everyone automatically does that kind of thing
in a consistent order of locking, unlocking in the opposite order
anyway.
But what of the btree code deadlocking with itself? There are only a
few functions (2 or 3) where that's possible even in principle. I
think that they're going to be not too hard to analyze. For example,
with insertion, the trick is to always lock in a consistent order and
unlock/insert in the opposite order. The heap shared lock(s) needed in
the btree code cannot deadlock with another upserter because once the
other upserter has that exclusive heap buffer lock, it's *inevitable*
that it will release all of its shared buffer locks. Granted, I could
stand to think about this case more, but you get the idea - it *is*
possible to clamp down on the code that needs to care about this stuff
to a large degree. It's subtle, but btrees are generally considered
pretty complicated, and the btree code already cares about some odd
cases like these (it takes special precautions for catalog indexes,
for example).
The really weird thing about my patch is that the btree code trusts
the executor to call the heapam code to do the right thing in the
right way - it now knows more than I'd prefer. Would you be happier if
the btree code took more direct responsibility for the heap tuple
insertion instead? Before you say "that's ridiculous", consider the
big modularity violation that has always existed. It may be no more
ridiculous than that. And that existing state of affairs may be no
less ridiculous than living with what I've already done.
At this point I am a bit confused why you are asking for review.
I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussion
on this down through the years, but nothing ever came from it.
Why is this an intractable problem for over a decade for us alone? Why
isn't this a problem for other database systems? I'm not implying that
it's because they do this. It's something that I am earnestly
interested in, though. A number of people have asked me that, and I
don't have a good answer for them.
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a long
time, no matter how they're implemented.
Why? We're using normal transaction visibility rules here. We don't stop
*other* values on the same index getting updated or similar.
Because you're locking a value in some other, earlier unique index,
all the while waiting *indefinitely* on some other value in a second
or subsequent one. That isn't acceptable. A bunch of backends would
back up just because one backend had this contention on the second
unique index value that the others didn't actually have themselves. My
design allows those other backends to immediately go through and
finish.
Value locks have these kinds of hazards no matter how you implement
them. Deadlocks, and unreasonable stalling as described here, are always
unacceptable - whether or not the problems are detected at runtime is
ultimately of marginal interest. Either way, it's a bug.
I think that the details of how this approach compare to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?". Am I not allowed to propose a
design to drive discussion? That's how the most important features get
implemented around here.
--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> wrote:
we exclusive lock a heap buffer (exactly one heap buffer) while
holding shared locks on btree index buffers. Is that really so
different to holding an exclusive lock on a btree buffer while
holding a shared lock on a heap buffer? Because that's what
_bt_check_unique() does today.
Is it possible to get a deadlock doing only one of those two
things? Is it possible to avoid a deadlock doing both of them?
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 13, 2013 at 2:59 PM, Peter Geoghegan <pg@heroku.com> wrote:
I would suggest letting those other individuals speak for themselves
too. Particularly if you're going to name someone who is on vacation
like that.
You seem to be under the impression that I'm mentioning Tom's name, or
Andres's, because I need to win some kind of an argument. I don't.
We're not going to accept a patch that uses lwlocks in the way that
you are proposing.
I mean, if we do the promise tuple
thing, and there are multiple unique indexes, what happens when an
inserter needs to block pending the outcome of another transaction?
They had better go clean up the promise tuples from the other unique
indexes that they're trying to insert into, because they cannot afford
to hold value locks for a long time, no matter how they're
implemented.
As Andres already pointed out, this is not correct. Just to add to
what he said, we already have long-lasting value locks in the form of
SIREAD locks. SIREAD locks can exist at different levels of granularity, but
one of those levels is index-page-level granularity, where they have
the function of guarding against concurrent insertions of values that
would fall within that page, which just so happens to be the same
thing you want to do here. The difference between those locks and
what you're proposing here is that they are implemented differently.
That is why those were acceptable and this is not.
That could take much longer than just releasing a shared
buffer lock, since for each unique index the promise tuple must be
re-found from scratch. There are huge issues with additional
complexity and bloat. Oh, and now your lightweight locks aren't so
lightweight any more.
Yep, totally agreed. If you simply lock the buffer, or take some
other action which freezes out all concurrent modifications to the
page, then re-finding the lock is much simpler. On the other hand,
it's much simpler precisely because you've reduced concurrency to the
degree necessary to make it simple. And reducing concurrency is bad.
Similarly, complexity and bloat are not great things taken in
isolation, but many of our existing locking schemes are already very
complex. Tuple locks result in a complex jig that involves locking
the tuple via the heavyweight lock manager, performing a WAL-logged
modification to the page, and then releasing the lock in the
heavyweight lock manager. As here, that is way more expensive than
simply grabbing and holding a share-lock on the page. But we get a
number of important benefits out of it. The backend remains
interruptible while the tuple is locked, the protocol for granting
locks is FIFO to prevent starvation, we don't suppress page eviction
while the lock is held, we can simultaneously lock arbitrarily large
numbers of tuples, and deadlocks are detected and handled cleanly. If
those requirements were negotiable, we would surely have negotiated
them away already, because the performance benefits would be immense.
If the value locks were made interruptible through some method, such
as the promise tuples approach, does that really make deadlocking
acceptable?
Yes. It's not possible to prevent all deadlocks. It IS possible to
make sure that they are properly detected and that precisely one of
the transactions involved is rolled back to resolve the deadlock.
You can hardly compare a buffer's LWLock with a system one that
protects critical shared memory structures. We're talking about a
shared lock on a single btree leaf page per unique index involved in
upserting.
Actually, I can and I am. Buffers ARE critical shared memory structures.
A further problem is that a backend which holds even one lwlock can't
be interrupted. We've had this argument before and it seems that you
don't think that non-interruptibility is a problem, but it is project
policy to allow for timely interrupts in all parts of the backend and
we're not going to change that policy for this patch.
I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering trade-off.
Of course I value interruptibility.
I don't see what's inflammatory about that statement. The point is
that this isn't the first time you've proposed a change which would
harm interruptibility and it isn't the first time I've objected on
precisely that basis. Interruptibility is not a nice-to-have that we
can trade away from time to time; it's essential and non-negotiable.
If you're concerned about non-interruptibility, consider XLogFlush().
That does rather a lot of work with WALWriteLock exclusive locked. On
a busy system, some backend is very frequently going to experience a
non-interruptible wait for the duration of however long it takes to
write and flush perhaps a whole segment. All other flushing backends
are stuck in non-interruptible waits waiting for that backend to
finish. I think that the group commit stuff might have regressed
worst-case interruptibility for flushers by quite a bit; should we
have never committed that, or do you agree with my view that it's
worth it?
It wouldn't take a lot to convince me that it wasn't worth it, because
I was never all that excited about that patch to begin with. I think
it mostly helps in extremely artificial situations that are not likely
to occur on real systems anyway. But, yeah, WALWriteLock is a
problem, no doubt about it. We should try to make the number of such
problems go down, not up, even if it means passing up new features
that we'd really like to have.
In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked. And only those inserting into a narrow
range of values around the btree leaf page. Much of the work that even
those writers need to do will be unimpeded anyway; they'll just block
on attempting to acquire an exclusive lock on the first btree leaf
page that the value they're inserting could be on.
Sure, but you're talking about broadening the problem from the guy
performing the insert to everybody who might be trying to do an insert
that hits one of the same unique-index pages. Instead of holding one
buffer lock, the guy performing the insert is now holding as many
buffer locks as there are indexes. That's a non-trivial issue.
For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
you'll error out. In fact, if you get the number of indexes exactly
right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
panic the whole system.
Oh, and if different backends load the index list in different orders,
because say the system catalog gets vacuumed between their respective
relcache loads, then they may try to lock the indexes in different
orders and cause an undetected deadlock.
And, drifting a bit further off-topic, even to get as far as you have,
you've added overhead to every lwlock acquisition and release, even
for people who never use this functionality. I'm pretty skeptical
about anything that involves adding additional frammishes to the
lwlock mechanism. There are a few new primitives I'd like, too, but
every one we add slows things down for everybody.
And the additional
non-interruptible wait of those inserters won't be terribly much more
than the wait of the backend where heap tuple insertion takes a long
time anyway - that guy already has to do close to 100% of that work
with a non-interruptible wait today (once we eliminate
heap_prepare_insert() and toasting). The UnlockReleaseBuffer() call is
right at the end of heap_insert, and the buffer is pinned and locked
very close to the start.
That's true but somewhat misleading. Textually most of the function
holds the buffer lock, but heap_prepare_insert(),
CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and
XLogWrite() are the parts that do substantial amounts of computation,
and only the last of those happens while holding the buffer lock. And
that last is really fundamental, because we can't let any other
backend see the modified buffer until we've xlog'd the changes. The
problems you're proposing to create do not fall into the same
category.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I haven't read the patch and the btree code is an area I really don't know,
so take this for what it's worth....
It seems to me that the nature of the problem is that there will
unavoidably be a nexus between the two parts of the code here. We can try
to isolate it as much as possible but we're going to need a bit of a
compromise.
I'm imagining a function that takes two target heap buffers and a btree
key. It would descend the btree and, holding the leaf page lock, do a
try_lock on the heap pages. If it fails to get the locks then it releases
whatever it got and returns for the heap update to find new pages and try
again.
This still leaves the potential problem with page splits and I assume it
would still be tricky to call it without unsatisfactorily mixing executor
and btree code. But that's as far as I got.
--
greg
On 2013-09-14 09:57:43 +0100, Greg Stark wrote:
It seems to me that the nature of the problem is that there will
unavoidably be a nexus between the two parts of the code here. We can try
to isolate it as much as possible but we're going to need a bit of a
compromise.
I think Robert's point and mine is that there are several ways to
approach this without doing that.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-09-13 14:41:46 -0700, Peter Geoghegan wrote:
On Fri, Sep 13, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The reason I wasn't saying "this will never get accepted" are twofold:
a) I don't want to stiffle alternative ideas to the "promises" idea,
just because I think it's the way to go. That might stop a better idea
from being articulated. b) I am not actually in the position to say it's
not going to be accepted.Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it.
Even if you disagree, I still think that doesn't matter in the very
least. You say:
I think that the details of how this approach compare to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.
Well, the two other people involved in the discussion so far have gone
on the record saying that the presented approach is not acceptable to
them. And you haven't started reacting to that.
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?".
But the way you're discussing it so far is exactly reducing it that way.
If you want the discussion to be about *how* we can implement it such
that the various concerns are addressed: fsck*ing great. I am with you
there. In the end, even though I have my usual strong opinions about
which is the best way, I don't care which algorithm gets pursued
further. At least, if, and only if, it has a fighting chance of getting
committed. Which this doesn't.
After all, it was the first thing that
I considered, and I'm on the record talking about it in the 2012 dev
meeting. I didn't take that approach for many good reasons.
Well, I wasn't there when you said that ;)
The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.
I think you're massively, massively, massively overstating the dangers
of bloat here. It's a known problem that's *NOT* made worse by any of
the other proposals if you compare them with the loop/lock/catch
implementation of upsert that we have today as the only option. And we
*DO* have infrastructure to deal with bloat, even if it could use some
improvement. We *don't* have infrastructure to deal with deadlocks on
lwlocks. And we're not going to get that infrastructure, because it
would even further remove the "lw" part of lwlocks.
As to the rules you refer to, you must mean "These locks are intended
to be short-term: they should not be held for long". I don't think
that they will ever be held for long. At least, not once I've better
managed the amount of work that a heap_insert() can do. I expect to produce
a revision where toasting doesn't happen with the locks held soon.
Actually, I've already written the code, I just need to do some
testing.
I personally think - and have stated so before - that doing a
heap_insert() while holding the btree lock is unacceptable.
Presumably your reason is essentially that we exclusive lock a heap
buffer (exactly one heap buffer) while holding shared locks on btree
index buffers.
It's that it interleaves an already complex but local locking scheme
that required several years to become correct with another that is just
the same. That's an utterly horrid idea.
Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.
Yes, it is different. But, in my opinion, _bt_check_unique() doing so
is a bug that needs fixing. Not something that we want to extend.
(Note that _bt_check_unique() already needs to deal with the fact that
it reads an unlocked page, because it moves right in some cases)
And, as you say:
Now, I'll grant you that there is one appreciable difference, which is
that multiple unique indexes may be involved. But limiting ourselves
to the primary key or something like that remains an option. And I'm
not sure that it's really any worse anyway.
I don't think that's an acceptable limitation. If it were something we
could lift in a release or two, maybe, but that's not what you're
talking about.
At this point I am a bit confused why you are asking for review.
I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussion
on this down through the years, but nothing ever came from it.
At the moment ISTM you're not conceding on *ANY* points. That's not very
often the way to find consensus.
Why is this an intractable problem for over a decade for us alone? Why
isn't this a problem for other database systems? I'm not implying that
it's because they do this. It's something that I am earnestly
interested in, though. A number of people have asked me that, and I
don't have a good answer for them.
Afaik all those go the route of bloat, don't they? Also, at least in the
past, MySQL had a long list of caveats around it...
I mean, if we do the promise tuple thing, and there are multiple
unique indexes, what happens when an inserter needs to block pending
the outcome of another transaction? They had better go clean up the
promise tuples from the other unique indexes that they're trying to
insert into, because they cannot afford to hold value locks for a long
time, no matter how they're implemented.
Why? We're using normal transaction visibility rules here. We don't stop
*other* values on the same index getting updated or similar.
Because you're locking a value in some other, earlier unique index,
all the while waiting *indefinitely* on some other value in a second
or subsequent one. That isn't acceptable. A bunch of backends would
back up just because one backend had this contention on the second
unique index value that the others didn't actually have themselves. My
design allows those other backends to immediately go through and
finish.
That argument doesn't make sense to me. You're inserting a unique
value. It completely makes sense that you can only insert one of
them. If it's unclear whether you can insert, you're going to have to
wait. That's why they are UNIQUE, after all. You're describing a
complete non-advantage here. It's also how unique indexes already work.
Also note that waits on xids are properly supervised by deadlock
detection.
Even if it had an advantage, not blocking *for the single unique key alone*
opens you to issues of livelocks where several backends retry because of
each other indefinitely.
Value locks have these kinds of hazards no matter how you implement
them. Deadlocks, and unreasonable stalling as described here is always
unacceptable - whether or not the problems are detected at runtime is
ultimately of marginal interest. Either way, it's a bug.
Whether Postgres locks up in a way that can only be resolved by kill
-9, or whether it aborts a transaction - those are, like, a couple of
orders of magnitude apart.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Sep 14, 2013 at 12:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I mean, if we do the promise tuple
thing, and there are multiple unique indexes, what happens when an
inserter needs to block pending the outcome of another transaction?
They had better go clean up the promise tuples from the other unique
indexes that they're trying to insert into, because they cannot afford
to hold value locks for a long time, no matter how they're
implemented.

As Andres already pointed out, this is not correct.

While skipping that cleanup wouldn't be incorrect, doing it certainly
would be useful for preventing deadlocks and unnecessary contention. In
a world where
people expect either an insert or an update, we ought to try and
reduce contention across multiple unique indexes. I can understand why
that doesn't matter today, though - if you're going to insert
duplicates indifferent to whether or not there will be conflicts,
that's a kind of abuse, and not worth optimizing - it seems probable that
most transactions will commit. However, it seems much less probable
that most upserters will insert. People may well wish to upsert all
the time where an insert is hardly ever necessary, which is one reason
why I have doubts about other proposals.
Note that today there is no guarantee that the original waiter for a
duplicate-inserting xact to complete will be the first one to get a
second chance, so I think it's hard to question this on correctness
grounds. Even if they are released in FIFO order, there is no reason
to assume that the first waiter will win the race with a second. Most
obviously, the second waiter may never even get the chance to block
on the same xid at all (so it's not really a waiter at all) and still
be able to insert, if the blocking-xact aborts after the second
"waiter" starts its descent but before it checks uniqueness. All this,
even though the second "waiter" arrived maybe minutes after the first.
What I'm talking about here is really unlikely to result in lock
starvation, because the original waiter typically gets to observe the
other waiter go through, and that's reason enough to give up entirely.
Now, it's kind of weird that the original waiter will still end up
blocking on the xid that caused it to wait in the first instance. So
there should be more thought put into that, like remembering the xid
and only waiting on it on a retry, or some similar scheme. Maybe you
could contrive a scenario where this causes lock starvation, but I
suspect you could do the same thing for the present btree insertion
code.
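
To make that idea concrete, here's a minimal sketch - entirely
hypothetical, not something in the patch; the "spec" state and its
prev_xwait field are invented for illustration, and only
XactLockTableWait() exists today:

/*
 * On the first conflict, remember the blocker and retry immediately
 * rather than sleeping; only wait if the very same xid blocks us a
 * second time.
 */
if (TransactionIdIsValid(spec->prev_xwait) &&
    TransactionIdEquals(xwait, spec->prev_xwait))
    XactLockTableWait(xwait);   /* same blocker twice: wait it out */
else
{
    spec->prev_xwait = xwait;   /* remember it, retry without sleeping */
    goto retry;
}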
Just to add to
what he said, we already have long-lasting value locks in the form of
SIREAD locks. SIREAD can exist at different levels of granularity, but
one of those levels is index-page-level granularity, where they have
the function of guarding against concurrent insertions of values that
would fall within that page, which just so happens to be the same
thing you want to do here. The difference between those locks and
what you're proposing here is that they are implemented differently.
That is why those were acceptable and this is not.
As the implementer of this patch, I'm obligated to put some checks in
unique index insertion that everyone has to care about. There is no
way around that. Complexity issues aside, I think that an argument
could be made for this approach *reducing* the impact on concurrency
relative to other approaches, if there aren't too many unique indexes
to deal with, which is the case the vast majority of the time. I mean,
those other approaches necessitate doing so much more with *exclusive*
locks held. Like inserting, maybe doing a page split, WAL-logging, all
with the lock, and then either updating in place or killing the
promise tuple, and WAL-logging that, with an exclusive lock held the
second time around. Plus searching for everything twice. I think that
frequently killing all of those broken-promise tuples could have
deleterious effects on concurrency, and/or cause index bloat of the
kind only remedied by REINDEX. Do you update the freespace map too? More
exclusive locks! Or if you leave it up to VACUUM (and just set the xid
to InvalidXid, which is still extra work), autovacuum has to care
about a new *class* of bloat - index-only bloat. Plus lots of dead
duplicates are bad for performance in btrees generally.
As here, that is way more expensive than
simply grabbing and holding a share-lock on the page. But we get a
number of important benefits out of it. The backend remains
interruptible while the tuple is locked, the protocol for granting
locks is FIFO to prevent starvation, we don't suppress page eviction
while the lock is held, we can simultaneously lock arbitrarily large
numbers of tuples, and deadlocks are detected and handled cleanly. If
those requirements were negotiable, we would surely have negotiated
them away already, because the performance benefits would be immense.
False equivalence. We only need to lock as many unique index *values*
(not tuples) as are proposed for insertion per slot (which can be
reasonably bound), and only for an instant. Clearly it would be
totally unacceptable if tuple-level locks made backends
uninterruptible indefinitely. Of course, this is nothing like that.
If the value locks were made interruptible through some method, such
as the promise tuples approach, does that really make deadlocking
acceptable?

Yes. It's not possible to prevent all deadlocks. It IS possible to
make sure that they are properly detected and that precisely one of
the transactions involved is rolled back to resolve the deadlock.
You seem to have misunderstood me here, or perhaps I was unclear. I'm
referring to deadlocks that cannot really be predicted or analyzed by
the user at all - see my comments below on insertion order.
I don't think non-interruptibility is a problem? Really, do you think
that this kind of inflammatory rhetoric helps anybody? I said nothing
of the sort. I recall saying something about an engineering trade-off.
Of course I value interruptibility.

I don't see what's inflammatory about that statement.

The fact that you stated, without qualification, that I don't think
non-interruptibility is a problem, obviously.
Interruptibility is not a nice-to-have that we
can trade away from time to time; it's essential and non-negotiable.
I seem to recall you saying something about the Linux kernel and their
attitude to interruptibility. Yes, interruptibility is not just a
nice-to-have; it is essential. However, without dismissing your
other concerns, I have yet to hear a convincing argument as to why
anything I've done here is going to make any difference to
interruptibility that would be appreciable to any human. So far it's
been a slippery slope type argument that can be equally well used to
argue against some facet of almost any substantial patch ever
proposed. I just don't think that regressing interruptibility
marginally is *necessarily* sufficient justification for rejecting an
approach outright. FYI, *that's* how I value interruptibility
generally.
In contrast, what I've proposed here is in general quite unlikely to
result in any I/O for the duration of the time the locks are held.
Only writers will be blocked. And only those inserting into a narrow
range of values around the btree leaf page. Much of the work that even
those writers need to do will be unimpeded anyway; they'll just block
on attempting to acquire an exclusive lock on the first btree leaf
page that the value they're inserting could be on.

Sure, but you're talking about broadening the problem from the guy
performing the insert to everybody who might be trying to perform an
insert
that hits one of the same unique-index pages.
In general, that isn't that much worse than just blocking the value
directly. The number of possible values that could also be blocked is
quite low. The chances of it actually mattering that those additional
values are locked in the still small window in which the buffer locks
are held is generally fairly low, particularly on larger tables
where there is naturally a large number of possible distinct values. I
will however concede that the impact on inserters that want to insert
a non-locked value that belongs on the locked page or its child might
be worse, but it's already a problem that inserted index tuples can
all end up on the same page, if not to the same extent.
Instead of holding one
buffer lock, the guy performing the insert is now holding as many
buffer locks as there are indexes. That's a non-trivial issue.
Actually, as many buffer locks as there are *unique* indexes. It might
be a non-trivial issue, but this whole problem is decidedly
non-trivial, as I'm sure we can all agree.
For that matter, if the table has more than MAX_SIMUL_LWLOCKS indexes,
you'll error out. In fact, if you get the number of indexes exactly
right, you'll exceed MAX_SIMUL_LWLOCKS in visibilitymap_clear() and
panic the whole system.
Oh, come on. We can obviously engineer a solution to that problem. I
don't think I've ever seen a table with close to 100 *unique* indexes.
4 or 5 is a very high number. If we just raised an error when someone
tried to do this with more than 10 unique indexes, I would guess we'd
get exactly zero complaints about it.
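
For illustration, a guard of that sort would only be a few lines - this
is a sketch with an invented function name and limit, not something in
the patch:

/*
 * Hypothetical guard: refuse the optimization outright rather than
 * risk exceeding the number of LWLocks we can safely hold at once.
 */
#define MAX_UPSERT_UNIQUE_INDEXES 10

static void
check_unique_index_count(int nunique)
{
    if (nunique > MAX_UPSERT_UNIQUE_INDEXES)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("cannot lock values in more than %d unique indexes",
                        MAX_UPSERT_UNIQUE_INDEXES)));
}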
Oh, and if different backends load the index list in different orders,
because say the system catalog gets vacuumed between their respective
relcache loads, then they may try to lock the indexes in different
orders and cause an undetected deadlock.
Undetected deadlock is really not much worse than detected deadlock
here. Either way, it's a bug. And it's something that any kind of
implementation will need to account for. It's not okay to
*unpredictably* deadlock, in a way that the user has no control over.
Today, someone can do an analysis of their application and eliminate
deadlocks if they need to. That might not be terribly practical much
of the time, but it can be done. It certainly is practical to do it in
a localized way. I wouldn't like to compromise that.
So yes, you're right that I need to control for this sort of thing
better than in the extant patch, and in fact this was discussed fairly
early on. But it's an inherent problem.
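
One obvious candidate (a sketch of a possible mitigation, not what the
extant patch does) is to put the unique indexes into a canonical order
before taking any value locks, so that no two backends can ever lock
the same pair of indexes in opposite orders:

/*
 * Sort unique index OIDs so that every backend acquires value locks
 * in the same (ascending OID) order, no matter what order its
 * relcache load happened to produce.
 */
static int
index_oid_cmp(const void *a, const void *b)
{
    Oid oa = *(const Oid *) a;
    Oid ob = *(const Oid *) b;

    return (oa < ob) ? -1 : ((oa > ob) ? 1 : 0);
}

static void
canonicalize_value_lock_order(Oid *uniqueIndexOids, int nindexes)
{
    qsort(uniqueIndexOids, nindexes, sizeof(Oid), index_oid_cmp);
}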
And, drifting a bit further off-topic, even to get as far as you have,
you've added overhead to every lwlock acquisition and release, even
for people who never use this functionality.
If you look at the code, you'll see that I've made very modest
modifications to LWLockRelease only. I would be extremely surprised if
the overhead was anything but noise - indeed, if it was even possible
to detect through any conventional benchmark. These are the same kind
of very modest changes made for LWLockAcquireOrWait(), and you said
nothing about that at the time. Despite the fact that you now appear
to think that that whole effort was largely a waste of time.
That's true but somewhat misleading. Textually most of the function
holds the buffer lock, but heap_prepare_insert(),
CheckForSerializableConflictIn(), RelationGetBufferForTuple(), and
XLogWrite() are the parts that do substantial amounts of computation,
and only the last of those happens while holding the buffer lock.
I've already written modifications so that I don't have to do
heap_prepare_insert() with the locks held. There is no reason to call
CheckForSerializableConflictIn() with the additional locks held
either. After all, "For a heap insert, we only need to check for
table-level SSI locks". As for RelationGetBufferForTuple(), yes, the
majority of the time it will have to do very little beyond acquiring
an exclusive lock, because it's going to get the target buffer from the
last place a heap tuple was inserted.
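
Roughly, the reordered flow looks like this - a sketch of my intent
only, not the revised patch itself; acquire_value_locks() and
release_value_locks() are invented stand-in names:

/* Expensive preparation happens before any value locks exist. */
heaptup = heap_prepare_insert(relation, tup, xid, cid, options);
/* "For a heap insert, we only need to check for table-level SSI locks" */
CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);

acquire_value_locks(...);        /* hypothetical: lock unique index values */
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
                                   InvalidBuffer, options, bistate,
                                   &vmbuffer, NULL);
/* ... physical insertion and WAL logging ... */
release_value_locks(...);        /* hypothetical */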
--
Peter Geoghegan
On Sat, Sep 14, 2013 at 3:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Well, the reality is that the promises idea hasn't been described in
remotely enough detail to compare it to what I have here. I've pointed
out plenty of problems with it.

Even if you disagree, I still think that doesn't matter in the very
least.
It matters if you care about getting this feature.
You say:
I think that the details of how this approach compare to others are
totally pertinent. For me, that's the whole point - getting towards
something that will balance all of these concerns and be acceptable.

Well, the two other people involved in the discussion so far have gone
on the record saying that the presented approach is not acceptable to
them. And you haven't started reacting to that.
Uh, yes I have. I'm not really sure what you could mean by that. What
am I refusing to address?
Yes, it's entirely possible that that could look quite different to
what I have here. I do not want to reduce all this to a question of
"is this one design acceptable or not?".But the way you're discussing it so far is exactly reducing it that way.
The fact that I was motivated to do things this way serves to
illustrate the problems generally.
If you want the discussion to be about *how* we can implement it so that
the various concerns are addressed: fsck*ing great. I am with you there.
Isn't that what we were doing? There has been plenty of commentary on
alternative approaches.
In the end, even though I have my usual strong opinions about which is the
best way, I don't care which algorithm gets pursued further. At least,
if, and only if, it has a fighting chance of getting committed. Which
this doesn't.
I don't think any design that has been described to date is free of
serious problems. Causing excessive bloat, particularly in indexes, is
a serious problem too.
The reason I ended up here is not because I didn't get the memo about
holding buffer locks across complex operations being a bad thing. At
least grant me that. I'm here because in all these years no one has
come up with a suggestion that doesn't have some very major downsides.
Like, even worse than this.

I think you're massively, massively, massively overstating the dangers
of bloat here. It's a known problem that's *NOT* made worse by any of
the other proposals, if you compare them with the loop/lock/catch
implementation of upsert that we have today as the only option.
Why would I compare it with that? That's terrible, and very few of our
users actually know about it anyway. Also, will an UPDATE followed by
an INSERT really bloat all that much anyway?
And we
*DO* have infrastructure to deal with bloat, even if it could use some
improvement. We *don't* have infrastructure for dealing with deadlocks
on lwlocks. And we're not going to get that infrastructure, because it
would even further remove the "lw" part of lwlocks.
Everything I said so far is predicated on LWLocks not deadlocking
here, so I'm not really sure why you'd say that. If I can't find a way
to prevent deadlock, then clearly the approach is doomed.
It's that it interleaves an already complex but local locking scheme -
one that required several years to become correct - with another that
is just as complex. That's an utterly horrid idea.
You're missing my point, which is that it may be possible, with
relatively modest effort, to analyze things to ensure that deadlock is
impossible - regardless of the complexities of the two systems -
because they're reasonably well encapsulated. See below, under "I'll
say it again".
Now, I can certainly understand why you wouldn't be willing to accept
that at face value. The idea isn't absurd, though. You could think of
the heap_insert() call as being under the control of the btree code
(just as, say, heap_hot_search() is), even though the code isn't at
all structured that way, and that's awkward. I'm actually slightly
tempted to structure it that way.
Is that really so different to holding an exclusive
lock on a btree buffer while holding a shared lock on a heap buffer?
Because that's what _bt_check_unique() does today.

Yes, it is different. But, in my opinion, _bt_check_unique() doing so
is a bug that needs fixing. Not something that we want to extend.
Well, I think you know that that's never going to happen. There are
all kinds of reasons why it works that way that cannot be disavowed.
My definition of a bug includes a user being affected.
At this point I am a bit confused why you are asking for review.
I am asking for us, collectively, through consensus, to resolve the
basic approach to doing this. That was something I stated right up
front, pointing out details of where the discussion had gone in the
past. That was my explicit goal. There has been plenty of discussing
on this down through the years, but nothing ever came from it.

At the moment ISTM you're not conceding on *ANY* points. That's not very
often the way to find consensus.
Really? I've conceded plenty of points. Just now I conceded a point to
Robert about insertion being blocked for inserters that want to insert
a value that isn't already locked/existing, and he didn't even raise
that in the first place. Most prominently, I've conceded that it is
entirely questionable that I hold the buffer locks for longer - before
you even responded to my original patch! I've said it many many times
many many ways. It should be heavily scrutinized. But you both seem to
be making general points along those lines, without reference to what
I've actually done. Those general points could almost to the same
extent apply to _bt_check_unique() today, which is why I have a hard
time accepting them at face value. To say that what that function does
is "a bug" is just not credible, because it's been around in
essentially the same form since at least a time when you and I were in
primary school. I'll remind you that you haven't been able to
demonstrate deadlock in a way that invalidates my approach. While of
course that's not how this is supposed to work, I've been too busy
defending myself here to get down to the business of carefully
analysing the relatively modest interactions between btree and heap
that could conceivably introduce a deadlock. Yes, the burden to prove
this can't deadlock is mine, but I thought I'd provide you with the
opportunity to prove that it can.
I'll say it again: For a deadlock, there needs to be a mutual
dependency. Provided the locking phase doesn't acquire any locks other
than buffer locks, and, during the interaction with the heap, btree
inserters (or the locking phase) cannot acquire heap locks in a way
that conflicts with other upserters, we will be fine. It doesn't
necessarily matter how complex each system individually is, because
the two meet in such a limited area (well, two areas now, I suppose),
and they only meet in one direction - there is no reciprocation where
the heap code locks or otherwise interacts with index buffers. When
the heap insertion is performed, all index value locks are already
acquired. The locking phase cannot block itself because of the
ordering of locking, but also because the locks on the heap that it
takes are only shared locks.
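
To put the same invariant pictorially (my formulation):

    value locks on btree leaf buffers (taken in one fixed order)
            |
            v
    heap buffer: exclusive for the insertion proper,
                 shared only during the locking phase

Every backend crosses the nexus in that one direction, and the heap
code never turns around and waits on an index buffer, so no cycle can
form in the waits-for graph.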
Now, this analysis is somewhat complex, and underdeveloped. But as
Robert said, there are plenty of things about locking in Postgres that
are complex and subtle. He also said that it doesn't matter if I can
prove that it won't deadlock, but I'd like a second opinion on that,
since my proof might actually be, if not simple, at least short, and therefore
may not represent an ongoing burden in the way Robert seemed to think
it would.
That argument doesn't make sense to me. You're inserting a unique
value. It completely makes sense that you can only insert one of
them.
Even if it had an advantage, not blocking *for the single unique key alone*
opens you to issues of livelocks where several backends retry because of
each other indefinitely.
See my remarks to Robert.
Whether Postgres locks up in a way that can only be resolved by kill
-9, or whether it aborts a transaction - those are, like, a couple of
orders of magnitude apart.
Not really. I can see the advantage of having the deadlock be
detectable from a defensive-coding standpoint. But index lock-ordering
inconsistencies, and the deadlocks they may cause, are not acceptable
generally.
--
Peter Geoghegan
On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
It seems to me that the nature of the problem is that there will unavoidably
be a nexus between the two parts of the code here. We can try to isolate it
as much as possible but we're going to need a bit of a compromise.
Exactly. That's why all the proposals with the exception of this one
have to date involved unacceptable bloating - that's how they try and
span the nexus.
I'll find it very difficult to accept any implementation that is going
to bloat things even worse than our upsert looping example. The only
advantage of such an implementation over the upsert example is that
it'll avoid burning through subxacts. The main reason I don't want to
take that approach is that I know it won't be accepted, because it's a
disaster. That's why the people that proposed this in various forms
down through the years haven't gone and implemented it themselves. I
do not accept that all of this is like the general situation with row
locks. I do not think that the big costs of having many dead
duplicates in a unique index can be overlooked (or perhaps the cost of
cleaning them up eagerly, which is something I'd also expect to work
very badly). That's something that's going to reverberate all over the
place. Imagine a simple, innocent-looking pattern that resulted in
there being unique indexes that became hugely bloated. It's not hard.
What I will concede (what I have conceded, actually) is that it would
be better if the locks were more granular. Now, I'm not so much
concerned about concurrent inserters inserting values that just so
happen to be values that were locked. It's more the case that I'm
worried about inserters blocking on other values that are incidentally
locked despite not already existing, that would go on the locked page
or maybe a later page. In particular, I'm concerned about the impact
on SERIAL primary key columns. Not exactly an uncommon case (though
one I'd already thought to optimize by locking last).
What I think might actually work acceptably is if we were to create an
SLRU that kept track of value-locks per buffer. The challenge there
would be to have regular unique index inserters care about them, while
having little to no impact on their regular performance. This might be
possible by having them check the buffer for external value locks in
the SLRU immediately after exclusive locking the buffer - usually that
only has to happen once per index tuple insertion (assuming no
duplicates necessitate retry). If they find their value in the SLRU,
they do something like unlock and block on the other xact and restart.
Now, obviously a lot of the details would have to be worked out, but
it seems possible.
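
A pseudocode sketch of the inserter-side check, to make that concrete
- ValueLockCheck() is invented here, and only the other calls exist
today:

LockBuffer(buf, BT_WRITE);
if (ValueLockCheck(rel, BufferGetBlockNumber(buf), itup_scankey,
                   &blocking_xid))    /* hypothetical SLRU probe */
{
    /* an external value lock covers our key: back off entirely */
    _bt_relbuf(rel, buf);             /* drop lock and pin */
    XactLockTableWait(blocking_xid);  /* block on the lock holder */
    goto restart;                     /* re-descend from the root */
}
/* otherwise proceed with _bt_check_unique() and insertion as today */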
In order for any of this to really be possible, there'd have to be
some concession made to my position, as Greg mentions here. In other
words, I'd need buy-in for the general idea of holding locks in shared
memory from indexes across heap tuple insertion (subject to a sound
deadlock analysis, of course). Some modest compromises may need to be
made around interruptibility. I'd also probably need agreement that
it's okay that value locks cannot last more than an instant (they
cannot be held indefinitely pending the end of a transaction). This
isn't something that I imagine to be too controversial, because it's
true today for a single unique index. As I've already outlined, anyone
waiting on another transaction with a would-be duplicate to commit has
very few guarantees about the order in which it'll get its second shot
relative to the order in which it initially queued up behind the
successful but
not-yet-committed inserter.
--
Peter Geoghegan
On 15 Sep 2013 10:19, "Peter Geoghegan" <pg@heroku.com> wrote:
On Sat, Sep 14, 2013 at 1:57 AM, Greg Stark <stark@mit.edu> wrote:
It seems to me that the nature of the problem is that there will
unavoidably be a nexus between the two parts of the code here. We can
try to isolate it as much as possible but we're going to need a bit of
a compromise.
In order for any of this to really be possible, there'd have to be
some concession made to my position, as Greg mentions here. In other
words, I'd need buy-in for the general idea of holding locks in shared
memory from indexes across heap tuple insertion (subject to a sound
deadlock analysis, of course).
Actually that wasn't what I meant by that.
What I meant is that there is going to be some code coupling between the
executor and btree code. That's purely a question of code structure, and
will be true regardless of the algorithm you settle on.
What I was suggesting was an API for a function that would encapsulate that
coupling. The executor would call this function which would promise to
obtain all the locks needed for both operations or give up. Effectively it
would be a special btree operation which would have special knowledge of
the executor only in that it knows that being able to get a lock on two
heap buffers is something the executor needs sometimes.
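
If I've understood correctly, the shape might be something like the
following - a wholly invented signature, offered only to make the idea
concrete:

/*
 * Hypothetical btree entry point: acquire every lock needed for the
 * uniqueness probe plus the subsequent heap operation, or give up.
 */
typedef enum UpsertLockResult
{
    UPSERTLOCK_OK,        /* all locks held; caller may proceed */
    UPSERTLOCK_CONFLICT,  /* duplicate exists; *conflictTid is set */
    UPSERTLOCK_RETRY      /* could not lock safely; caller restarts */
} UpsertLockResult;

extern UpsertLockResult bt_lock_for_upsert(Relation heapRel,
                                           Relation indexRel,
                                           Datum *values,
                                           bool *isnull,
                                           ItemPointer conflictTid);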
I'm not sure this fits well with your syntax since it assumes the update
will happen at the same time as the index lookup but as I said I haven't
read your patch, maybe it's not incompatible. I'm writing all this on my
phone so it's mostly just pie-in-the-sky brainstorming. I'm sorry if it's
entirely irrelevant.
Peter Geoghegan <pg@heroku.com> wrote:
There is no reason to call CheckForSerializableConflictIn() with
the additional locks held either. After all, "For a heap insert,
we only need to check for table-level SSI locks".
You're only talking about not covering that call with a *new*
LWLock, right? We put some effort into making sure that such calls
were only inside of LWLocks which were needed for correctness.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company