INSERT ... ON CONFLICT {UPDATE | IGNORE}
Attached WIP patch extends the INSERT statement, adding a new ON
CONFLICT {UPDATE | IGNORE} clause. This allows INSERT statements to
perform UPSERT operations (if you want a more formal definition of
UPSERT, I refer you to my pgCon talk's slides [1]http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf, ("Goals for UPSERT in Postgres"), or the thread in
which I delineated the differences between SQL MERGE and UPSERT [2]/messages/by-id/CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1_FCVo5Xm7hrEL34kw@mail.gmail.com).
The patch builds on previous work in this area, and incorporates
feedback from Kevin and Andres.
Overview
=======
Example usage:
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT UPDATE
SET val = 'update';
Essentially, the implementation has all stages of query processing
track some auxiliary UPDATE state. So, for example, during parse
analysis, UPDATE transformation occurs in an ad-hoc fashion tightly
driven by the parent INSERT, but using the existing infrastructure
(i.e. transformStmt()/transformUpdateStmt() is called, and is
insulated from having to care about the feature as a special case).
There are some restrictions on what this auxiliary update may do, but
FWIW there are considerably fewer than those that the equivalent MySQL
or SQLite feature imposes on their users. All of the following SQL
queries are valid with the patch applied:
-- Nesting within wCTE:
WITH t AS (
INSERT INTO z SELECT i, 'insert'
FROM generate_series(0, 16) i
ON CONFLICT UPDATE SET v = v || 'update' -- use of
operators/functions in targetlist
RETURNING * -- only projects inserted tuples, never updated tuples
)
SELECT * FROM t JOIN y ON t.k = y.a ORDER BY a, k;
-- IGNORE variant:
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT IGNORE;
-- predicate within UPDATE auxiliary statement (row is still locked
when the UPDATE predicate isn't satisfied):
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT UPDATE
WHERE val != 'delete';
As with SQL MERGE (at least as implemented in other systems),
subqueries may not appear within the UPDATE's targetlist, nor may they
appear within the special WHERE clause. But the "INSERT part" of the
query has no additional limitations, so you may for example put
subqueries within a VALUES() clause, or INSERT...SELECT...ON CONFLICT
UPDATE... just as you'd expect. INSERT has been augmented with a new
clause, but that clause does not unreasonably fail to play nice with
any other aspect of insertion. (Actually, that isn't quite true, since
at least for now table inheritance, updatable views and foreign tables
are unsupported. This can be revisited.)
I think that without initially realizing it, I copied the SQLite
syntax [3]https://sqlite.org/lang_conflict.html. However, unlike with that SQLite feature, CONFLICT only
refers to a would-be duplicate violation, and not a violation of any
other kind of constraint.
How this new approach works (Executor and Optimizer stuff)
============================================
During the execution of the parent ModifyTable, a special auxiliary
subquery (the UPDATE ModifyTable) is considered as a special case.
This is not a subplan of the ModifyTable node in the conventional
sense, and so does not appear within EXPLAIN output. However, it is
more or less independently planned, and entirely driven by the INSERT
ModifyTable. ExecModifyTable() is never called with this special
auxiliary plan state passed directly. Rather, its parent manages the
process as the need arises. ExecLockUpdateTuple() locks and
potentially updates tuples, using the EvalPlanQual() mechanism (even
at higher isolation levels, with appropriate precautions).
The per-tuple expression context of the auxiliary query/plan is used
with EvalPlanQual() from within ExecLockUpdateTuple() (the new routine
tasked with locking and updating on conflict). There is a new
ExecUpdate() call site within ExecLockUpdateTuple(). Given the
restrictions necessarily imposed on this pseudo-rescanning
(principally the outright rejection of anything that necessitates
PARAM_EXEC parameters during planning), this is safe, as far as I'm
aware. It is convenient to be able to re-use infrastructure in such a
way as to more or less handle the UPDATE independently, driven by the
INSERT, except for execution which is more directly handled by the
INSERT (i.e. there is no ExecModifyTable() call in respect of this new
auxiliary ModifyTable plan). Granted, it is kind of bizarre that the
auxiliary query may have a more complex plan than is necessary for our
purposes, but it doesn't actually appear to be a problem when
"rescanning" (Like a SELECT FOR UPDATE/FOR SHARE's node, we call
EvalPlanQualSetTuple() directly). It is likely worthwhile to teach the
optimizer that we really don't care about how the one and only base
rel within the UPDATE auxiliary subquery (the target table) is
scanned, if only to save a few cycles. I have (temporarily) hacked the
optimizer to prevent index-only scans, which are problematic here, by
adding disable_cost when a query parse tree that uses the feature is
seen. Although what I've done is a temporary kludge, the basic idea of
forcing a particular type of relation scan has a precedent: UPDATE
WHERE CURRENT OF artificially forces a TID scan, because only a TID
scan will work correctly there. I couldn't come up with a convenient
way to artificially inject disable_cost into alternative scan types,
in the less invasive style of isCurrentOf, because there is no
convenient qual to target within cost_qual_eval().
As in previous incarnations, we lock each tuple (although, of course,
only with the UPDATE variant). We may or may not also actually proceed
with the update, depending on whether or not the user-specified
special update predicate (if any) is satisfied. But if we do,
EvalPlanQual() is (once the tuple is locked) only ever evaluated on a
conclusively committed and locked-by-us conflict tuple as part of the
process of updating, even though it's possible for the UPDATE
predicate to be satisfied where conceivably it would not be satisfied
by the tuple version actually visible to the command's MVCC snapshot.
I think this is the correct behavior. We all seem to be in agreement
that we should update at READ COMMITTED if *no* version of the tuple
is visible. It seems utterly arbitrary to me to suggest that on the
one hand it's okay to introduce one particular "MVCC violation", but
not another equivalent one. The first scenario is one in which we
update despite our update's (or rather insert's) "predicate" not being
satisfied (according to our MVCC snapshot). The second scenario is one
in which the same "predicate" is also not satisfied according to our
MVCC snapshot, but in a slightly different way. Why bother introducing
a complicated distinction, if it's a distinction without a difference?
I'd rather have a behavior that is consistent, easy to reason about,
and easy to explain. And so, the predicate is considered once, after
conclusively locking a conflict tuple.
It feels natural and appropriate to me that if the special UPDATE qual
isn't satisfied, we still lock the tuple. After all, in order to make
a conclusive determination about the qual not being satisfied, we need
to lock the tuple. This happens to insulate ExecUpdate() from having
to care about "invisible tuples", which are now possible (although we
still throw an error, just with a useful error message that phrases
the problem in reference to this new feature).
Of course, at higher isolation levels serialization errors are thrown
when something inconsistent with the higher level's guarantees would
otherwise need to occur (even for the IGNORE variant). Still,
interactions with SSI, and preserving the guarantees of SSI should
probably be closely considered by a subject matter expert.
Omission
=======
The patch currently lacks a way of referencing datums rejected for
insertion when updating. The way MySQL handles the issue seems
questionable. They allow you to do something like this:
INSERT INTO upsert (key, val) VALUES (1 'val') ON DUPLICATE KEY UPDATE
val = VALUES(val);
The implication is that the updated value comes from the INSERT's
VALUES() list, but emulating that seems like a bad idea. In general,
at least with Postgres it's entirely possible that values rejected
differ from the values appearing in the VALUES() list, due to the
effects of before triggers. I'm not sure whether or not we should
assume equivalent transformations during any UPDATE before triggers.
This is an open item. I think it makes sense to deal with it a bit later.
"Value locking"
===========
To date, on-list discussion around UPSERT has almost exclusively
concerned what I've called "value locking"; the idea of locking values
in unique indexes in the abstract (to establish the right to insert
ahead of time). There was some useful discussion on this question
between myself and Heikki back around December/January. Ultimately, we
were unable to reach agreement on an approach and discussion tapered
off. However, Heikki did understand the concerns that informed by
design. He recognized the need to be able to easily *release* value
locks, so as to avoid "unprincipled deadlocks", where under high
concurrency there are deadlocks between sessions that only UPSERT a
single row at a time. I'm not sure how widely appreciated this point
is, but I believe that Heikki appreciates it. It is a very important
point in my opinion. I don't want an implementation that is in any way
inferior to the "UPSERT looping subxact" pattern does (i.e. the plpsql
thing that the docs suggest).
When we left off, Heikki continued to favor an approach that involved
speculatively inserting heap tuples, and then deleting them in the
event of a conflict. This design was made more complicated when the
need to *release* value locks became apparent (Heikki ended up making
some changes to HeapTupleSatisfiesDirty(), as well as sketching a
design for what you might call a "super delete", where xmin can be set
to InvalidTransactionId for speculatively-inserted heap tuples). After
all, it wasn't as if we could abort a subxact to release locks, which
is what the "UPSERT looping subxact" pattern does. I think it's fair
to say that that design became more complicated than initially
anticipated [4]/messages/by-id/CAM3SWZQoArVQGMi=v-jk3sBjsPg+wdjeUkM_6L5TZG_i9pyGzQ@mail.gmail.com [5]/messages/by-id/52B4AAF0.5090806@vmware.com.
Anyway, the greater point here is that fundamentally, AFAICT Heikki
and I were in agreement. Once you buy into the idea that we must avoid
holding on to "value locks" of whatever form - as Heikki evidently did
- then exactly what form they take is ultimately only a detail.
Granted, it's a very important detail, but a detail nonetheless. It
can be discussed entirely independently of all of this new stuff, and
thank goodness for that.
If anyone finds my (virtually unchanged) page heavyweight lock based
value locking approach objectionable, I ask that the criticism be
framed in a way that makes a sharp distinction between each of the
following:
1. You don't accept that value locks must be easily released in the
event of a conflict. Is anyone in this camp? It's far from obvious to
me what side of this question Andres is on at this stage, for example.
Robert might have something to say here too.
2. Having taken into account the experience of myself and Heikki, and
all that is implied by taking that approach ***while avoiding
unprincipled deadlocks***, you continue to believe that an approach
based on speculative heap insertion, or some alternative scheme is
better than what I have done to the nbtree code here, or you otherwise
dislike something about the proposed value locking scheme. You accept
that value locks must be released and released easily in the event of
a conflict, but like Heikki you just don't like what I've done to get
there.
Since we can (I believe) talk about the value locking aspect and the
rest of the patch independently, we should do so...unless you're in
camp 1, in which case I guess that we'll have to thrash it out.
Syntax, footguns
=============
As I mentioned, I have incorporated feedback from Kevin Grittner. You
may specify a unique index to merge on from within the INSERT
statement, thus avoiding the risk of inadvertently having the update
affect the wrong tuple due to the user failing to consider that there
was a would-be unique violation within some other unique index
constraining some other attribute. You may write the DML statement
like this:
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT WITHIN
upsert_pkey UPDATE SET val = 'update';
I think that there is a good chance that at least some people will
want to make this mandatory. I guess that's fair enough, but I
*really* don't want to *mandate* that users specify the name of their
unique index in DML for obvious reasons. Perhaps we can come up with a
more tasteful syntax that covers all interesting cases (consider the
issues with partial unique indexes and before triggers for example,
where a conclusion reached about which index to use during parse
analysis may subsequently be invalidated by user-defined code, or
ambiguous specifications in the face of overlapping attributes between
two unique composite indexes, etc). The Right Thing is far from
obvious, and there is very little to garner from other systems, since
SQL MERGE promises essentially nothing about concurrency, both as
specified by the standard and in practice. You don't need a unique
index at all, and as I showed in my pgCon talk, there are race
conditions even for a trivial UPSERT operations in all major SQL MERGE
implementations.
Note that making mandatory (via syntax) merging on one particular
unique index buys the implementation no useful leeway. Just for
example, the unprincipled deadlocks test case that illustrated the
problem with early "promise tuple" style approaches to value locking
[6]: /messages/by-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com
whether or not this should be mandatory is just a detail of the
feature's high level design, as opposed to something expected to
significantly influence the implementation.
Testing, performance
===============
As you'd expect, I've included both isolation tests and regression
tests covering a reasonable variety of cases. In addition, stress
testing is an important part of my testing strategy. Reviewers are
encouraged to try out these test bash scripts:
https://github.com/petergeoghegan/upsert
(Interested hackers should request collaborator status on that Github
project from me privately. I welcome new, interesting test cases.)
The performance of the patch seems quite good, and is something that
these stress-testing bash scripts also test. Upserts are compared
against "equivalent" inserts when we know we'll never update, and
against "equivalent" updates when we know we'll never insert.
On an 8 core test server, I can sustain ~90,000 ordinary insert
transactions per second on an unlogged table defined as follows:
create unlogged table foo
(
merge serial primary key,
b int4,
c text
);
In all cases pgbench uses 8 clients (1 per CPU core).
With "equivalent" upserts, it's about ~66,000 TPS. But this is a
particularly unsympathetic case, because I've deliberately exaggerated
the effects of heavyweight lock contention on leaf pages by using a
serial primary key. Plus, there's the additional planning and parsing
overhead.
When comparing updating with updating upserting, it's a similar story.
100,000 tuples are pre-inserted in each case. I can sustain ~98,000
TPS with plain updates, or ~70,000 TPS with "equivalent" upserts.
B-Tree index page heavyweight lock contention probably explains some
of the difference between "UPSERT inserts" and "UPSERT updates".
Interlocking with VACUUM, race conditions
===============================
In previous revisions, when we went to lock + update a tuple, no
"value locks" were held, and neither were any B-Tree page buffer pins,
because they were both released at the same time (recall that I call
my heavyweight lock on B-Tree leaf pages a value lock). We still do
that (unprincipled deadlocks are our only alternative), but now hold
on to the pin for longer, until after tuple locking. Old versions of
this patch used to sit on the B-Tree buffer pin to prevent concurrent
deletion only as long as value locks were held, but maybe it isn't
good enough to sit on the pin until before we lock/update, as value
locks are released: dropping the pin implies that the heap tuple can
physically go away, and in general the same TID may then contain
anything. We may have to interlock against vacuum by sitting on the
B-Tree buffer pin (but not the value lock) throughout locking +
update. That makes it impossible for the heap tuple slot to fail to
relate to the tuple from the B-Tree, that is under consideration for
locking/updating. Recall that we aren't quite dealing with MVCC
semantics here, since in READ COMMITTED mode we can lock a
conclusively committed + visible tuple with *no* version visible to
our command's MVCC snapshot. Therefore, it seems worth considering the
possibility that the nbtree README's observations on the necessity of
holding a pin to interlock against VACUUM (for non-MVCC snapshots)
apply.
In this revision we have two callbacks (or two calls to the same
callback, with different effects): One to release value locks early,
to avoid unprincipled deadlocks, and a second to finally release the
last unneeded buffer pin.
Recall that when we find a conflict (within _bt_checkunique()), it
must be conclusively committed and visible to new MVCC snapshots; we
know at that juncture that it's live. The concern is that it might be
deleted *and* garbage collected in the interim between finding the
conflict tuple, and locking it (in practice this interim period is
only an instant).
This is probably too paranoid, though: the fact that the upserter's
transaction is running ought to imply that GetOldestXmin() returns an
XID sufficient to prevent this. OTOH, I'm not sure that there exists
anything that looks like a precedent for relying on blocking vacuum in
this manner, and it might turn out to be limiting to rely on this.
And, I hasten to add, my fix (sitting on a B-Tree pin throughout row
locking) is in another way perhaps not paranoid enough: Who is to say
that our conflicting value is on the same B-Tree leaf page as our
value lock? If might not be, since _bt_checkunique() looks at later
B-Tree pages (the value locked page is merely "the first leaf page the
value could be on"). Pinning the heavyweight lock page's buffer is
certainly justified by the need for non-speculative inserters to see a
flag that obligates them to acquire the heavyweight page lock
themselves (see comments in patch for more), but this other reason is
kind of dubious.
In other words: I'm relying on the way VACUUM actually works to
prevent premature garbage collection. It's possible to imagine a world
in which HeapTupleSatisfiesVacuum() is smart enough to realize that
the tuple UPSERT wants to lock is not visible to anyone (assuming MVCC
semantics, etc), and never can be. I've tentatively added code to keep
a buffer pin for longer, but that's probably not good enough if we
assume that it's necessary at all. Basically, I want to be comfortable
about my rationale for it being okay that a "non-MVCC" "index scan"
doesn't hold a pin, but right now I'm not. I was conflicted on whether
or not I should include the "unpin later" logic at all; for now I've
left it in, if only as a placeholder. Needless to say, if there is a
race condition you can take it that it's very difficult to isolate.
FWIW, somewhat extensive stress-testing has revealed no bugs that you
might associate with these problems, with and without extended buffer
pinning, and with artificial random sleeps added at key points in an
effort to make any race condition bugs manifest themselves. I have
made a concerted effort to break the patch in that way, and I'm now
running out of ideas. Running the stress tests (with random delays in
key points in the code) for several days reveals no bugs. This is on
the same dedicated 8 core server, with plenty of concurrency.
It's probably a good idea to begin using my B-Tree verification tool
[7]: /messages/by-id/CAM3SWZRtV+xmRWLWq6c-x7czvwavFdwFi4St1zz4dDgFH4yN4g@mail.gmail.com -- Peter Geoghegan
MVCC, and will only detect the violation of invariants that are
localized to the B-Tree code, at least at the moment.
Open items
=========
I already mentioned the inability to reference rejected rows in an
UPDATE, as well as my unease about VACUUM interlocking, both of which
are open item. Also, some of the restrictions that I already mentioned
- on updatable views, inheritance, and foreign tables - are probably
unnecessary. We should be able to come with reasonable behavior for at
least some of those.
Patch
====
I thought that I went too long without posting something about all of
this to the list to get feedback, and so I decided to post this WIP
patch set. I've tried to break it up into pieces, but it isn't all
that suitable for representing as cumulative commits. I've also tried
to break up the discussion usefully (the question of how everything
fits together at a high level can hopefully be discussed separately
from the question of how "value locks" are actually implemented).
Thoughts?
[1]: http://www.pgcon.org/2014/schedule/attachments/327_upsert_weird.pdf, ("Goals for UPSERT in Postgres")
("Goals for UPSERT in Postgres")
[2]: /messages/by-id/CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1_FCVo5Xm7hrEL34kw@mail.gmail.com
[3]: https://sqlite.org/lang_conflict.html
[4]: /messages/by-id/CAM3SWZQoArVQGMi=v-jk3sBjsPg+wdjeUkM_6L5TZG_i9pyGzQ@mail.gmail.com
[5]: /messages/by-id/52B4AAF0.5090806@vmware.com
[6]: /messages/by-id/CAM3SWZShbE29KpoD44cVc3vpZJGmDer6k_6FGHiSzeOZGmTFSQ@mail.gmail.com
[7]: /messages/by-id/CAM3SWZRtV+xmRWLWq6c-x7czvwavFdwFi4St1zz4dDgFH4yN4g@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan
Attachments:
0001-Make-UPDATE-privileges-distinct-from-INSERT-privileg.patchtext/x-patch; charset=US-ASCII; name=0001-Make-UPDATE-privileges-distinct-from-INSERT-privileg.patchDownload+98-40
0004-Internal-documentation-for-INSERT-.-ON-CONFLICT-UPDA.patchtext/x-patch; charset=US-ASCII; name=0004-Internal-documentation-for-INSERT-.-ON-CONFLICT-UPDA.patchDownload+244-11
0003-Tests-for-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patchtext/x-patch; charset=US-ASCII; name=0003-Tests-for-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patchDownload+485-6
0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patchtext/x-patch; charset=US-ASCII; name=0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patchDownload+2058-163
On 08/28/2014 04:43 AM, Peter Geoghegan wrote:
-- Nesting within wCTE:
WITH t AS (
INSERT INTO z SELECT i, 'insert'
FROM generate_series(0, 16) i
ON CONFLICT UPDATE SET v = v || 'update' -- use of
operators/functions in targetlist
RETURNING * -- only projects inserted tuples, never updated tuples
)
SELECT * FROM t JOIN y ON t.k = y.a ORDER BY a, k;
Personally I would find it surprising if RETURNING did not also return
the updated tuples. In many use cases for upsert the user does not care
if the row was new or not.
What I think would be useful is if all tuples were returned but there
was some way to filter out only the inserted ones.
--
Andreas Karlsson
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 28, 2014 at 7:29 AM, Andreas Karlsson <andreas@proxel.se> wrote:
Personally I would find it surprising if RETURNING did not also return the
updated tuples. In many use cases for upsert the user does not care if the
row was new or not.
I'm not attached to that particular behavior, but it does seem kind of
similar to the behavior of BEFORE triggers, where a NULL return value
("do nothing") will also cause RETURNING to not project the tuple.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 08/28/2014 09:05 PM, Peter Geoghegan wrote:
On Thu, Aug 28, 2014 at 7:29 AM, Andreas Karlsson <andreas@proxel.se> wrote:
Personally I would find it surprising if RETURNING did not also return the
updated tuples. In many use cases for upsert the user does not care if the
row was new or not.I'm not attached to that particular behavior, but it does seem kind of
similar to the behavior of BEFORE triggers, where a NULL return value
("do nothing") will also cause RETURNING to not project the tuple.
I see. So we have three cases where we may or may not want to project a
tuple.
1) The tuple was inserted
2) We got a conflict and updated the tuple
3) We got a conflict but skipped updating the tuple
My personal intuition was that (1) and (2) would be returned but not
(3). But I am not sure if that is the most useful behavior.
--
Andreas Karlsson
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 27, 2014 at 7:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
There are some restrictions on what this auxiliary update may do, but
FWIW there are considerably fewer than those that the equivalent MySQL
or SQLite feature imposes on their users.
I realized that I missed a few cases here. For one thing, the posted
patch fails to arrange for the UPDATE post-parse-analysis tree
representation to go through the rewriter stage (on the theory that
user-defined rules shouldn't be able to separately affect the
auxiliary UPDATE query tree), but rewriting is at least necessary so
that rewriteTargetListIU() can expand a "SET val = DEFAULT"
targetlist, as well as normalize the ordering of the UPDATE's tlist.
Separately, the patch fails to defend against certain queries that
ought to be disallowed, where a subselect is specified with a subquery
expression in the auxiliary UPDATE's WHERE clause.
These are garden-variety bugs that aren't likely to affect the kind of
high-level design discussion that I'm looking for here. I'll post a
fixed version in a few days time.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 28, 2014 at 8:05 PM, Peter Geoghegan <pg@heroku.com> wrote:
I realized that I missed a few cases here. For one thing, the posted
patch fails to arrange for the UPDATE post-parse-analysis tree
representation to go through the rewriter stage (on the theory that
user-defined rules shouldn't be able to separately affect the
auxiliary UPDATE query tree), but rewriting is at least necessary so
that rewriteTargetListIU() can expand a "SET val = DEFAULT"
targetlist, as well as normalize the ordering of the UPDATE's tlist.
Separately, the patch fails to defend against certain queries that
ought to be disallowed, where a subselect is specified with a subquery
expression in the auxiliary UPDATE's WHERE clause.
Attached revision fixes all of these issues. I've added regression
tests for each bug, too, although all changes are rebased into my
original commits.
I decided to explicitly rely on a simpler approach to VACUUM
interlocking. I no longer bother holding on to a buffer pin for a
period longer than the period that associated "value locks" are held,
which was something I talked about at the start of this thread. There
is a note on this added to the nbtree README, just after the master
branch's current remarks on B-Tree VACUUM interlocking.
I've also pushed the responsibility for supporting this new feature on
foreign tables onto FDWs themselves. The only writable FDW we
currently ship, postgres_fdw, lacks support for the new feature, but
this can be revisited in due course. My impression is that the task of
adding support is not quite a straightforward matter of adding a bit
more deparsing logic, but also isn't significantly more difficult than
that.
--
Peter Geoghegan
Attachments:
On Wed, Aug 27, 2014 at 10:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
Example usage:
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT UPDATE
SET val = 'update';
I think that syntax is a dramatic improvement over your previous
proposals. The only part I don't entire like is this:
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT WITHIN
upsert_pkey UPDATE SET val = 'update';
It seems to me that it would be better to specify a conflicting column
set rather than a conflicting index name.
I don't have much in the way of comments about the implementation, at
least not right at the moment, but...
Essentially, the implementation has all stages of query processing
During the execution of the parent ModifyTable, a special auxiliary
subquery (the UPDATE ModifyTable) is considered as a special case.
This is not a subplan of the ModifyTable node in the conventional
sense, and so does not appear within EXPLAIN output.
...that sounds wonky.
I already mentioned the inability to reference rejected rows in an
UPDATE, as well as my unease about VACUUM interlocking, both of which
are open item. Also, some of the restrictions that I already mentioned
- on updatable views, inheritance, and foreign tables - are probably
unnecessary. We should be able to come with reasonable behavior for at
least some of those.
If you've noted my comments on the UPDATE/DELETE .. ORDER BY .. LIMIT
thread, you won't be surprised to hear that I think those restrictions
will need to be lifted - especially for inheritance, but probably the
others as well.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 3, 2014 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Essentially, the implementation has all stages of query processing
During the execution of the parent ModifyTable, a special auxiliary
subquery (the UPDATE ModifyTable) is considered as a special case.
This is not a subplan of the ModifyTable node in the conventional
sense, and so does not appear within EXPLAIN output....that sounds wonky.
Which part? It certainly wouldn't be helpful if the (say) auxiliary
plan's "sequential scan" appeared within EXPLAIN output. That's just
an implementation detail. Note that the structure of the plan is
highly restricted, since it needs to be "driven by the insert" (or,
rather, the insert's conflicts, including conflicts not visible to the
command's MVCC snapshot). There won't be any interesting variation in
the plan. Although, that said, the implementation should probably
display any "Filter: ..." conditions implied by the special UPDATE
qual.
If you've noted my comments on the UPDATE/DELETE .. ORDER BY .. LIMIT
thread, you won't be surprised to hear that I think those restrictions
will need to be lifted - especially for inheritance, but probably the
others as well.
Sure.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 3, 2014 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT WITHIN
upsert_pkey UPDATE SET val = 'update';It seems to me that it would be better to specify a conflicting column
set rather than a conflicting index name.
I'm open to pursuing that, provided there is a possible implementation
that's robust against things like BEFORE triggers that modify
constrained attributes. It must also work well with partial unique
indexes. So I imagine we'd have to determine a way of looking up the
unique index only after BEFORE triggers fire. Unless you're
comfortable with punting on some of these cases by throwing an error,
then all of this is actually surprisingly ticklish. You've already
expressed concerns about the feature not playing nice with existing,
peripheral features though.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 3, 2014 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Sep 3, 2014 at 9:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Essentially, the implementation has all stages of query processing
During the execution of the parent ModifyTable, a special auxiliary
subquery (the UPDATE ModifyTable) is considered as a special case.
This is not a subplan of the ModifyTable node in the conventional
sense, and so does not appear within EXPLAIN output....that sounds wonky.
Which part? It certainly wouldn't be helpful if the (say) auxiliary
plan's "sequential scan" appeared within EXPLAIN output. That's just
an implementation detail. Note that the structure of the plan is
highly restricted, since it needs to be "driven by the insert" (or,
rather, the insert's conflicts, including conflicts not visible to the
command's MVCC snapshot). There won't be any interesting variation in
the plan. Although, that said, the implementation should probably
display any "Filter: ..." conditions implied by the special UPDATE
qual.
I think there shouldn't be any plan nodes in the system that don't get
displayed by explain. If you're using a plan node for something, and
think it shouldn't be displayed by explain, then either (1) you are
wrong or (2) you are abusing the plan node.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 4, 2014 at 8:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I think there shouldn't be any plan nodes in the system that don't get
displayed by explain. If you're using a plan node for something, and
think it shouldn't be displayed by explain, then either (1) you are
wrong or (2) you are abusing the plan node.
Maybe. I admit that I'm not entirely confident that the representation
of the auxiliary state during planning and execution is ideal.
However, it sure is convenient to be able to separately plan the
auxiliary query as a subquery, and not have to specially fish it out
of the subplan list later. Maybe we should add a mechanism that
essentially generates an equivalent, single ModifyTable plan. Or maybe
that would be adding a lot of code for no tangible benefit. I don't
see much point in making one ModifyTable node pull up from the other
for the benefit of this feature (which is another thing entirely to
having there be a single ModifyTable plan). For now, I'm glad to have
something that will allow us to drive discussion of the feature to the
next level. I don't have a good enough understanding of the optimizer
to be able to say with confidence what we should do, or to be able to
see the big picture of making any particular trade-off. It's not an
immediate concern, though.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 4, 2014 at 11:55 AM, Peter Geoghegan <pg@heroku.com> wrote:
It's not an immediate concern, though.
My immediate concern is to get some level of buy-in about how
everything fits together at a high level. Separately, as discussed in
my opening mail, there is the question of how value locking should
ultimately be implemented. These are two orthogonal questions, or are
pretty close to orthogonal. That helps. It also helps that people have
stopped being confused by the term "value locking" (I think).
I'm tempted to believe that the silence on the question of how things
fit together (such as the lack of discussion of my pgCon talk's
characterization of a "pick any 2" trade-off) means that that's
because everyone agrees with that. That seems pretty naive, though,
because a lot of the issues are very subtle. I think that various
interested people, including Robert and Andres have yet to make their
minds up on that. I'm not sure what Tom thinks of it.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Aug 27, 2014 at 7:43 PM, Peter Geoghegan <pg@heroku.com> wrote:
Omission
=======The patch currently lacks a way of referencing datums rejected for
insertion when updating.
Attached revision of the patch set (which I'll call v1.2) adds this
capability in a separate commit. It now becomes possible to add a
CONFLICTING expression within the ON CONFLICT UPDATE targetlist or
predicate. Example use:
"""
postgres=# CREATE TABLE upsert(key int4 PRIMARY KEY, val text);
CREATE TABLE
postgres=# INSERT INTO upsert VALUES(1, 'Giraffe');
INSERT 0 1
postgres=# SELECT * FROM upsert;
key | val
-----+---------
1 | Giraffe
(1 row)
postgres=# INSERT INTO upsert VALUES(1, 'Bear'), (2, 'Lion') ON
CONFLICT UPDATE SET val = CONFLICTING(val);
INSERT 0 1
postgres=# SELECT * FROM upsert;
key | val
-----+------
1 | Bear
2 | Lion
(2 rows)
"""
Note that the effects of BEFORE INSERT triggers are carried here,
which I slightly favor over the alternative of not having it work that
way.
I've also expanded upon my explanation for the structure of the query
tree and plan within (revised/rebased versions of) earlier commits. I
am clearer on why there is a special subquery planning step for the
auxiliary UPDATE, rather than making the UPDATE directly accessible as
a subquery within the post-parse-analysis query tree. Basically, the
optimizer has no basis for understanding that a DML sublink isn't
optimizable. It'll try to pull-up the subquery and so on, which of
course does not and cannot work. Whereas treating it as an
independently planned subquery of the top-level query, kind of like a
data-modifying CTE makes sense (with such CTEs, the executor is
prepared for the possibility that not all rows will be pulled up - so
there too, the executor drives execution more directly than makes
sense when not dealing with DML: it finishes off the data-modifying
CTE's DML for any still-unconsumed tuples, within
ExecPostprocessPlan()).
It's certainly possible that a more unified representation makes sense
(i.e. one ModifyTable plan, likely still having seperate INSERT/UPDATE
representations at earlier stages of query processing), but that would
require serious refactoring of the representation of ModifyTable
operations -- just for example, consider the need for a
unified-though-separate targetlist, one for the INSERT part, the other
for the UPDATE part. For now, I continue to find it very convenient to
represent the UPDATE as a selectively executed, auxiliary, distinct
ModifyTable plan, rather than adding a subquery rangetable directly
during parse analysis.
There is another significant change. In this revision, I am at least
"honest" about the plan representation within EXPLAIN:
"""
postgres=# EXPLAIN ANALYZE INSERT INTO upsert VALUES(1, 'Bear'), (2,
'Lion') ON CONFLICT UPDATE SET val = CONFLICTING(val);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Insert on upsert (cost=0.00..0.03 rows=2 width=36) (actual
time=0.115..0.115 rows=0 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=36)
(actual time=0.003..0.005 rows=2 loops=1)
-> Conflict Update on upsert (cost=0.00..22.30 rows=1230
width=36) (actual time=0.042..0.051 rows=0 loops=1)
-> Seq Scan on upsert (cost=0.00..22.30 rows=1230 width=36)
(never executed)
Planning time: 0.065 ms
Execution time: 0.158 ms
(6 rows)
postgres=# EXPLAIN ANALYZE INSERT INTO upsert VALUES(1, 'Bear'), (2,
'Lion') ON CONFLICT UPDATE SET val = CONFLICTING(val) where key = 2;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Insert on upsert (cost=0.00..0.03 rows=2 width=36) (actual
time=0.075..0.075 rows=0 loops=1)
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=36)
(actual time=0.001..0.002 rows=2 loops=1)
-> Conflict Update on upsert (cost=4.16..8.17 rows=1 width=36)
(actual time=0.012..0.026 rows=0 loops=1)
-> Bitmap Heap Scan on upsert (cost=4.16..8.17 rows=1
width=36) (never executed)
Recheck Cond: (key = 2)
-> Bitmap Index Scan on upsert_pkey (cost=0.00..4.16
rows=1 width=0) (never executed)
Index Cond: (key = 2)
Planning time: 0.090 ms
Execution time: 0.125 ms
(9 rows)
"""
The second query gets a bitmap scan because plain index scans have
been disabled for the UPDATE (a temporary kludge), since index-only
scans can break things - IndexOnlyRecheck() throws an error. Not quite
sure why the optimizer doesn't care about resjunk for the UPDATE,
which is presumably why in general regular updates never use
index-only scans. Since I think the actual auxiliary plan generation
needs work, so as to not have uselessly complicated plans, I didn't
try too hard to figure that out. This plan structure is not
acceptable, of course, but maybe almost the same thing would be
acceptable if the auxiliary plan shown here wasn't unnecessarily
complex - if we forced a simple pseudo-scan placeholder, without
wasting optimizer cycles, somewhat in the style of WHERE CURRENT OF.
This is something discussed in newly expanded comments within
planner.c. I would have made the optimizer produce a suitably simple
plan myself, but I don't have a good enough understanding of it to
figure out how (at least in a reasonable amount of time). Pointers on
how this might be accomplished are very welcome.
With this addition, the feature is functionally complete. That just
leaves the small matter of how it has been implemented. :-)
This is still clearly a work in progress implementation, with design
trade-offs that are very much in need of fairly high level discussion.
--
Peter Geoghegan
Attachments:
0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch.gzapplication/x-gzip; name=0002-Support-INSERT-.-ON-CONFLICT-UPDATE-IGNORE.patch.gzDownload
On 28 August 2014 03:43, Peter Geoghegan <pg@heroku.com> wrote:
The patch currently lacks a way of referencing datums rejected for
insertion when updating. The way MySQL handles the issue seems
questionable. They allow you to do something like this:INSERT INTO upsert (key, val) VALUES (1 'val') ON DUPLICATE KEY UPDATE
val = VALUES(val);The implication is that the updated value comes from the INSERT's
VALUES() list, but emulating that seems like a bad idea. In general,
at least with Postgres it's entirely possible that values rejected
differ from the values appearing in the VALUES() list, due to the
effects of before triggers. I'm not sure whether or not we should
assume equivalent transformations during any UPDATE before triggers.This is an open item. I think it makes sense to deal with it a bit later.
IMHO it is impossible to know if any of the other code is correct
until we have a clear and stable vision of what the command is
supposed to perform.
The inner workings are less important than what the feature does.
FWIW, the row available at the end of all BEFORE triggers is clearly
the object we should be manipulating, not the original VALUES()
clause. Otherwise this type of INSERT would behave differently from
normal INSERTs. Which would likely violate RLS, if nothing else.
As I mentioned, I have incorporated feedback from Kevin Grittner. You
may specify a unique index to merge on from within the INSERT
statement, thus avoiding the risk of inadvertently having the update
affect the wrong tuple due to the user failing to consider that there
was a would-be unique violation within some other unique index
constraining some other attribute. You may write the DML statement
like this:INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT WITHIN
upsert_pkey UPDATE SET val = 'update';I think that there is a good chance that at least some people will
want to make this mandatory. I guess that's fair enough, but I
*really* don't want to *mandate* that users specify the name of their
unique index in DML for obvious reasons. Perhaps we can come up with a
more tasteful syntax that covers all interesting cases (consider the
issues with partial unique indexes and before triggers for example,
where a conclusion reached about which index to use during parse
analysis may subsequently be invalidated by user-defined code, or
ambiguous specifications in the face of overlapping attributes between
two unique composite indexes, etc). The Right Thing is far from
obvious, and there is very little to garner from other systems, since
SQL MERGE promises essentially nothing about concurrency, both as
specified by the standard and in practice. You don't need a unique
index at all, and as I showed in my pgCon talk, there are race
conditions even for a trivial UPSERT operations in all major SQL MERGE
implementations.
Surely if there are multiple unique indexes then the result row must
be validated against all unique indexes before it is allowed at all?
The only problem I see is if the newly inserted row matches one row on
one unique value and a different row on a different unique index.
Turning the INSERT into an UPDATE will still fail on one or other, no
matter which index you pick. If there is one row for ALL unique
indexes then it is irrelevant which index you pick. So either way, I
cannot see a reason to specify an index.
If we do need such a construct, we have already the concept of an
IDENTITY for a table, added in 9.4, currently targeted at replication.
Listing indexes or columns in the DML statement is more pushups for
developers and ORMs, so lets KISS.
The way forwards, in my view, is to define precisely the behaviour we
wish to have. That definition will include the best current mechanism
for running an UPSERT using INSERT/UPDATE/loops and comparing that
against what is being provided here. We will then have a functional
test of equivalence of the approaches, and a basis for making a
performance test that shows that performance is increased without any
loss of concurrency.
Once we have that, we can then be certain our time spent on internals
is not wasted by overlooking a simple userland gotcha.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 25, 2014 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
IMHO it is impossible to know if any of the other code is correct
until we have a clear and stable vision of what the command is
supposed to perform.
+1.
The inner workings are less important than what the feature does.
+1.
FWIW, the row available at the end of all BEFORE triggers is clearly
the object we should be manipulating, not the original VALUES()
clause. Otherwise this type of INSERT would behave differently from
normal INSERTs. Which would likely violate RLS, if nothing else.
+1.
Surely if there are multiple unique indexes then the result row must
be validated against all unique indexes before it is allowed at all?The only problem I see is if the newly inserted row matches one row on
one unique value and a different row on a different unique index.
Turning the INSERT into an UPDATE will still fail on one or other, no
matter which index you pick. If there is one row for ALL unique
indexes then it is irrelevant which index you pick. So either way, I
cannot see a reason to specify an index.
Failure could be the right thing in some cases. For example, imagine
that a user has a table containing names, email addresses, and (with
apologies for the American-ism, but I don't know what would be
comparable elsewhere) social security numbers. The user has unique
indexes on both email addresses and SSNs. If a new record arrives for
the same email address, they want to replace the existing record; but
a new record arrives with the same SSN, they want the transaction to
fail. Otherwise, a newly-arrived record might overwrite the email
address of an existing record, which they never want to do, because
they view email address as the primary key.
I think this kind of scenario will actually be pretty common.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 25 September 2014 15:35, Robert Haas <robertmhaas@gmail.com> wrote:
The only problem I see is if the newly inserted row matches one row on
one unique value and a different row on a different unique index.
Turning the INSERT into an UPDATE will still fail on one or other, no
matter which index you pick. If there is one row for ALL unique
indexes then it is irrelevant which index you pick. So either way, I
cannot see a reason to specify an index.Failure could be the right thing in some cases. For example, imagine
that a user has a table containing names, email addresses, and (with
apologies for the American-ism, but I don't know what would be
comparable elsewhere) social security numbers. The user has unique
indexes on both email addresses and SSNs. If a new record arrives for
the same email address, they want to replace the existing record; but
a new record arrives with the same SSN, they want the transaction to
fail. Otherwise, a newly-arrived record might overwrite the email
address of an existing record, which they never want to do, because
they view email address as the primary key.
I agree with your example, but not your conclusion.
If a new record arrives with a new email address that matches an
existing record it will fail. There is a case that would be allowed,
which would be a record that creates an entirely new email address. So
you do have a point to argue from.
However, IMV enforcing such a restriction should be done with an After
trigger, which is already possible, not by complicating a DML
statement with information it shouldn't need to know, or that might
change in the future.
Let's keep this new feature as simple as possible. ORMs everywhere
need to be encouraged to implement this and they won't do it unless it
is bone simple to use.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 25, 2014 at 11:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 25 September 2014 15:35, Robert Haas <robertmhaas@gmail.com> wrote:
The only problem I see is if the newly inserted row matches one row on
one unique value and a different row on a different unique index.
Turning the INSERT into an UPDATE will still fail on one or other, no
matter which index you pick. If there is one row for ALL unique
indexes then it is irrelevant which index you pick. So either way, I
cannot see a reason to specify an index.Failure could be the right thing in some cases. For example, imagine
that a user has a table containing names, email addresses, and (with
apologies for the American-ism, but I don't know what would be
comparable elsewhere) social security numbers. The user has unique
indexes on both email addresses and SSNs. If a new record arrives for
the same email address, they want to replace the existing record; but
a new record arrives with the same SSN, they want the transaction to
fail. Otherwise, a newly-arrived record might overwrite the email
address of an existing record, which they never want to do, because
they view email address as the primary key.I agree with your example, but not your conclusion.
If a new record arrives with a new email address that matches an
existing record it will fail. There is a case that would be allowed,
which would be a record that creates an entirely new email address. So
you do have a point to argue from.However, IMV enforcing such a restriction should be done with an After
trigger, which is already possible, not by complicating a DML
statement with information it shouldn't need to know, or that might
change in the future.
I've never been a fan of putting the index name in there. I agree
that's stuff that a DML statement shouldn't need to know about. What
I've advocated for in the past is specifying the list of columns that
should be used to determine whether to insert or update. If you have
a match on those columns, update the row; else insert. Any other
unique indexes stand or fall as may be.
I still think that idea has merit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 28 August 2014 03:43, Peter Geoghegan <pg@heroku.com> wrote:
"Value locking"
===========To date, on-list discussion around UPSERT has almost exclusively
concerned what I've called "value locking"; the idea of locking values
in unique indexes in the abstract (to establish the right to insert
ahead of time). There was some useful discussion on this question
between myself and Heikki back around December/January. Ultimately, we
were unable to reach agreement on an approach and discussion tapered
off. However, Heikki did understand the concerns that informed by
design. He recognized the need to be able to easily *release* value
locks, so as to avoid "unprincipled deadlocks", where under high
concurrency there are deadlocks between sessions that only UPSERT a
single row at a time. I'm not sure how widely appreciated this point
is, but I believe that Heikki appreciates it. It is a very important
point in my opinion. I don't want an implementation that is in any way
inferior to the "UPSERT looping subxact" pattern does (i.e. the plpsql
thing that the docs suggest).When we left off, Heikki continued to favor an approach that involved
speculatively inserting heap tuples, and then deleting them in the
event of a conflict. This design was made more complicated when the
need to *release* value locks became apparent (Heikki ended up making
some changes to HeapTupleSatisfiesDirty(), as well as sketching a
design for what you might call a "super delete", where xmin can be set
to InvalidTransactionId for speculatively-inserted heap tuples). After
all, it wasn't as if we could abort a subxact to release locks, which
is what the "UPSERT looping subxact" pattern does. I think it's fair
to say that that design became more complicated than initially
anticipated [4] [5].Anyway, the greater point here is that fundamentally, AFAICT Heikki
and I were in agreement. Once you buy into the idea that we must avoid
holding on to "value locks" of whatever form - as Heikki evidently did
- then exactly what form they take is ultimately only a detail.
Granted, it's a very important detail, but a detail nonetheless. It
can be discussed entirely independently of all of this new stuff, and
thank goodness for that.If anyone finds my (virtually unchanged) page heavyweight lock based
value locking approach objectionable, I ask that the criticism be
framed in a way that makes a sharp distinction between each of the
following:1. You don't accept that value locks must be easily released in the
event of a conflict. Is anyone in this camp? It's far from obvious to
me what side of this question Andres is on at this stage, for example.
Robert might have something to say here too.2. Having taken into account the experience of myself and Heikki, and
all that is implied by taking that approach ***while avoiding
unprincipled deadlocks***, you continue to believe that an approach
based on speculative heap insertion, or some alternative scheme is
better than what I have done to the nbtree code here, or you otherwise
dislike something about the proposed value locking scheme. You accept
that value locks must be released and released easily in the event of
a conflict, but like Heikki you just don't like what I've done to get
there.Since we can (I believe) talk about the value locking aspect and the
rest of the patch independently, we should do so...unless you're in
camp 1, in which case I guess that we'll have to thrash it out.
I'm trying to understand and help out with pushing this patch forwards
to completion.
Basically, I have absolutely no idea whether I object to or agree with
1) and don't know where to look to find out. We need a clear
exposition of design and the alternatives.
My approach would be to insert an index tuple for that value into the
index, but with the leaf ituple marked with an xid rather than a ctid.
If someone tries to insert into the index they would see this and wait
for the inserting transaction to end. The inserting transaction would
then resolve what happens in the heap (insert/update) and later
repoint the index tuple to the inserted/updated row version. I don't
see the need for page level locking since it would definitely result
in deadlocks (e.g. SQLServer).
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 25, 2014 at 11:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Basically, I have absolutely no idea whether I object to or agree with
1) and don't know where to look to find out. We need a clear
exposition of design and the alternatives.My approach would be to insert an index tuple for that value into the
index, but with the leaf ituple marked with an xid rather than a ctid.
If someone tries to insert into the index they would see this and wait
for the inserting transaction to end. The inserting transaction would
then resolve what happens in the heap (insert/update) and later
repoint the index tuple to the inserted/updated row version. I don't
see the need for page level locking since it would definitely result
in deadlocks (e.g. SQLServer).
The page level locks are only used to prevent concurrent insertion for
as long as it takes to get consensus to proceed among unique indexes,
and to actually insert a heap tuple. They're all released before we
lock the tuple for update, should we take that path (yes, really).
This is consistent with the behavior of other systems, I think. That's
my whole reason for preferring to do things that way. If you have a
"promise tuples" approach - be it what you outline here, or what
Heikki prototyped with heap tuple insertion, or any other - then you
need a way to *release* those "value locks" in the event of a
conflict/needing to update, before locking/updating. Otherwise, you
get deadlocks. This is an issue I highlighted when it came up with
Heikki's prototype.
AFAICT, any scheme for "value locking" needs to strongly consider the
need to *release* value locks inexpensively. Whatever else they do,
they cannot persist for the duration of the transaction IMV.
Does that make sense? If not, my next suggestion is applying an
earlier revision of Heikki's prototype, and seeing for yourself how it
can be made to deadlock in an unprincipled/impossible to prevent way
[1]: /messages/by-id/52B4AAF0.5090806@vmware.com -- Peter Geoghegan
pattern as something that this needs to be better than in every way.
This is one important way in which we might fail to live up to that
standard.
[1]: /messages/by-id/52B4AAF0.5090806@vmware.com -- Peter Geoghegan
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 25, 2014 at 7:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 25, 2014 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
IMHO it is impossible to know if any of the other code is correct
until we have a clear and stable vision of what the command is
supposed to perform.+1.
The inner workings are less important than what the feature does.
+1.
FWIW, the row available at the end of all BEFORE triggers is clearly
the object we should be manipulating, not the original VALUES()
clause. Otherwise this type of INSERT would behave differently from
normal INSERTs. Which would likely violate RLS, if nothing else.+1.
I agree with all of this. I'm glad that my opinion on how a
CONFLICTING() expression interacts with BEFORE triggers is accepted,
too.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers