Why is unique constraint needed for upsert?

Started by Seamus Abshereover 11 years ago15 messagesgeneral
Jump to latest
#1Seamus Abshere
seamus@abshere.net

hi all,

Upsert is usually defined [1]http://postgresql.uservoice.com/forums/21853-general/suggestions/245202-merge-upsert-replace in reference to a violating a unique key:

Insert, if unique constraint violation then update; or update, if not found then insert.

Is this theoretically preferable to just looking for a row that matches
certain criteria, updating it if found or inserting otherwise?

For an example of the latter approach, see MongoDB's flavor of upsert
[2]: http://docs.mongodb.org/manual/reference/method/db.collection.update/
better because it functions correctly whether or not an index is in place.

Best, thanks,
Seamus

[1]: http://postgresql.uservoice.com/forums/21853-general/suggestions/245202-merge-upsert-replace
http://postgresql.uservoice.com/forums/21853-general/suggestions/245202-merge-upsert-replace
[2]: http://docs.mongodb.org/manual/reference/method/db.collection.update/

--
Seamus Abshere, SCEA
https://github.com/seamusabshere

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2John R Pierce
pierce@hogranch.com
In reply to: Seamus Abshere (#1)
Re: Why is unique constraint needed for upsert?

On 7/23/2014 10:21 AM, Seamus Abshere wrote:

hi all,

Upsert is usually defined [1] in reference to a violating a unique key:

Insert, if unique constraint violation then update; or update, if not
found then insert.

Is this theoretically preferable to just looking for a row that
matches certain criteria, updating it if found or inserting otherwise?

what happens when two connections do this more or less concurrently, in
transactions?

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Igor Neyman
ineyman@perceptron.com
In reply to: John R Pierce (#2)
Re: Why is unique constraint needed for upsert?

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of John R Pierce
Sent: Wednesday, July 23, 2014 1:32 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Why is unique constraint needed for upsert?

On 7/23/2014 10:21 AM, Seamus Abshere wrote:

hi all,

Upsert is usually defined [1] in reference to a violating a unique key:

Insert, if unique constraint violation then update; or update, if not
found then insert.

Is this theoretically preferable to just looking for a row that
matches certain criteria, updating it if found or inserting otherwise?

what happens when two connections do this more or less concurrently, in transactions?

--
john r pierce 37N 122W
somewhere on the middle of the left coast

Well, that's exactly why OP prefers Mongo, which doesn't care about such "small" things as ACID.

Regards,
Igor Neyman

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: John R Pierce (#2)
Re: Why is unique constraint needed for upsert?

John R Pierce <pierce@hogranch.com> writes:

On 7/23/2014 10:21 AM, Seamus Abshere wrote:

Upsert is usually defined [1] in reference to a violating a unique key:
Is this theoretically preferable to just looking for a row that
matches certain criteria, updating it if found or inserting otherwise?

what happens when two connections do this more or less concurrently, in
transactions?

For the OP's benefit --- the subtext John left unstated is that the
unique-key mechanism has already solved the problem of preventing
concurrent updates from creating duplicate keys. If we build a version of
UPSERT that doesn't rely on a unique index then it'll need some entirely
new mechanism to prevent concurrent key insertion. (And if you don't care
about concurrent cases, you don't really need UPSERT ...)

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Seamus Abshere
seamus@abshere.net
In reply to: Tom Lane (#4)
Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

On 7/23/14 3:40 PM, Tom Lane wrote:

John R Pierce <pierce@hogranch.com> writes:

On 7/23/2014 10:21 AM, Seamus Abshere wrote:

Upsert is usually defined [1] in reference to a violating a unique key:
Is this theoretically preferable to just looking for a row that
matches certain criteria, updating it if found or inserting otherwise?

what happens when two connections do this more or less concurrently, in
transactions?

For the OP's benefit --- the subtext John left unstated is that the
unique-key mechanism has already solved the problem of preventing
concurrent updates from creating duplicate keys. If we build a version of
UPSERT that doesn't rely on a unique index then it'll need some entirely
new mechanism to prevent concurrent key insertion. (And if you don't care
about concurrent cases, you don't really need UPSERT ...)

hi all,

What if we treat atomicity as optional? You could have extremely
readable syntax like:

-- no guarantees, no index required
UPSERT age = 5 INTO dogs WHERE name = 'Jerry';

-- optionally tell us how you want to deal with collision
UPSERT age = 3 INTO dogs WHERE name = 'Jerry' ON DUPLICATE KEY KEEP NEWEST;
UPSERT age = 5 INTO dogs WHERE name = 'Jerry' ON DUPLICATE KEY KEEP NEWEST;

-- only **require** (by throwing an error) a unique index or a locked table for queries like
UPSERT age = age+1 INTO dogs WHERE name = 'Jerry';

Obviously this flies in the face of what most people say the
"fundamental Upsert property" is [1]http://www.pgcon.org/2014/schedule/events/661.en.html

At READ COMMITTED isolation level, you should always get an atomic insert or update [1]

I just think there are a lot of non-concurrent bulk loading and
processing workflows that could benefit from the performance advantages
of upsert (one trip to database).

Best, thanks,
Seamus

[1]: http://www.pgcon.org/2014/schedule/events/661.en.html

--
Seamus Abshere, SCEA
https://github.com/seamusabshere

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Seamus Abshere (#5)
Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

Seamus Abshere <seamus@abshere.net> writes:

On 7/23/14 3:40 PM, Tom Lane wrote:

For the OP's benefit --- the subtext John left unstated is that the
unique-key mechanism has already solved the problem of preventing
concurrent updates from creating duplicate keys.

What if we treat atomicity as optional?

You'll get a *much* warmer response to that kind of suggestion from
MongoDB or MySQL, no doubt. PG is not in the business of optional
data integrity.

I just think there are a lot of non-concurrent bulk loading and
processing workflows that could benefit from the performance advantages
of upsert (one trip to database).

What exactly is your argument for supposing that an UPSERT without an
underlying index would perform so well? It seems much more likely
that it'd suck, because of having to do full-table scans to look
for existing rows.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7John R Pierce
pierce@hogranch.com
In reply to: Seamus Abshere (#5)
Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

On 7/23/2014 1:45 PM, Seamus Abshere wrote:

What if we treat atomicity as optional? You could have extremely
readable syntax like:

atomicity is not and never will be optional in PostgreSQL.

-- no guarantees, no index required
UPSERT age = 5 INTO dogs WHERE name = 'Jerry';

and if there's several rows with name='Jerry', you'd want to update them
ALL ? if name isn't indexed, this will, as Tom suggests, require a FULL
table scan, and it still will have issues with concurrency (connection
scans table, finds nothing, starts to insert, user 2 scans table, finds
nothing, starts to insert, poof, now we have two records?!?). If name
*is* indexed and unique, this collision will cause a error at commit for
at least one of those connections.

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Seamus Abshere
seamus@abshere.net
In reply to: John R Pierce (#7)
Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

On 7/23/14 6:03 PM, John R Pierce wrote:

On 7/23/2014 1:45 PM, Seamus Abshere wrote:

What if we treat atomicity as optional?

atomicity is not and never will be optional in PostgreSQL.

I'm wondering what a minimal definition of upsert could be - possibly
separating concurrency handling out as a (rigorously defined) option for
those who need it.

-- no guarantees, no index required
UPSERT age = 5 INTO dogs WHERE name = 'Jerry';

and if there's several rows with name='Jerry', you'd want to update them
ALL ? if name isn't indexed, this will, as Tom suggests, require a FULL
table scan, and it still will have issues with concurrency

Ah, I was just saying, in terms of correctness, it seems to me that
upsert shouldn't NEED a index to work, just like you don't need an index
on "name" when you say WHERE name = 'Jerry' in SELECTs or INSERTS or
UPDATES.

Appreciate the defense of data integrity in any case!!

Best,
Seamus

--
Seamus Abshere, SCEA
https://github.com/seamusabshere

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9David G. Johnston
david.g.johnston@gmail.com
In reply to: Seamus Abshere (#5)
Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

seamusabshere wrote

At READ COMMITTED isolation level, you should always get an atomic insert
or update [1]

I just think there are a lot of non-concurrent bulk loading and
processing workflows that could benefit from the performance advantages
of upsert (one trip to database).

Bulk load raw data into UNLOGGED staging table
LOCK production table
UPDATE matched records
INSERT unmatched records
UNLOCK production table
TRUNCATE staging table

This seems like it would be sufficient for "non-concurrent bulk loading"...

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Why-is-unique-constraint-needed-for-upsert-tp5812552p5812628.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10David G. Johnston
david.g.johnston@gmail.com
In reply to: Seamus Abshere (#8)
Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

seamusabshere wrote

On 7/23/14 6:03 PM, John R Pierce wrote:

On 7/23/2014 1:45 PM, Seamus Abshere wrote:

What if we treat atomicity as optional?

atomicity is not and never will be optional in PostgreSQL.

I'm wondering what a minimal definition of upsert could be - possibly
separating concurrency handling out as a (rigorously defined) option for
those who need it.

I don't know how you can avoid the implicit need for an "IF" in the
algorithm. I guess if you had some way to force an INSERT to automatically
hide any previous entries/records with the same PK you could move the
checking to the read side of the equation - and deal with the necessary
periodic cleanup. At this point you are basically implementing a Temporal
database...

If you leave the checking to occur during write why wouldn't you want an
index to make that go faster? It isn't mandatory but any performant
implementation is going to use one.

You can enforce a "unique constraint violation" without an index so you
initial premise is wrong - though again why would you want to?

Also, why do you assume MongoDB doesn't use an index to execute the supplied
query?

From your link:

"To prevent MongoDB from inserting the same document more than once, create
a unique index on the name field. With a unique index, if an applications
issues a group of upsert operations, exactly one update() would successfully
insert a new document."

Given we do not have native UPSERT I'm not sure where your question is
coming from anyway. I'm not sure what the plans are for UPSERT at the
moment but nothing prevents us from performing the UPSERT comparison on a
non-uniqe set of columns. If the only unique index on a table is its
"serial" column then you would get behavior similar to MongoDB w/o a unique
index on "name".

Though that does re-up the question about what happens when you issue a
subsequent UPSERT and more than one matching record is returned...the most
logical being apply the update to all matched records.

I have difficulty imaging a situation where this would be desirable. If I
am using UPSERT I am defining a complete entity that I need to cause to
exist. If three of them already exist there is some differentiating factor
between them that my UPSERT command would clobber. The example given in the
MongoDB link is not a particularly convincing use-case.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Why-is-unique-constraint-needed-for-upsert-tp5812552p5812631.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Seamus Abshere
seamus@abshere.net
In reply to: David G. Johnston (#10)
Re: Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

On 7/23/14 6:50 PM, David G Johnston wrote:

seamusabshere wrote

On 7/23/14 6:03 PM, John R Pierce wrote:

On 7/23/2014 1:45 PM, Seamus Abshere wrote:

What if we treat atomicity as optional?

atomicity is not and never will be optional in PostgreSQL.

I'm wondering what a minimal definition of upsert could be - possibly
separating concurrency handling out as a (rigorously defined) option for
those who need it.

Given we do not have native UPSERT I'm not sure where your question is
coming from anyway. I'm not sure what the plans are for UPSERT at the
moment but nothing prevents us from performing the UPSERT comparison on a
non-uniqe set of columns.

hi David,

My argument lives and dies on the assumption that UPSERT would be useful
even if it was (when given with no options) just a macro for

UPDATE db SET b = data WHERE a = key;
IF NOT found THEN
INSERT INTO db(a,b) VALUES (key, data);
END IF;

Adding things like unique indexes would work like you would expect with
individual INSERTs or UPDATEs - your statement might raise an exception.
Then, going beyond, UPSERT would optionally support atomic "a = a+1"
stuff, special actions to take on duplicate keys, all the concurrency
stuff that people have been talking about.

IMO having such a complicated definition of what an upsert "must" be
makes it a unicorn when it could just be a sibling to INSERT and UPDATE.

Best,
Seamus

--
Seamus Abshere, SCEA
https://github.com/seamusabshere

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#12John R Pierce
pierce@hogranch.com
In reply to: Seamus Abshere (#11)
Re: Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

On 7/23/2014 3:29 PM, Seamus Abshere wrote:

My argument lives and dies on the assumption that UPSERT would be
useful even if it was (when given with no options) just a macro for

UPDATE db SET b = data WHERE a = key;
IF NOT found THEN
INSERT INTO db(a,b) VALUES (key, data);
END IF;

but that won't work if two connections execute similar 'upserts'
concurrently. both updates will see the record isn't there, then one
or the other insert will fail, depending on which transaction commits first.

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13David G. Johnston
david.g.johnston@gmail.com
In reply to: Seamus Abshere (#11)
Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

hi David,

My argument lives and dies on the assumption that UPSERT would be useful
even if it was (when given with no options) just a macro for

UPDATE db SET b = data WHERE a = key;
IF NOT found THEN
INSERT INTO db(a,b) VALUES (key, data);
END IF;

Adding things like unique indexes would work like you would expect with
individual INSERTs or UPDATEs - your statement might raise an exception.
Then, going beyond, UPSERT would optionally support atomic "a = a+1"
stuff, special actions to take on duplicate keys, all the concurrency
stuff that people have been talking about.

IMO having such a complicated definition of what an upsert "must" be
makes it a unicorn when it could just be a sibling to INSERT and UPDATE.

Fair enough. I'd personally much rather have a staging table and use
writeable CTEs to implement something that simple - retrying on the off
chance an error occurs.

I'd use UPSERT (probably still with a staging table) if I expect a high
level of concurrency is going to force me to retry often and the
implementation will handle that for me.

To be honest though I haven't given it that much thought as I've had little
need for it.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Why-is-unique-constraint-needed-for-upsert-tp5812552p5812641.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

#14Seamus Abshere
seamus@abshere.net
In reply to: John R Pierce (#12)
Re: Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

On 7/23/14 7:45 PM, John R Pierce wrote:

On 7/23/2014 3:29 PM, Seamus Abshere wrote:

My argument lives and dies on the assumption that UPSERT would be
useful even if it was (when given with no options) just a macro for

UPDATE db SET b = data WHERE a = key;
IF NOT found THEN
INSERT INTO db(a,b) VALUES (key, data);
END IF;

but that won't work if two connections execute similar 'upserts'
concurrently. both updates will see the record isn't there, then one
or the other insert will fail, depending on which transaction commits
first.

John,

Right - if you had a situation where that might happen, you would use a
slightly more advanced version of the UPSERT command (and/or add a
unique index).

UPSERT, in this conception and in its most basic form, would be subject
to many of the same (and more) concurrency concerns as basic INSERTs and
UPDATEs.

Providing options may be preferable magically handling everything.

Best,
Seamus

--
Seamus Abshere, SCEA
+598 99 54 99 54
https://github.com/seamusabshere

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#15John R Pierce
pierce@hogranch.com
In reply to: Seamus Abshere (#14)
Re: Re: Why is unique constraint needed for upsert? (treat atomicity as optional)

On 7/23/2014 3:58 PM, Seamus Abshere wrote:

Right - if you had a situation where that might happen, you would use
a slightly more advanced version of the UPSERT command (and/or add a
unique index).

a unique index wouldn't resolve the problem. without one, you'd end up
with two records, with one, you'd end up with an error.

naive programmers never seem to expect concurrency, its something that
just happens.

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general