Getting sequence-generated IDs from multiple row insert
Hi folks,
I've just run into a subtle but fairly serious race condition while using
web.py's SQL library to insert multiple rows into the database and return
their IDs (a "serial primary key" column). Specifically I'm using the
multiple_insert() function that web.py defines here:
https://github.com/webpy/webpy/blob/master/web/db.py#L793
This function runs a query like this:
INSERT INTO table (col1, col2) VALUES (col1_1, col2_1), (col1_2,
col2_2), ...; SELECT currval('table_id_seq');
Using the output of the currval(), web.py tries to build a list of the most
recent IDs by creating a range from "currval - num_rows_inserted + 1"
through "currval". In Python:
out = range(out-len(values)+1, out+1)
This *looks* nice, and must have seemed fine to the developers who
implemented it, but I've just hit a case where two sessions each doing a
multiple insert don't use sequential IDs. For example, the range code above
for the first insert gave 2117552...2117829. And the second insert gave
2117625...2117818. Which are obviously overlapping and is a nasty bug
waiting to happen. Thankfully it caused an IntegrityError further down in
my code so I didn't screw things up.
First of all, I presume this is expected, and is how the sequence with a
multi-row insert is supposed to work? In other words, the sequence
guarantees the IDs will be unique, but with multi-row insert, they won't
necessarily be consecutive? If so, it's a fairly serious bug in web.py's
multiple_insert(), which probably shouldn't return anything due to this
issue.
Second, what's the right thing to do here? The first thing I found was
PostgreSQL's "RETURNING" clause, but somewhat frustratingly for this use
case, even that's not guaranteed to return the results in the order you
specified. I need the IDs in insertion order so I can do further
processing. Tom Lane and others in this thread indicate that this is not a
guarantee of the RETURNING clause, for future optimization reasons and due
to how SQL handles sets:
So currently I've changed my code to use RETURNING and then I'm ordering
the results based on a secondary column that I know the order of. This
works, but seems clunky, so I'm wondering if there's a nicer way.
Thanks,
Ben
On Mon, Mar 31, 2014 at 03:28:14PM -0400, Ben Hoyt wrote:
, but I've just hit a case where two sessions each doing a
multiple insert don't use sequential IDs. For example, the range code above
for the first insert gave 2117552...2117829. And the second insert gave
2117625...2117818. Which are obviously overlapping and is a nasty bug
waiting to happen. Thankfully it caused an IntegrityError further down in
my code so I didn't screw things up.
Good thing you caught it. But yes, just from the description it
looked like an obvious race to me. Concurrency is hard.
Second, what's the right thing to do here? The first thing I found was
PostgreSQL's "RETURNING" clause, but somewhat frustratingly for this use
case, even that's not guaranteed to return the results in the order you
specified.
In SQL, _nothing_ is guaranteed to return in the order you specified.
This isn't really a Postgres thing; unless you use ORDER BY, SQL's
sets are not ordered.
I need the IDs in insertion order so I can do further processing.
This sets off alarm bells for me. What further processing are you
doing? Is it possible that you could move that into a single step in
the database (maybe with a function or even a trigger) so that the
result of your RETURNING really would provide you with what you need?
So currently I've changed my code to use RETURNING and then I'm ordering
the results based on a secondary column that I know the order of. This
works, but seems clunky, so I'm wondering if there's a nicer way.
This is probably what I'd do, assuming that "further processing" isn't
more data transformation. If it _is_, then I'd do the whole thing in
a single step (in the database, once I inserted).
A
--
Andrew Sullivan
ajs@crankycanuck.ca
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Andrew Sullivan-8 wrote
So currently I've changed my code to use RETURNING and then I'm ordering
the results based on a secondary column that I know the order of. This
works, but seems clunky, so I'm wondering if there's a nicer way.This is probably what I'd do, assuming that "further processing" isn't
more data transformation. If it _is_, then I'd do the whole thing in
a single step (in the database, once I inserted).
If order is an implicit property of the source data then you need to
explicitly encode that order during (or before) import. There are numerous
ways to implement such but except for extremely simple cases PostgreSQL will
not do the appropriate thing automatically in the face of concurrency.
Also, do you need sequential IDs or just IDs that are ever increasing? And
if the later then tagging the input source will let you distinguish between
two different datasets even if their sequences are overlapping.
David J.
--
View this message in context: http://postgresql.1045698.n5.nabble.com/Getting-sequence-generated-IDs-from-multiple-row-insert-tp5798092p5798107.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Mon, Mar 31, 2014 at 01:34:04PM -0700, David Johnston wrote:
If order is an implicit property of the source data then you need to
explicitly encode that order during (or before) import.
Sure, but the problem the OP had I thought was that the RETURNING
clause doesn't guarantee that the rows coming back are in the order
they were inserted. This is just a SQL thing. (I guess you could
ORDER BY the RETURNING clause, right?)
There are numerous
ways to implement such but except for extremely simple cases PostgreSQL will
not do the appropriate thing automatically in the face of concurrency.
It _is_ doing the appropriate thing, though: this is SQL. The rows
aren't ordered unless you tell them to be.
A
--
Andrew Sullivan
ajs@crankycanuck.ca
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general