Issues with upserts
The wiki entry https://wiki.postgresql.org/wiki/UPSERT kind of suggests that
with the introduction of ON CONFLICT DO UPDATE the problem of upserts is
solved. But is it?
A common use case for upserts is to keep a table up to date from an external
data source. So you might have a cron job that runs a relatively large batch
of upserts every couple of minutes.
I have found that this use case is not adequately covered by ON CONFLICT DO
UPDATE for two reasons:
- New versions are created for all rows, even if the data is identical. This
quickly fills up the WAL and puts unnecessary load on the tablespace drives.
- If the conflict target is not the serial column, the sequence backing the
serial column gets incremented for every row. This quickly reaches the point
where the serial for new rows exceeds the range of an integer.
Does this mean I have to SELECT the data first and do the conflict check in
the application or is there a better SQL-only way?
It sounds like you aren't adding a WHERE clause to prevent the duplicate
rows from being updated. It would help if you could share your query, but
in general this could look like this:
INSERT INTO my_table (col1, col2)
SELECT col1, col2 FROM other_table
ON CONFLICT (col1, col2) DO UPDATE
SET (col1, col2) = (EXCLUDED.col1, EXCLUDED.col2)
WHERE (my_table.col1, my_table.col2) IS DISTINCT FROM (EXCLUDED.col1, EXCLUDED.col2);
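As a self-contained illustration of this pattern, here is a sketch using Python's sqlite3 module (SQLite's upsert syntax is close enough for the purpose: `excluded` plays the role of PostgreSQL's `EXCLUDED`, and `IS NOT` plays the role of `IS DISTINCT FROM`). The table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT PRIMARY KEY, address TEXT NOT NULL)")

def upsert(rows):
    # Upsert that skips the UPDATE when the stored row already matches the
    # incoming one, so unchanged rows are not rewritten.
    before = conn.total_changes
    conn.executemany(
        """
        INSERT INTO t(name, address) VALUES (?, ?)
        ON CONFLICT(name) DO UPDATE SET address = excluded.address
        WHERE t.address IS NOT excluded.address
        """,
        rows,
    )
    return conn.total_changes - before  # rows actually written

print(upsert([("foo", "Baker street"), ("bar", "Miller street")]))  # 2 (inserted)
print(upsert([("foo", "Baker street"), ("bar", "Miller street")]))  # 0 (unchanged)
print(upsert([("foo", "Baker street"), ("bar", "New street")]))     # 1 (one update)
```

The second call writes nothing because the WHERE clause on the DO UPDATE filters out rows whose stored values already equal the incoming ones, which is exactly what keeps identical rows from generating new row versions.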
On Wednesday, July 13, 2022, André Hänsel <andre@webkr.de> wrote:
Well, first of all, don’t use serial/auto-increment in this situation.
Second, you need to figure out what the unique key in the provided data is
and key off that for detecting duplicates. Ideally the data already
contains some bigint key column, which makes the first point easy to accept.
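A sketch of that suggestion, again via sqlite3 (all names here are hypothetical, and the external feed is assumed to already carry a 64-bit id): the table is keyed directly on the provider's identifier, so there is no local serial/sequence to increment on every upsert attempt.

```python
import sqlite3

# Hypothetical schema: key the table on the id the external data source
# already provides (assumed to be a 64-bit integer), instead of a serial.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE feed_items (
        external_id INTEGER PRIMARY KEY,  -- the provider's own bigint key
        payload     TEXT NOT NULL
    )
    """
)
conn.executemany(
    """
    INSERT INTO feed_items(external_id, payload) VALUES (?, ?)
    ON CONFLICT(external_id) DO UPDATE SET payload = excluded.payload
    WHERE feed_items.payload IS NOT excluded.payload
    """,
    [(10**12 + 1, "a"), (10**12 + 2, "b")],
)
rows = conn.execute(
    "SELECT external_id, payload FROM feed_items ORDER BY external_id"
).fetchall()
print(rows)  # [(1000000000001, 'a'), (1000000000002, 'b')]
```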
David J.
Jeremy Smith wrote:
Here’s an example:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=b48d062d2eedbab14157359694b16081
CREATE TABLE t (
id serial PRIMARY KEY,
name text NOT NULL UNIQUE,
address text NOT NULL
);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
INSERT INTO t(name, address)
VALUES ('foo', 'Baker street'),('bar', 'Miller street')
ON CONFLICT (name) DO UPDATE SET address = EXCLUDED.address
WHERE (t.name, t.address) IS DISTINCT FROM (EXCLUDED.name, EXCLUDED.address);
SELECT last_value FROM t_id_seq;
This will yield “8”, showing that new sequence numbers have been generated for each attempt.
André Hänsel <andre@webkr.de> writes:
This will yield “8”, showing that new sequence numbers have been generated for each attempt.
Well, yeah, because the INSERT has to be attempted first, and
that includes forming the whole candidate row including the
nextval() result. If you're expecting a serial ID column to not
have holes in the sequence of values, you're going to be sadly
disappointed, whether you use ON CONFLICT UPDATE or not.
regards, tom lane
On Wed, Jul 13, 2022 at 7:58 AM André Hänsel <andre@webkr.de> wrote:
SELECT last_value FROM t_id_seq;
This will yield “8”, showing that new sequence numbers have been generated
for each attempt.
Yep, an entire able-to-be-inserted tuple is formed every time. That
requires evaluating defaults so that every column has a value.
David J.
On 7/13/22 07:58, André Hänsel wrote:
Jeremy Smith wrote:
CREATE TABLE t (
id serial PRIMARY KEY,
name text NOT NULL UNIQUE,
address text NOT NULL
);
This will yield “8”, showing that new sequence numbers have been
generated for each attempt.
If running out of ids is a concern, use bigserial instead of serial, as it
is backed by bigint:
bigint (bigserial): -9223372036854775808 to +9223372036854775807
integer (serial):   -2147483648 to +2147483647
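To put rough numbers on the overflow risk (a hypothetical workload, not one stated in the thread): suppose the cron job upserts 1,000,000 rows every 5 minutes and every attempt burns one sequence number.

```python
INT4_MAX = 2**31 - 1  # 2147483647, the ceiling for serial
INT8_MAX = 2**63 - 1  # 9223372036854775807, the ceiling for bigserial

# Hypothetical workload: 1,000,000 upserted rows every 5 minutes, each
# attempt consuming one sequence number even when nothing changes.
ids_per_day = 1_000_000 * (24 * 60 // 5)  # 288,000,000 ids per day

days_until_int4_overflow = INT4_MAX / ids_per_day         # ~7.5 days
years_until_int8_overflow = INT8_MAX / ids_per_day / 365  # ~88 million years

print(round(days_until_int4_overflow, 1))  # 7.5
```

Under these assumptions a plain serial column overflows in about a week, while bigserial is effectively inexhaustible, which is why the bigint-backed type is the safer default for upsert-heavy tables.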
--
Adrian Klaver
adrian.klaver@aklaver.com