Bulkloading using COPY - ignore duplicates?

Started by Lee Kindness over 24 years ago · 65 messages · pgsql-hackers
#1Lee Kindness
lkindness@csl.co.uk

Hello,

I'm in the process of porting a large application from Ingres to
PostgreSQL. We make heavy use of bulkloading using the 'COPY'
statement in ESQL/C. Consider the SQL statements below (in a psql
session on an arbitrary database):

CREATE TABLE copytest(f1 INTEGER, f2 INTEGER);
CREATE UNIQUE INDEX copytest_idx ON copytest USING BTREE(f1, f2);
COPY copytest FROM '/tmp/copytest';

Given the file /tmp/copytest:

1 1
2 2
3 3
4 4
4 4
5 5
6 6

will result in the following output:

ERROR: copy: line 5, Cannot insert a duplicate key into unique index copytest_idx

However my application code is assuming that duplicate rows will
simply be ignored (this is the case in Ingres, and I believe Oracle's
bulkloader too). I propose modifying _bt_check_unique() in
/backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than
ERROR) elog() and return NULL (or appropriate) to the calling function
if a duplicate key is detected and a 'COPY FROM' is in progress (add
new parameter to flag this).

Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input? Would:

WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)

need to be added to the COPY command (I hope not)?

Thanks,

--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.

#2Justin Clift
justin@postgresql.org
In reply to: Lee Kindness (#1)
Re: Bulkloading using COPY - ignore duplicates?

Lee Kindness wrote:

<snip>

WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)

I would suggest :

WITH ON_DUPLICATE = IGNORE|TERMINATE

Or maybe IGNORE_DUPLICATE

purely for easier understanding, given there is no present standard nor
other databases' syntax to conform to.

:)

Regards and best wishes,

Justin Clift


--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lee Kindness (#1)
Re: Bulkloading using COPY - ignore duplicates?

Lee Kindness <lkindness@csl.co.uk> writes:

Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input?

Yes. This change will not be acceptable unless it's made an optional
(and not default, IMHO, though perhaps that's negotiable) feature of
COPY.

The implementation might be rather messy too. I don't much care for the
notion of a routine as low-level as bt_check_unique knowing that the
context is or is not COPY. We might have to do some restructuring.

Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?

It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.

regards, tom lane

#4Lee Kindness
lkindness@csl.co.uk
In reply to: Justin Clift (#2)
Re: Bulkloading using COPY - ignore duplicates?

Justin Clift writes:

Lee Kindness wrote:

WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)

I would suggest :
WITH ON_DUPLICATE = IGNORE|TERMINATE
purely for easier understanding, given there is no present standard nor
other databases' syntax to conform to.

Personally I don't see the need, and think that 'COPY FROM' could well
just go with the new semantics...

Onto an implementation issue - _bt_check_unique() returns a
TransactionId, my plans were to return NullTransactionId on a
duplicate key but naturally this is used in the success
scenario. Looking in backend/transam/transam.c I see:

TransactionId NullTransactionId = (TransactionId) 0;
TransactionId AmiTransactionId = (TransactionId) 512;
TransactionId FirstTransactionId = (TransactionId) 514;

From this I'd gather that values below 514 can be used as magic
values/constants. So would I be safe doing:

TransactionId XXXXTransactionId = (TransactionId) 1;

and return XXXXTransactionId from _bt_check_unique() back to
_bt_doinsert()? Naturally XXXX is something meaningful. I presume all
I need to know is whether 'xwait' in _bt_check_unique() is ever '1'...

Thanks,

--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.

#5Lee Kindness
lkindness@csl.co.uk
In reply to: Tom Lane (#3)
Re: Bulkloading using COPY - ignore duplicates?

Tom Lane writes:

Lee Kindness <lkindness@csl.co.uk> writes:

Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input?

Yes. This change will not be acceptable unless it's made an optional
(and not default, IMHO, though perhaps that's negotiable) feature of
COPY.

I see where you're coming from, but seriously what's the use/point of
COPY aborting and doing a rollback if one duplicate key is found? I
think it's quite reasonable to presume the input to COPY has had as
little processing done on it as possible. I could loop through the
input file before sending it to COPY but that's just wasting cycles
and effort - Postgres has btree lookup built in, I don't want to roll
my own before giving Postgres my input file!
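For context, the pre-processing pass being dismissed here could be a minimal sketch like the following, using the sample file from message #1 (file path and contents are just the example's; duplicate detection is byte-for-byte, which matches the duplicate-row case):

```shell
# Recreate the tab-delimited sample input from message #1.
printf '1\t1\n2\t2\n3\t3\n4\t4\n4\t4\n5\t5\n6\t6\n' > /tmp/copytest
# sort -u drops fully identical lines before handing the file to COPY.
sort -u /tmp/copytest > /tmp/copytest.dedup
wc -l < /tmp/copytest.dedup   # 6 lines remain of the original 7
```

As Lee notes, this costs an extra pass over the data, which is exactly the overhead he hopes to avoid.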

The implementation might be rather messy too. I don't much care
for the notion of a routine as low-level as bt_check_unique knowing
that the context is or is not COPY. We might have to do some
restructuring.

Well in reality it wouldn't be "you're getting run from copy" but
rather "notice on duplicate, rather than error & exit". There is a
telling comment in nbtinsert.c just before _bt_check_unique() is
called:

/*
* If we're not allowing duplicates, make sure the key isn't already
* in the index. XXX this belongs somewhere else, likely
*/

So perhaps dupes should be searched for before _bt_doinsert is called,
or somewhere more appropriate?

Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?

It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.

That makes quite a bit of sense.

--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lee Kindness (#5)
Re: Bulkloading using COPY - ignore duplicates?

Lee Kindness <lkindness@csl.co.uk> writes:

I see where you're coming from, but seriously what's the use/point of
COPY aborting and doing a rollback if one duplicate key is found?

Error detection. If I'm loading what I think is valid data, having the
system silently ignore certain types of errors is not acceptable ---
I'm especially not pleased at the notion of removing an error check
that's always been there because someone else thinks that would make it
more convenient for his application.

I think it's quite reasonable to presume the input to COPY has had as
little processing done on it as possible.

The primary and traditional use of COPY has always been to reload dumped
data. That's why it doesn't do any fancy processing like DEFAULT
insertion, and that's why it should be quite strict about error
conditions. In a reload scenario, any sort of problem deserves
careful investigation.

regards, tom lane

#7Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#6)
Re: Bulkloading using COPY - ignore duplicates?

Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input?

Yes. This change will not be acceptable unless it's made an optional
(and not default, IMHO, though perhaps that's negotiable) feature of
COPY.

The implementation might be rather messy too. I don't much
care for the
notion of a routine as low-level as bt_check_unique knowing that the
context is or is not COPY. We might have to do some restructuring.

Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?

It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.

Imho yes, but:
I thought the problem was that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into another index/heap, and thus this was
only sanely possible with savepoints/undo.

An idea would probably be to at once mark the new tuple dead, and
proceed normally?

Andreas

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#7)
Re: Bulkloading using COPY - ignore duplicates?

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

I thought that the problem was, that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into another index/heap, and thus this was
only sanely possible with savepoints/undo.

Hmm, good point. If we don't error out the transaction then that tuple
would become good when we commit. This is nastier than it appears.

regards, tom lane

#9Lee Kindness
lkindness@csl.co.uk
In reply to: Tom Lane (#6)
Re: Bulkloading using COPY - ignore duplicates?

Tom Lane writes:

I'm especially not pleased at the notion of removing an error check
that's always been there because someone else thinks that would make it
more convenient for his application.

Please, don't get me wrong - I don't want to come across as arrogant. I'm
simply trying to improve the 'COPY FROM' command in a situation where
speed is a critical issue and the data is dirty... And that must be a
relatively common scenario in industry.

And I never said the duplicate should be silently ignored - an
elog(NOTICE) should still be output.

Lee.

#10Thomas Swan
tswan@olemiss.edu
In reply to: Lee Kindness (#1)
Re: Bulkloading using COPY - ignore duplicates?

Lee Kindness wrote:

Tom Lane writes:

Lee Kindness <lkindness@csl.co.uk> writes:

Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input?

Yes. This change will not be acceptable unless it's made an optional
(and not default, IMHO, though perhaps that's negotiable) feature of
COPY.

I see where you're coming from, but seriously what's the use/point of
COPY aborting and doing a rollback if one duplicate key is found? I
think it's quite reasonable to presume the input to COPY has had as
little processing done on it as possible. I could loop through the
input file before sending it to COPY but that's just wasting cycles
and effort - Postgres has btree lookup built in, I don't want to roll
my own before giving Postgres my input file!

The implementation might be rather messy too. I don't much care
for the notion of a routine as low-level as bt_check_unique knowing
that the context is or is not COPY. We might have to do some
restructuring.

Well in reality it wouldn't be "you're getting run from copy" but
rather "notice on duplicate, rather than error & exit". There is a
telling comment in nbtinsert.c just before _bt_check_unique() is
called:

/*
* If we're not allowing duplicates, make sure the key isn't already
* in the index. XXX this belongs somewhere else, likely
*/

So perhaps dupes should be searched for before _bt_doinsert is called,
or somewhere more appropriate?

Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?

It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.

That makes quite a bit of sense.

This is trying to avoid one step.

IMHO, you should copy into a temporary table and then do a SELECT
DISTINCT from it into the table that you want.

A. You can validate your data before you put it into your permanent table.
B. This doesn't cost you much.

Don't make the assumption that bulk copies have not been checked or
validated. The assumption should be correct data or you shouldn't be
using COPY.
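The staging-table approach could be sketched like this against the tables from message #1 (the staging-table name is illustrative; note that rows duplicating keys already present in the target table would still raise the uniqueness error):

```sql
-- Stage the flat file, then move only distinct rows across.
CREATE TEMP TABLE copytest_stage (f1 INTEGER, f2 INTEGER);
COPY copytest_stage FROM '/tmp/copytest';
INSERT INTO copytest SELECT DISTINCT f1, f2 FROM copytest_stage;
DROP TABLE copytest_stage;
```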

#11Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Thomas Swan (#10)
Re: Bulkloading using COPY - ignore duplicates?

IMHO, you should copy into a temporary table and then do a select
distinct from it into the table that you want.

Which would be way too slow for normal operation :-(
We are talking about a "fast as possible" data load from a flat file
that may have duplicates (or even data errors, but that
is another issue).

Andreas

#12Thomas Swan
tswan@olemiss.edu
In reply to: Zeugswetter Andreas SB SD (#11)
Re: Bulkloading using COPY - ignore duplicates?

Zeugswetter Andreas SB SD wrote:

IMHO, you should copy into a temporary table and then do a select
distinct from it into the table that you want.

Which would be way too slow for normal operation :-(
We are talking about a "fast as possible" data load from a flat file
that may have duplicates (or even data errors, but that
is another issue).

Andreas

Then the IGNORE_DUPLICATE would definitely be the way to go, if speed is
the question...

#13Lee Kindness
lkindness@csl.co.uk
In reply to: Zeugswetter Andreas SB SD (#11)
Re: Bulkloading using COPY - ignore duplicates?

Okay,

If I'm going to modify 'COPY FROM' to include 'ignore duplicates'
functionality it looks like I'll have to add to the COPY syntax. The
most obvious way is to add:

WITH IGNORE DUPLICATES

to the syntax. I'm going to need my hand held a bit for this! The
grammar for COPY will need updating in gram.y and specifically the
'WITH' keyword will have 'IGNORE DUPLICATES' as well as 'NULL AS'.

Any pointers?

Thanks, Lee.

#14Lee Kindness
lkindness@csl.co.uk
In reply to: Lee Kindness (#13)
Re: Bulkloading using COPY - ignore duplicates?

Lee Kindness writes:

If I'm going to modify 'COPY FROM' to include 'ignore duplicates'
functionality it looks like I'll have to add to the COPY syntax. The
most obvious way is to add:
WITH IGNORE DUPLICATES

Or does it make more sense to add a 'COPY_IGNORE_DUPLICATES' SET
parameter?

Lee.

#15Bruce Momjian
bruce@momjian.us
In reply to: Lee Kindness (#1)
Re: Bulkloading using COPY - ignore duplicates?

However my application code is assuming that duplicate rows will
simply be ignored (this is the case in Ingres, and I believe Oracle's
bulkloader too). I propose modifying _bt_check_unique() in
/backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than
ERROR) elog() and return NULL (or appropriate) to the calling function
if a duplicate key is detected and a 'COPY FROM' is in progress (add
new parameter to flag this).

If you have a UNIQUE index on the table, just throwing away duplicates
seems really bad to me. I know Ingres had that heapsort structure that
would remove duplicates. That may be an interesting feature to add as
an operation that can be performed.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: Bulkloading using COPY - ignore duplicates?

I said:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

I thought that the problem was, that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into another index/heap, and thus this was
only sanely possible with savepoints/undo.

Hmm, good point. If we don't error out the transaction then that tuple
would become good when we commit. This is nastier than it appears.

On further thought, I think it *would* be possible to do this without
savepoints, but it'd take some restructuring of the index AM API.
What'd have to happen is that a unique index could not raise an elog
ERROR when it detects a uniqueness conflict. Instead, it'd return a
uniqueness-conflict indication back to its caller. This would have
to propagate up to the level of the executor. At that point we'd make
the choice of whether to raise an error or not. If not, we'd need to
modify the just-created tuple to mark it deleted by the current
transaction. We can't remove it, since that would leave any
already-created entries in other indexes pointing to nothing. But
marking it deleted by the same xact and command ID that inserted it
would leave things in a valid state until VACUUM comes along to do the
janitorial work.

To support backoff in the case of a conflict during UPDATE, it'd also be
necessary to un-mark the prior version of the tuple, which we'd already
marked as deleted. This might create some concurrency issues in case
there are other updaters waiting to see if we commit or not. (The same
issue would arise for savepoint-based undo, though.) We might want to
punt on that part for now.

The effects don't stop propagating there, either. The decision not to
insert the tuple must be reported up still further, so that the executor
knows not to run any AFTER INSERT/UPDATE triggers and knows not to count
the tuple as inserted/updated for the command completion report.

In short, quite a lot of code to touch to make this happen ...

regards, tom lane

#17Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#16)
Re: Bulkloading using COPY - ignore duplicates?

-----Original Message-----
From: Tom Lane

I said:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

I thought that the problem was, that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into an other index/heap, and thus this was
only sanely possible with savepoints/undo.

Hmm, good point. If we don't error out the transaction then that tuple
would become good when we commit. This is nastier than it appears.

On further thought, I think it *would* be possible to do this without
savepoints,

It's a very well known issue that partial rollback functionality is
the basis of this kind of problem, and it's the reason I've mentioned
that UNDO functionality has the highest priority. IMHO we shouldn't
implement partial rollback functionality specific to an individual
problem.

regards,
Hiroshi Inoue

#18Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Hiroshi Inoue (#17)
Re: Bulkloading using COPY - ignore duplicates?

The effects don't stop propagating there, either. The decision
not to insert the tuple must be reported up still further, so
that the executor knows not to run any AFTER INSERT/UPDATE
triggers and knows not to count the tuple as inserted/updated
for the command completion report.

But what about BEFORE insert/update triggers which could insert
records too?

Vadim

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#18)
Re: Bulkloading using COPY - ignore duplicates?

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

The effects don't stop propagating there, either. The decision
not to insert the tuple must be reported up still further, so
that the executor knows not to run any AFTER INSERT/UPDATE
triggers and knows not to count the tuple as inserted/updated
for the command completion report.

But what about BEFORE insert/update triggers which could insert
records too?

Well, what about them? It's already possible for a later BEFORE trigger
to cause the actual insertion to be suppressed, so I don't see any
difference from what we have now. If a BEFORE trigger takes actions
on the assumption that the insert will happen, it's busted already.

Mind you, I'm not actually advocating that we do any of this ;-).
I was just sketching a possible implementation approach in case someone
wants to try it.

regards, tom lane

#20Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#19)
Re: Bulkloading using COPY - ignore duplicates?

But what about BEFORE insert/update triggers which could
insert records too?

Well, what about them? It's already possible for a later
BEFORE trigger to cause the actual insertion to be suppressed,
so I don't see any difference from what we have now.
If a BEFORE trigger takes actions on the assumption that the
insert will happen, it's busted already.

This problem could be solved now by implementing a *single* trigger.
In the future, we could give users the ability to specify trigger
execution order.
But with the proposed feature ...

Mind you, I'm not actually advocating that we do any of this ;-).

I understand -:)

I was just sketching a possible implementation approach in
case someone wants to try it.

And I'm just sketching possible problems -:)

Vadim

#21Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#3)
#22Jim Buttafuoco
jim@buttafuoco.net
In reply to: Peter Eisentraut (#21)
#23Lee Kindness
lkindness@csl.co.uk
In reply to: Peter Eisentraut (#21)
#24Lee Kindness
lkindness@csl.co.uk
In reply to: Lee Kindness (#1)
#25Horák Daniel
horak@sit.plzen-city.cz
In reply to: Lee Kindness (#24)
#26Lee Kindness
lkindness@csl.co.uk
In reply to: Lee Kindness (#23)
#27Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Lee Kindness (#9)
#28Lee Kindness
lkindness@csl.co.uk
In reply to: Patrick Welche (#27)
#29Hannu Krosing
hannu@tm.ee
In reply to: Lee Kindness (#1)
#30Lee Kindness
lkindness@csl.co.uk
In reply to: Hannu Krosing (#29)
#31Lee Kindness
lkindness@csl.co.uk
In reply to: Lee Kindness (#30)
#32Patrick Welche
prlw1@newn.cam.ac.uk
In reply to: Lee Kindness (#28)
#33Lee Kindness
lkindness@csl.co.uk
In reply to: Patrick Welche (#32)
#34Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Lee Kindness (#33)
#35Peter Eisentraut
peter_e@gmx.net
In reply to: Lee Kindness (#26)
#36Peter Eisentraut
peter_e@gmx.net
In reply to: Lee Kindness (#23)
#37Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Tom Lane (#3)
#38Lee Kindness
lkindness@csl.co.uk
In reply to: Peter Eisentraut (#36)
#39Jim Buttafuoco
jim@buttafuoco.net
In reply to: Lee Kindness (#38)
#40Peter Eisentraut
peter_e@gmx.net
In reply to: Jim Buttafuoco (#39)
#41Lee Kindness
lkindness@csl.co.uk
In reply to: Peter Eisentraut (#40)
#42Lee Kindness
lkindness@csl.co.uk
In reply to: Peter Eisentraut (#40)
#43Peter Eisentraut
peter_e@gmx.net
In reply to: Lee Kindness (#42)
#44Lee Kindness
lkindness@csl.co.uk
In reply to: Peter Eisentraut (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lee Kindness (#44)
#46Lee Kindness
lkindness@csl.co.uk
In reply to: Tom Lane (#45)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lee Kindness (#46)
#48Lee Kindness
lkindness@csl.co.uk
In reply to: Tom Lane (#47)
#49Bruce Momjian
bruce@momjian.us
In reply to: Lee Kindness (#48)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#49)
#51Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#50)
#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#51)
#53Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#52)
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#53)
#55Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#54)
#56Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#55)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#56)
#58Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#57)
#59Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#58)
#60Bruce Momjian
bruce@momjian.us
In reply to: Mikheev, Vadim (#59)
#61Daniel Kalchev
daniel@digsys.bg
In reply to: Bruce Momjian (#60)
#62Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Daniel Kalchev (#61)
#63Daniel Kalchev
daniel@digsys.bg
In reply to: Mikheev, Vadim (#62)
#64Bruce Momjian
bruce@momjian.us
In reply to: Mikheev, Vadim (#62)
#65Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#64)