Skipping duplicate records?
Hi,
I've recently migrated from MySQL to PostgreSQL and as impressed as I am
with Postgres, I have found one seemingly missing feature to be a little
bothersome..
'mysqlimport' has the ability to skip duplicate records when doing bulk
imports from non-binary files. PostgreSQL doesn't seem to have this
feature, and it causes a problem for me as I import extremely large
amounts of data into Postgres using 'copy' and it rejects the whole file
if one record breaches the primary key.
I have managed to get around this by hacking
src/backend/access/nbtree/nbtinsert.c to call elog with NOTICE instead of
ERROR, causing it to skip the duplicate record and continue importing.
Is there a way to get around this without changing the code? If not, will
a future release of Postgres optionally implement this?
Thanks in advance,
Steve Micallef
On Thu, 7 Jun 2001, Steve Micallef wrote:
'mysqlimport' has the ability to skip duplicate records when doing bulk
imports from non-binary files. PostgreSQL doesn't seem to have this
feature, and it causes a problem for me as I import extremely large
amounts of data into Postgres using 'copy' and it rejects the whole file
if one record breaches the primary key.
As a quick comment, I personnally find the above to be a *feature*. If
something goes wrong during the COPY, I really want this to be handled in
a transactional manner, and just not do anything. Else it's a pain to find
out WHAT was really inserted, etc.
Your problem is really that your input data is incorrect: it doesn't
respect the constraints you want on the data.
You could:
- import the data in an non-constrained table (no UNIQUE nor
PRIMARY KEY), when import is complete, remove the duplicates
assuming id is your to-be-primary-key:
SELECT t1.id
FROM temp_table t1, temp_table t2
WHERE (t1.id = t2.id) AND (t2.oid != t2.oid);
And now it's up to you to think and see which one of those records
with the duplicate IDs are the one to keep.