Bulkloading using COPY - ignore duplicates?
Hello,
I'm in the process of porting a large application from Ingres to
PostgreSQL. We make heavy use of bulkloading using the 'COPY'
statement in ESQL/C. Consider the SQL statements below (in a psql
session on an arbitrary database):
CREATE TABLE copytest(f1 INTEGER, f2 INTEGER);
CREATE UNIQUE INDEX copytest_idx ON copytest USING BTREE(f1, f2);
COPY copytest FROM '/tmp/copytest';
Given the file /tmp/copytest:
1 1
2 2
3 3
4 4
4 4
5 5
6 6
will result in the following output:
ERROR: copy: line 5, Cannot insert a duplicate key into unique index copytest_idx
However my application code is assuming that duplicate rows will
simply be ignored (this is the case in Ingres, and I believe Oracle's
bulkloader too). I propose modifying _bt_check_unique() in
/backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than
ERROR) elog() and return NULL (or appropriate) to the calling function
if a duplicate key is detected and a 'COPY FROM' is in progress (add
new parameter to flag this).
Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input? Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?
Thanks,
--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Lee Kindness wrote:
<snip>
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
I would suggest :
WITH ON_DUPLICATE = IGNORE|TERMINATE
Or maybe IGNORE_DUPLICATE
purely for easier understanding, given there is no present standard nor
other databases' syntax to conform to.
:)
Regards and best wishes,
Justin Clift
--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
Lee Kindness <lkindness@csl.co.uk> writes:
Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input?
Yes. This change will not be acceptable unless it's made an optional
(and not default, IMHO, though perhaps that's negotiable) feature of
COPY.
The implementation might be rather messy too. I don't much care for the
notion of a routine as low-level as bt_check_unique knowing that the
context is or is not COPY. We might have to do some restructuring.
Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?
It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.
regards, tom lane
Justin Clift writes:
Lee Kindness wrote:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
I would suggest :
WITH ON_DUPLICATE = IGNORE|TERMINATE
purely for easier understanding, given there is no present standard nor
other databases' syntax to conform to.
Personally I don't see the need, and think that 'COPY FROM' could well
just go with the new semantics...
Onto an implementation issue - _bt_check_unique() returns a
TransactionId, my plans were to return NullTransactionId on a
duplicate key but naturally this is used in the success
scenario. Looking in backend/transam/transam.c I see:
TransactionId NullTransactionId = (TransactionId) 0;
TransactionId AmiTransactionId = (TransactionId) 512;
TransactionId FirstTransactionId = (TransactionId) 514;
From this I'd gather <514 can be used as magic-values/constants, so
would I be safe doing:
TransactionId XXXXTransactionId = (TransactionId) 1;
and return XXXXTransactionId from _bt_check_unique() back to
_bt_doinsert()? Naturally XXXX is something meaningful. I presume all
I need to know is if 'xwait' in _bt_check_unique() is ever '1'...
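A minimal, standalone illustration of the sentinel control flow described
above - the constant name DuplicateTransactionId and the helper
check_unique() are hypothetical stand-ins, not the real backend definitions:

#include <stdio.h>

typedef unsigned int TransactionId;

#define NullTransactionId       ((TransactionId) 0)
#define DuplicateTransactionId  ((TransactionId) 1)   /* hypothetical sentinel */

/* stand-in for _bt_check_unique(): returns the xid to wait on,
   NullTransactionId on success, or the sentinel on a duplicate key */
static TransactionId
check_unique(int key_already_present)
{
    return key_already_present ? DuplicateTransactionId : NullTransactionId;
}

int
main(void)
{
    TransactionId xwait = check_unique(1);

    if (xwait == DuplicateTransactionId)
        printf("NOTICE: duplicate key, row skipped\n");
    else if (xwait != NullTransactionId)
        printf("must wait on transaction %u\n", xwait);
    else
        printf("inserted\n");
    return 0;
}

In the proposal the caller would issue an elog(NOTICE) and skip the
heap/index insertion when it sees the sentinel, rather than print anything.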
Thanks,
--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Tom Lane writes:
Lee Kindness <lkindness@csl.co.uk> writes:
Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input?
Yes. This change will not be acceptable unless it's made an optional
(and not default, IMHO, though perhaps that's negotiable) feature of
COPY.
I see where you're coming from, but seriously what's the use/point of
COPY aborting and doing a rollback if one duplicate key is found? I
think it's quite reasonable to presume the input to COPY has had as
little processing done on it as possible. I could loop through the
input file before sending it to COPY but that's just wasting cycles
and effort - Postgres has btree lookup built in, I don't want to roll
my own before giving Postgres my input file!
The implementation might be rather messy too. I don't much care
for the notion of a routine as low-level as bt_check_unique knowing
that the context is or is not COPY. We might have to do some
restructuring.
Well in reality it wouldn't be "you're getting run from copy" but
rather "notice on duplicate, rather than error & exit". There is a
telling comment in nbtinsert.c just before _bt_check_unique() is
called:
/*
* If we're not allowing duplicates, make sure the key isn't already
* in the index. XXX this belongs somewhere else, likely
*/
So perhaps dupes should be searched for before _bt_doinsert is called,
or somewhere more appropriate?
Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?
It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.
That makes quite a bit of sense.
--
Lee Kindness, Senior Software Engineer
Concept Systems Limited.
Lee Kindness <lkindness@csl.co.uk> writes:
I see where you're coming from, but seriously what's the use/point of
COPY aborting and doing a rollback if one duplicate key is found?
Error detection. If I'm loading what I think is valid data, having the
system silently ignore certain types of errors is not acceptable ---
I'm especially not pleased at the notion of removing an error check
that's always been there because someone else thinks that would make it
more convenient for his application.
I think it's quite reasonable to presume the input to COPY has had as
little processing done on it as possible.
The primary and traditional use of COPY has always been to reload dumped
data. That's why it doesn't do any fancy processing like DEFAULT
insertion, and that's why it should be quite strict about error
conditions. In a reload scenario, any sort of problem deserves
careful investigation.
regards, tom lane
Would this seem a reasonable thing to do? Does anyone rely on COPY
FROM causing an ERROR on duplicate input?
Yes. This change will not be acceptable unless it's made an optional
(and not default, IMHO, though perhaps that's negotiable) feature of
COPY.
The implementation might be rather messy too. I don't much care for the
notion of a routine as low-level as bt_check_unique knowing that the
context is or is not COPY. We might have to do some restructuring.
Would:
WITH ON_DUPLICATE = CONTINUE|TERMINATE (or similar)
need to be added to the COPY command (I hope not)?
It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.
Imho yes, but:
I thought that the problem was that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into another index/heap, and thus this was
only sanely possible with savepoints/undo.
An idea would probably be to at once mark the new tuple dead, and
proceed
normally?
Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
I thought that the problem was, that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into an other index/heap, and thus this was
only sanely possible with savepoints/undo.
Hmm, good point. If we don't error out the transaction then that tuple
would become good when we commit. This is nastier than it appears.
regards, tom lane
Tom Lane writes:
I'm especially not pleased at the notion of removing an error check
that's always been there because someone else thinks that would make it
more convenient for his application.
Please, don't get me wrong - I don't want to come across arrogant. I'm
simply trying to improve the 'COPY FROM' command in a situation where
speed is a critical issue and the data is dirty... And that must be a
relatively common scenario in industry.
And I never said the duplicate should be silently ignored - an
elog(NOTICE) should still be output.
Lee.
Lee Kindness wrote:
<snip>
I could loop through the input file before sending it to COPY but
that's just wasting cycles and effort - Postgres has btree lookup
built in, I don't want to roll my own before giving Postgres my
input file!
<snip>
This is trying to avoid one step.
IMHO, you should copy into a temporary table and then do a select
distinct from it into the table that you want.
A. You can validate your data before you put it into your permanent table.
B. This doesn't cost you much.
Don't make the assumption that bulk copies have not been checked or
validated. The assumption should be correct data or you shouldn't be
using COPY.
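For what it's worth, a minimal libpq sketch of that staging-table approach,
reusing the copytest table and /tmp/copytest file from the start of the
thread (the names are just placeholders, and error handling is trimmed):

#include <stdio.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK &&
        PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
    PQclear(res);
}

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* stage the raw file in a scratch table with the same layout ... */
    run(conn, "CREATE TEMP TABLE copytest_tmp AS SELECT * FROM copytest LIMIT 0");
    /* server-side COPY: the file must be readable by the backend */
    run(conn, "COPY copytest_tmp FROM '/tmp/copytest'");
    /* ... then de-duplicate on the way into the real table */
    run(conn, "INSERT INTO copytest SELECT DISTINCT f1, f2 FROM copytest_tmp");

    PQfinish(conn);
    return 0;
}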
IMHO, you should copy into a temporary table and then do a select
distinct from it into the table that you want.
Which would be way too slow for normal operation :-(
We are talking about a "fast as possible" data load from a flat file
that may have duplicates (or even data errors, but that
is another issue).
Andreas
Zeugswetter Andreas SB SD wrote:
IMHO, you should copy into a temporary table and then do a select
distinct from it into the table that you want.
Which would be way too slow for normal operation :-(
We are talking about a "fast as possible" data load from a flat file
that may have duplicates (or even data errors, but that
is another issue).
Andreas
Then the IGNORE_DUPLICATE would definitely be the way to go, if speed is
the question...
Okay,
If I'm going to modify 'COPY INTO' to include 'ignore duplicates'
functionality it looks like I'll have to add to the COPY syntax. The
most obvious way is to add:
WITH IGNORE DUPLICATES
to the syntax. I'm going to need my hand held a bit for this! The
grammar for COPY will need updating in gram.y and specifically the
'WITH' keyword will have 'IGNORE DUPLICATES' as well as 'NULL AS'.
Any pointers?
Thanks, Lee.
Lee Kindness writes:
If I'm going to modify 'COPY INTO' to include 'ignore duplicates'
functionality it looks like I'll have to add to the COPY syntax. The
most obvious way is to add:
WITH IGNORE DUPLICATES
Or does it make more sense to add a 'COPY_IGNORE_DUPLICATES' SET
parameter?
Lee.
However my application code is assuming that duplicate rows will
simply be ignored (this is the case in Ingres, and I believe Oracle's
bulkloader too). I propose modifying _bt_check_unique() in
/backend/access/nbtree/nbtinsert.c to emit a NOTICE (rather than
ERROR) elog() and return NULL (or appropriate) to the calling function
if a duplicate key is detected and a 'COPY FROM' is in progress (add
new parameter to flag this).
If you have a UNIQUE index on the table, just throwing away duplicates
seems really bad to me. I know Ingres had that heapsort structure that
would remove duplicates. That may be an interesting feature to add as
an operation that can be performed.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I said:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
I thought that the problem was, that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into an other index/heap, and thus this was
only sanely possible with savepoints/undo.
Hmm, good point. If we don't error out the transaction then that tuple
would become good when we commit. This is nastier than it appears.
On further thought, I think it *would* be possible to do this without
savepoints, but it'd take some restructuring of the index AM API.
What'd have to happen is that a unique index could not raise an elog
ERROR when it detects a uniqueness conflict. Instead, it'd return a
uniqueness-conflict indication back to its caller. This would have
to propagate up to the level of the executor. At that point we'd make
the choice of whether to raise an error or not. If not, we'd need to
modify the just-created tuple to mark it deleted by the current
transaction. We can't remove it, since that would leave any
already-created entries in other indexes pointing to nothing. But
marking it deleted by the same xact and command ID that inserted it
would leave things in a valid state until VACUUM comes along to do the
janitorial work.
To support backoff in the case of a conflict during UPDATE, it'd also be
necessary to un-mark the prior version of the tuple, which we'd already
marked as deleted. This might create some concurrency issues in case
there are other updaters waiting to see if we commit or not. (The same
issue would arise for savepoint-based undo, though.) We might want to
punt on that part for now.
The effects don't stop propagating there, either. The decision not to
insert the tuple must be reported up still further, so that the executor
knows not to run any AFTER INSERT/UPDATE triggers and knows not to count
the tuple as inserted/updated for the command completion report.
In short, quite a lot of code to touch to make this happen ...
regards, tom lane
-----Original Message-----
From: Tom Lane
I said:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
I thought that the problem was, that you cannot simply skip the
insert, because at that time the tuple (pointer) might have already
been successfully inserted into an other index/heap, and thus this was
only sanely possible with savepoints/undo.
Hmm, good point. If we don't error out the transaction then that tuple
would become good when we commit. This is nastier than it appears.
On further thought, I think it *would* be possible to do this without
savepoints,
It's a very well known issue that the partial rollback functionality is
a basis of this kind of problem and it's the reason I've mentioned that
UNDO functionality has the highest priority. IMHO we shouldn't
implement a partial rollback functionality specific to an individual
problem.
regards,
Hiroshi Inoue
The effects don't stop propagating there, either. The decision
not to insert the tuple must be reported up still further, so
that the executor knows not to run any AFTER INSERT/UPDATE
triggers and knows not to count the tuple as inserted/updated
for the command completion report.
But what about BEFORE insert/update triggers which could insert
records too?
Vadim
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
The effects don't stop propagating there, either. The decision
not to insert the tuple must be reported up still further, so
that the executor knows not to run any AFTER INSERT/UPDATE
triggers and knows not to count the tuple as inserted/updated
for the command completion report.
But what about BEFORE insert/update triggers which could insert
records too?
Well, what about them? It's already possible for a later BEFORE trigger
to cause the actual insertion to be suppressed, so I don't see any
difference from what we have now. If a BEFORE trigger takes actions
on the assumption that the insert will happen, it's busted already.
Mind you, I'm not actually advocating that we do any of this ;-).
I was just sketching a possible implementation approach in case someone
wants to try it.
regards, tom lane
But what about BEFORE insert/update triggers which could
insert records too?
Well, what about them? It's already possible for a later
BEFORE trigger to cause the actual insertion to be suppressed,
so I don't see any difference from what we have now.
If a BEFORE trigger takes actions on the assumption that the
insert will happen, it's busted already.
This problem could be solved now by implementing *single* trigger.
In future, we could give users ability to specify trigger
execution order.
But with proposed feature ...
Mind you, I'm not actually advocating that we do any of this ;-).
I understand -:)
I was just sketching a possible implementation approach in
case someone wants to try it.
And I'm just sketching possible problems -:)
Vadim
Tom Lane writes:
It occurs to me that skip-the-insert might be a useful option for
INSERTs that detect a unique-key conflict, not only for COPY. (Cf.
the regular discussions we see on whether to do INSERT first or
UPDATE first when the key might already exist.) Maybe a SET variable
that applies to all forms of insertion would be appropriate.
What we need is:
1. Make errors not abort the transaction.
2. Error codes
Then you can make your client deal with this in which ever way you want,
at least for single-value inserts.
However, it seems to me that COPY ignoring duplicates can easily be done
by preprocessing the input file.
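As a rough sketch of such preprocessing - assuming tab-separated input
that is already sorted on the key (taken here to be the first two
columns) - a small filter like this keeps the first line of each run of
duplicate keys:

#include <stdio.h>
#include <string.h>

#define MAX_LINE 8192

/* copy the first nkeys tab-separated fields of line into key */
static void
extract_key(const char *line, char *key, int nkeys)
{
    int     tabs = 0;
    size_t  i;

    for (i = 0; line[i] != '\0' && line[i] != '\n'; i++)
    {
        if (line[i] == '\t' && ++tabs == nkeys)
            break;
        key[i] = line[i];
    }
    key[i] = '\0';
}

int
main(void)
{
    char line[MAX_LINE];
    char key[MAX_LINE];
    char prev[MAX_LINE] = "";

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        extract_key(line, key, 2);
        if (strcmp(key, prev) != 0)     /* first line with this key: keep it */
            fputs(line, stdout);
        strcpy(prev, key);
    }
    return 0;
}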
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
I have used Oracle SQLOADER for many years now. It has the ability to
put rejects/discards/bad into an output file and keep on going, maybe
this should be added to the copy command.
COPY [ BINARY ] table [ WITH OIDS ]
FROM { 'filename' | stdin }
[ [USING] DELIMITERS 'delimiter' ]
[ WITH NULL AS 'null string' ]
[ DISCARDS 'filename' ]
what do you think???
Peter Eisentraut writes:
However, it seems to me that COPY ignoring duplicates can easily be
done by preprocessing the input file.
Or by post-processing, like (error checking cut):
void import_shots(char *impfile, int lineshoot_id)
{
    char tab_name[128];
    char tab_temp[128];
    EXEC SQL BEGIN DECLARE SECTION;
    char cmd[256];              /* host variable for the dynamic SQL */
    EXEC SQL END DECLARE SECTION;

    frig_file(impfile); /* add the postgres header */
    sprintf(tab_name, "shot_%d", lineshoot_id);
    sprintf(tab_temp, "shot_%d_tmp", lineshoot_id);

    /* stage the bulk load into a temporary table with the same layout */
    sprintf(cmd, "CREATE TEMPORARY TABLE %s AS SELECT * FROM shot",
            tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;
    EXEC SQL COMMIT WORK; /* will not work without commit here! */

    sprintf(cmd, "COPY BINARY %s FROM '%s'", tab_temp, impfile);
    append_page_alloc(cmd, tab_name, impfile, 1);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    /* de-duplicate on the way into the real table */
    sprintf(cmd, "INSERT INTO %s SELECT DISTINCT ON(shot_time) * FROM %s",
            tab_name, tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;

    sprintf(cmd, "DROP TABLE %s", tab_temp);
    EXEC SQL EXECUTE IMMEDIATE :cmd;
    EXEC SQL COMMIT WORK;
    remove(impfile);
}
However this is adding significant time to the import
operation. Likewise I could loop round the input file first and hunt
for duplicates, again with a performance hit.
My main point is that Postgres can easily and quickly check for
duplicates during the COPY (as it does currently) and it adds zero
execution time to simply ignore these duplicate rows. Obviously this
is a useful feature otherwise Oracle, Ingres and other commercial
relational databases wouldn't feature similar functionality.
Yes, in an ideal world the input to COPY should be clean and
consistent with defined indexes. However this is only really the case
when COPY is used for database/table backup and restore. It misses the
point that a major use of COPY is in speed optimisation on bulk
inserts...
Lee.
Guys,
I've made some inroads towards adding 'ignore duplicates'
functionality to PostgreSQL's COPY command. I've updated the parser
grammar for COPY FROM to now accept:
COPY [ BINARY ] table [ WITH OIDS ]
FROM { 'filename' | stdin }
[ [USING] DELIMITERS 'delimiter' ]
[ WITH [NULL AS 'null string']
[IGNORE DUPLICATES] ]
and added code to propagate this setting down to the CopyFrom function
in backend/commands/copy.c.
I also played around with _bt_check_unique, _bt_doinsert and btinsert
to return NULL on duplicate rather than elog(ERROR). Likewise
ExecInsertIndexTuples and index_insert were passed the
ignore_duplicate flag and index_insert changed to elog(ERROR) if the
return from the insert function was NULL and ignore_duplicate flag was
false.
These changes worked and gave the desired result for the COPY FROM
command, however as many mentioned these changes are far too low
level... After assessing the situation more fully, I believe the
following change in CopyFrom would be more suitable:
/* BEFORE ROW INSERT Triggers */
if (resultRelInfo->ri_TrigDesc &&
resultRelInfo->ri_TrigDesc->n_before_row[TRIGGER_EVENT_INSERT] > 0)
{
HeapTuple newtuple;
newtuple = ExecBRInsertTriggers(estate, resultRelInfo, tuple);
if (newtuple == NULL) /* "do nothing" */
skip_tuple = true;
else if (newtuple != tuple) /* modified by Trigger(s) */
{
heap_freetuple(tuple);
tuple = newtuple;
}
}
/* new code */
if( ignore_duplicates == true )
{
if( duplicate index value )
skip_tuple = true;
}
if (!skip_tuple)
{
Now I imagine 'duplicate index value' would be functionally similar to
_bt_check_unique but obviously higher level. Is there any existing
code with the functionality I desire? Can anyone point me in the right
way...
Thanks,
Lee Kindness.
Lee Kindness writes:
<snip>
I've made some inroads towards adding 'ignore duplicates'
functionality to PostgreSQL's COPY command. I've updated the parser
grammar for COPY FROM to now accept:
COPY [ BINARY ] table [ WITH OIDS ]
FROM { 'filename' | stdin }
[ [USING] DELIMITERS 'delimiter' ]
[ WITH [NULL AS 'null string']
[IGNORE DUPLICATES] ]
Is there any possibility that COPY could insert a row so the old row
will be discarded and a new one inserted instead of the old? It could be
useful for doing replication on tables with modified rows.
Dan
Gents,
I started quite a long thread about this back in September. To
summarise I was proposing that COPY FROM would not abort the
transaction when it encountered data which would cause a uniqueness
violation on the table index(s).
Generally I think this was seen as a 'Good Thing'TM for a number of
reasons:
1. Performance enhancements when doing bulk inserts - pre or
post processing the data to remove duplicates is very time
consuming. Likewise the best tool should always be used for the job at
hand, and for searching/removing things it's a database.
2. Feature parity with other database systems. For example Oracle's
SQLOADER has a feature to not insert duplicates and rather move
them to another file for later investigation.
Naturally the default behaviour would be the current one of assuming
valid data. Also the duplicate check would not add anything to the
current code path for COPY FROM - it would not take any longer.
I attempted to add this functionality to PostgreSQL myself but got as
far as an updated parser and a COPY FROM which resulted in a database
recovery!
So (here's the question finally) is it worthwhile adding this
enhancement to the TODO list?
Thanks, Lee.
--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
Tom Lane writes:
I'm especially not pleased at the notion of removing an error check
that's always been there because someone else thinks that would make it
more convenient for his application.
Please, don't get me wrong - I don't want to come across arrogant. I'm
simply trying to improve the 'COPY FROM' command in a situation where
speed is a critical issue and the data is dirty... And that must be a
relatively common scenario in industry.
Isn't that when you do your bulk copy into a holding table, then
clean it up, and then insert into your live system?
Patrick
Patrick Welche writes:
On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
Please, don't get me wrong - I don't want to come across arrogant. I'm
simply trying to improve the 'COPY FROM' command in a situation where
speed is a critical issue and the data is dirty... And that must be a
relatively common scenario.
Isn't that when you do your bulk copy into a holding table, then
clean it up, and then insert into your live system?
That's what I'm currently doing as a workaround - a SELECT DISTINCT
from a temporary table into the real table with the unique index on
it. However this takes absolute ages - say 5 seconds for the copy
(which is the ballpark figure I'm aiming toward and can achieve with
Ingres) plus another 30ish seconds for the SELECT DISTINCT.
The majority of database systems out there handle this situation in
one manner or another (MySQL ignores or replaces; Ingres ignores;
Oracle ignores or logs; others...). Indeed PostgreSQL currently checks
for duplicates in the COPY code but throws an elog(ERROR) rather than
ignoring the row, or passing the error back up the call chain.
My use of PostgreSQL is very time critical, and sadly this issue alone
may force an evaluation of Oracle's performance in this respect!
Best regards, Lee Kindness.
--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Lee Kindness wrote:
Patrick Welche writes:
On Mon, Oct 01, 2001 at 03:17:43PM +0100, Lee Kindness wrote:
Please, don't get me wrong - I don't want to come across arrogant. I'm
simply trying to improve the 'COPY FROM' command in a situation where
speed is a critical issue and the data is dirty... And that must be a
relatively common scenario.
Isn't that when you do your bulk copy into a holding table, then
clean it up, and then insert into your live system?
That's what I'm currently doing as a workaround - a SELECT DISTINCT
from a temporary table into the real table with the unique index on
it. However this takes absolute ages - say 5 seconds for the copy
(which is the ballpark figure I'm aiming toward and can achieve with
Ingres) plus another 30ish seconds for the SELECT DISTINCT.
The majority of database systems out there handle this situation in
one manner or another (MySQL ignores or replaces; Ingres ignores;
Oracle ignores or logs; others...). Indeed PostgreSQL currently checks
for duplicates in the COPY code but throws an elog(ERROR) rather than
ignoring the row, or passing the error back up the call chain.
I guess postgresql will be able to do it once savepoints get
implemented.
My use of PostgreSQL is very time critical, and sadly this issue alone
may force an evaluation of Oracle's performance in this respect!
Can't you clean the duplicates _outside_ postgresql, say
cat dumpfile | sort | uniq | psql db -c 'copy mytable from stdin'
with your version of uniq.
or perhaps
psql db -c 'copy mytable to stdout' >> dumpfile
sort dumpfile | uniq | psql db -c 'copy mytable from stdin'
if you already have something in mytable.
------------
Hannu
Hannu Krosing writes:
Lee Kindness wrote:
The majority of database systems out there handle this situation in
one manner or another (MySQL ignores or replaces; Ingres ignores;
Oracle ignores or logs; others...). Indeed PostgreSQL currently checks
for duplicates in the COPY code but throws an elog(ERROR) rather than
ignoring the row, or passing the error back up the call chain.
I guess postgresql will be able to do it once savepoints get
implemented.
This is encouraging to hear. I can see how this would make the code
changes relatively minimal and more manageable - the changes to the
current code are simply over my head!
Are savepoints relatively high up on the TODO list, once 7.2 is out the
door?
My use of PostgreSQL is very time critical, and sadly this issue alone
may force an evaluation of Oracle's performance in this respect!
Can't you clean the duplicates _outside_ postgresql, say
cat dumpfile | sort | uniq | psql db -c 'copy mytable from stdin'
This is certainly a possibility, however it's just really moving the
processing elsewhere. The combined time is still around the same.
I've/we've done a lot of investigation with approaches like this and
also with techniques assuming the locality of the duplicates (which is
a no-goer). None improve the situation.
I'm not going to compare the time of just using INSERTs rather than
COPY...
Thanks for your response, Lee Kindness.
--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Lee Kindness writes:
I'm not going to compare the time of just using INSERTs rather than
COPY...
Ooops, I'm NOW going to... Obviously my subconscious is telling me
otherwise - bring on the Christmas party!
Lee.
--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
That's what I'm currently doing as a workaround - a SELECT DISTINCT
from a temporary table into the real table with the unique index on
it. However this takes absolute ages - say 5 seconds for the copy
(which is the ballpark figure I aiming toward and can achieve with
Ingres) plus another 30ish seconds for the SELECT DISTINCT.
Then your column really isn't unique, so how about dropping the unique index,
import the data, fix the duplicates, recreate the unique index - just as
another possible work around ;)
Patrick
Patrick Welche writes:
On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
That's what I'm currently doing as a workaround - a SELECT DISTINCT
from a temporary table into the real table with the unique index on
it. However this takes absolute ages - say 5 seconds for the copy
(which is the ballpark figure I aiming toward and can achieve with
Ingres) plus another 30ish seconds for the SELECT DISTINCT.
Then your column really isn't unique,
That's another discussion entirely ;) - it's spat out by a real-time
system which doesn't have the time or resources to check this. Further
precision loss later in the data's life adds more duplicates...
so how about dropping the unique index, import the data, fix the
duplicates, recreate the unique index - just as another possible
work around ;)
This is just going to be the same(ish) time, no?
CREATE TABLE tab (p1 INT, p2 INT, other1 INT, other2 INT);
COPY tab FROM 'file';
DELETE FROM tab WHERE p1, p2 NOT IN (SELECT DISTINCT p1, p2
FROM tab);
CREATE UNIQUE INDEX tab_idx ON tab USING BTREE(p1, p2);
or am I missing something?
Thanks, Lee.
--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
On Thu, Dec 13, 2001 at 03:44:31PM +0000, Lee Kindness wrote:
Patrick Welche writes:
On Thu, Dec 13, 2001 at 01:25:11PM +0000, Lee Kindness wrote:
That's what I'm currently doing as a workaround - a SELECT DISTINCT
from a temporary table into the real table with the unique index on
it. However this takes absolute ages - say 5 seconds for the copy
(which is the ballpark figure I aiming toward and can achieve with
Ingres) plus another 30ish seconds for the SELECT DISTINCT.
Then your column really isn't unique,
That's another discussion entirely ;) - it's spat out by a real-time
system which doesn't have the time or resources to check this. Further
precision loss later in the data's life adds more duplicates...
Hmm, the data has a later life - sounds like you'll need to remove dups
then, anyway, so can you get away with just letting the dups in? Remove
the UNIQUE requirement, and let the real time system just dump away.
How critical is it to later steps that there be no dups? And how many
(potential) dups is your RTS producing, anyway?
Your later processing (which apparently can _generate_ dups) might be
the out-of-the-critical-time-path place to worry about removing dups.
Ross
P.S. This falls into the class of problem solving characterized by
"if you can't solve the problem as stated, restate the problem to be
one you _can_ solve" ;-)
Lee Kindness writes:
1. Performance enhancements when doing bulk inserts - pre or
post processing the data to remove duplicates is very time
consuming. Likewise the best tool should always be used for the job at
hand, and for searching/removing things it's a database.
Arguably, a better tool for this is sort(1). For instance, if you have a
typical copy input file with tab-separated fields and the primary key is
in columns 1 and 2, you can remove duplicates with
sort -k 1,2 -u INFILE > OUTFILE
To get a record of what duplicates were removed, use diff.
--
Peter Eisentraut peter_e@gmx.net
Lee Kindness writes:
Yes, in an ideal world the input to COPY should be clean and
consistent with defined indexes. However this is only really the case
when COPY is used for database/table backup and restore. It misses the
point that a major use of COPY is in speed optimisation on bulk
inserts...
I think allowing this feature would open up a world of new dangerous
ideas, such as ignoring check constraints or foreign keys or magically
massaging other tables so that the foreign keys are satisfied, or ignoring
default values, or whatever. The next step would then be allowing the
same optimizations in INSERT. I feel COPY should load the data and that's
it. If you don't like the data you have then you have to fix it first.
--
Peter Eisentraut peter_e@gmx.net
O.K., time to start looking into the _nature_ of the dups in your
data, to see if there's anything specific to take advantage of, since
the general solution (tell the DBMS to ignore dups) isn't available,
and isn't likely to get there real soon.
So what does your data look like, and how do the dups occur?
Any chance it's in a really simple format, and the dups are also really
simple, like 'one record per line, dups occur as identical adjacent
lines?' if so, 'uniq' will solve the problem with little to no speed
penalty. (it's the sort that kills ...)
Or are you only getting a dup'ed field, and the rule is 'ignore later
records'? I could see this happen if the data is timestamped at a
granularity that doesn't _exactly_ match the repetition rate: e.g.
stamp to the second, record once a second.
So, what's it look like? Since it's one format, I bet a small, simple
pipe filter could handle dup elimination on the fly.
Ross
On Thu, Dec 13, 2001 at 05:02:15PM +0000, Lee Kindness wrote:
The RTS outputs to a file which is then subsequently used as input to
other packages, one of which is the application i'm concerned
with. While fixing at source is the ideal solution there are terabytes
of legacy data around (this is raw seismic navigational data). Also
there are more than one competing packages...
Our package post-processes (we're still very concerned about speed as
this is normally done while 'shooting' the seismic data) this data to
produce the final seismic navigational data, which is then later used
by other products...
The problem at hand is importing the initial data - no duplicates are
produced by the program itself later (nor in its output data).
Sadly a large number of later SQL queries assume no duplicates and
would result in incorrect processing calculations, amongst other
things. The sheer number of these queries makes changing them
impractical.
P.S. This falls into the class of problem solving characterized by
"if you can't solve the problem as stated, restate the problem to be
one you _can_ solve" ;)
Which is what I've been knocking my head against for the last few
weeks ;) The real problem is that a move away from our current RDBMS
(Ingres) to PostgreSQL will not happen if the performance of the
product significantly decreases (which it currently has for the import
stage) and since Ingres already just ignores the duplicates...
I really want to move to PostgreSQL...
Thanks for your input,
--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/ +44 131 5575595
Peter Eisentraut writes:
I think allowing this feature would open up a world of new
dangerous ideas, such as ignoring check constraints or foreign keys
or magically massaging other tables so that the foreign keys are
satisfied, or ignoring default values, or whatever. The next step
would then be allowing the same optimizations in INSERT. I feel
COPY should load the data and that's it. If you don't like the
data you have then you have to fix it first.
I agree that PostgreSQL's checks during COPY are a bonus and I
wouldn't dream of not having them. Many database systems provide a
fast bulkload by ignoring these constraints and cross references -
that's a tricky/horrid situation.
However I suppose the question is whether such 'invalid data' should
abort the transaction; it seems a bit drastic...
I suppose I'm not really after an IGNORE DUPLICATES option, but rather
a CONTINUE ON ERROR kind of thing.
Regards, Lee.
I agree with Lee, I also like Oracle's options for a discard file, so
you can look at what was rejected, fix your problem and reload if
necessary just the rejects.
Jim
Jim Buttafuoco writes:
I agree with Lee, I also like Oracle's options for a discard file, so
you can look at what was rejected, fix your problem and reload if
necessary just the rejects.
How do you know which one is the duplicate and which one is the good one?
More likely you will have to fix the entire thing. Anything else would
undermine the general data model except in specific use cases.
--
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut writes:
Jim Buttafuoco writes:
I agree with Lee, I also like Oracle's options for a discard file, so
you can look at what was rejected, fix your problem and reload if
necessary just the rejects.
How do you know which one is the duplicate and which one is the good one?
More likely you will have to fix the entire thing. Anything else would
undermine the general data model except in specific use cases.
In the general case most data is sequential, in which case it would be
normal to assume that the first record is the definitive one. Most
database systems go with this assumption apart from MySQL which gives
the user a choice between IGNORE or UPDATE...
Lee.
Peter Eisentraut writes:
Jim Buttafuoco writes:
I agree with Lee, I also like Oracle's options for a discard file, so
you can look at what was rejected, fix your problem and reload if
necessary just the rejects.
How do you know which one is the duplicate and which one is the good one?
More likely you will have to fix the entire thing. Anything else would
undermine the general data model except in specific use cases.
Consider SELECT DISTINCT - which is the 'duplicate' and which one is
the good one?
Lee.
Lee Kindness writes:
Consider SELECT DISTINCT - which is the 'duplicate' and which one is
the good one?
It's not the same thing. SELECT DISTINCT only eliminates rows that are
completely the same, not only equal in their unique constraints.
Maybe you're thinking of SELECT DISTINCT ON (). Observe the big warning
that the results of that statement are random unless ORDER BY is used. --
But that's not the same thing either. We've never claimed that the COPY
input has an ordering assumption. In fact you're asking for a bit more
than an ordering assumption, you're saying that the earlier data is better
than the later data. I think in a random use case that is more likely
*not* to be the case because the data at the end is newer.
Btw., here's another concern about this proposed feature: If I do a
client-side COPY, how will you send the "ignored" rows back to the client?
--
Peter Eisentraut peter_e@gmx.net
Peter Eisentraut writes:
Lee Kindness writes:
Consider SELECT DISTINCT - which is the 'duplicate' and which one is
the good one?
It's not the same thing. SELECT DISTINCT only eliminates rows that are
completely the same, not only equal in their unique contraints.
Maybe you're thinking of SELECT DISTINCT ON (). Observe the big warning
that the result of that statement are random unless ORDER BY is used. --
But that's not the same thing either. We've never claimed that the COPY
input has an ordering assumption. In fact you're asking for a bit more
than an ordering assumption, you're saying that the earlier data is better
than the later data. I think in a random use case that is more likely
*not* to be the case because the data at the end is newer.
You're right - I was meaning 'SELECT DISTINCT ON ()'. However I'm only
using it as an example of where the database is choosing (be it
randomly) the data to be discarded. While I've said in this thread that
'COPY FROM IGNORE DUPLICATES' would ignore later duplicates I'm not
really that concerned about what it ignores; first, later, random,
... I agree if it was of concern then it should be pre-processed.
Btw., here's another concern about this proposed feature: If I do
a client-side COPY, how will you send the "ignored" rows back to
the client?
Again a number of different ideas have been mixed up in the
discussion. Oracle's logging option was only given as an example of
how other database systems deal with this option - If it wasn't
explicitly given then it's reasonable to discard the extra
information.
What really would be nice in the SQL-world is a standardised COPY
statement...
Best regards, Lee Kindness.
Lee Kindness <lkindness@csl.co.uk> writes:
You're right - I was meaning 'SELECT DISTINCT ON ()'. However I'm only
using it as an example of where the database is choosing (be it
randomly) the data to discarded.
Not a good example to support your argument. The entire point of
DISTINCT ON (imho) is that the rows that are kept or discarded are
*not* random, but can be selected by the user by specifying additional
sort columns. DISTINCT ON would be pretty useless if it weren't for
that flexibility. The corresponding concept in COPY will need to
provide flexible means for deciding which row to keep and which to
drop, else it'll be pretty useless.
regards, tom lane
Tom Lane writes:
Lee Kindness <lkindness@csl.co.uk> writes:
You're right - I was meaning 'SELECT DISTINCT ON ()'. However I'm only
using it as an example of where the database is choosing (be it
randomly) the data to be discarded.
Not a good example to support your argument. The entire point of
DISTINCT ON (imho) is that the rows that are kept or discarded are
*not* random, but can be selected by the user by specifying additional
sort columns. DISTINCT ON would be pretty useless if it weren't for
that flexibility. The corresponding concept in COPY will need to
provide flexible means for deciding which row to keep and which to
drop, else it'll be pretty useless.
At which point it becomes quicker to resort to INSERT...
Here's the crux question - how can I get management to go with
PostgreSQL when a core operation (import of data into a transient
database) is at least 6 times slower than the current version?
With a lot of work investigating the incoming data, the number of
incoming duplicates has been massively reduced by fixing/tackling at
source. However rogue values do still crop up (the data originates
from a real-time system with multiple hardware inputs from multiple
hardware vendors) and when they do (even just 1) the performance dies.
Add to this terabytes of legacy data...
While you may see the option of ignoring duplicates in COPY as 'pretty
useless', it obviously has its place/use otherwise every other
database system wouldn't have support for it! (not that following the
pack is always a good idea)
In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!
Best regards,
--
Lee Kindness, Senior Software Engineer, Concept Systems Limited.
http://services.csl.co.uk/ http://www.csl.co.uk/
Lee Kindness <lkindness@csl.co.uk> writes:
In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!
Of course, the more bells and whistles we add to COPY, the slower it
will get, which rather defeats the purpose no?
regards, tom lane
Tom Lane writes:
Lee Kindness <lkindness@csl.co.uk> writes:
In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!
Of course, the more bells and whistles we add to COPY, the slower it
will get, which rather defeats the purpose no?
Indeed, but as I've mentioned in this thread in the past, the code
path for COPY FROM already does a check against the unique index (if
there is one) but bombs-out rather than handling it...
It wouldn't add any execution time if there were no duplicates in the
input!
regards, Lee.
Lee Kindness wrote:
Tom Lane writes:
Lee Kindness <lkindness@csl.co.uk> writes:
In an ideal world 'COPY FROM' would only be used with data output by
'COPY TO' and it would be nice and sanitised. However in some fields
this often is not a possibility due to performance constraints!
Of course, the more bells and whistles we add to COPY, the slower it
will get, which rather defeats the purpose no?
Indeed, but as I've mentioned in this thread in the past, the code
path for COPY FROM already does a check against the unique index (if
there is one) but bombs-out rather than handling it...
It wouldn't add any execution time if there were no duplicates in the
input!
I know many purists object to allowing COPY to discard invalid rows in
COPY input, but it seems we have lots of requests for this feature, with
few workarounds except pre-processing the flat file. Of course, if they
use INSERT, they will get errors that they can just ignore. I don't see
how allowing errors in COPY is any more illegal, except that COPY is one
command while multiple INSERTs are separate commands.
Seems we need to allow such a capability, if only crudely. I don't
think we can create a discard file because of the problem with remote
COPY.
I think we can allow something like:
COPY FROM '/tmp/x' WITH ERRORS 2
meaning we will allow at most two errors and will report the error line
numbers to the user. I think this syntax clearly indicates that errors
are being accepted in the input. An alternate syntax would allow an
unlimited number of errors:
COPY FROM '/tmp/x' WITH ERRORS
The errors can be non-unique errors, or even CHECK constraint errors.
Unless I hear complaints, I will add it to TODO.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think we can allow something like:
COPY FROM '/tmp/x' WITH ERRORS 2
This is not going to happen, at least not until after there's a
wholesale revision of error handling. As things stand we do not
have a choice: elog(ERROR) must abort the transaction, because we
can't guarantee that things are in good enough shape to continue.
See the archives for previous discussions.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think we can allow something like:
COPY FROM '/tmp/x' WITH ERRORS 2
This is not going to happen, at least not until after there's a
wholesale revision of error handling. As things stand we do not
have a choice: elog(ERROR) must abort the transaction, because we
can't guarantee that things are in good enough shape to continue.
See the archives for previous discussions.
Yes, I realize we need subtransactions or something, but we should add
it to the TODO list if it is a valid request, right?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think we can allow something like:
COPY FROM '/tmp/x' WITH ERRORS 2
Yes, I realize we need subtransactions or something, but we should add
it to the TODO list if it is a valid request, right?
Well, I don't like that particular API in any case. Why would I think
that 2 errors are okay and 3 are not, if I'm loading a
many-thousand-line COPY file? Wouldn't it matter *what* the errors
are, at least as much as how many there are? "Discard duplicate rows"
is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
is not the same animal at all.
As someone already remarked, the correct, useful form of such a feature
is to echo the rejected lines to some sort of output file that I can
look at afterwards. How many errors there are is not the issue.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think we can allow something like:
COPY FROM '/tmp/x' WITH ERRORS 2Yes, I realize we need subtransactions or something, but we should add
it to the TODO list if it is a valid request, right?Well, I don't like that particular API in any case. Why would I think
that 2 errors are okay and 3 are not, if I'm loading a
many-thousand-line COPY file? Wouldn't it matter *what* the errors
I threw the count idea in as a possible compromise. :-)
are, at least as much as how many there are? "Discard duplicate rows"
is one thing, but "ignore bogus data" (eg, unrecognizable timestamps)
is not the same animal at all.
Yes, when we have error codes, it would be nice to specify certain
errors to ignore.
As someone already remarked, the correct, useful form of such a feature
is to echo the rejected lines to some sort of output file that I can
look at afterwards. How many errors there are is not the issue.
How about for TODO:
* Allow COPY to report error lines and continue; requires
nested transactions; optionally allow error codes to be specified
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
How about for TODO:
* Allow COPY to report error lines and continue; requires
nested transactions; optionally allow error codes to be specified
Okay, that seems reasonable.
regards, tom lane
Tom Lane wrote:
How about for TODO:
* Allow COPY to report error lines and continue; requires
nested transactions; optionally allow error codes to be specified
Okay, that seems reasonable.
Good. Now that I think of it, nested transactions don't seem required.
We already allow pg_dump to dump a database using INSERTs, and we don't
put those inserts in a single transaction when we load them:
CREATE TABLE "test" (
CREATE TABLE "test" (
"x" integer
);
INSERT INTO "test" VALUES (1);
INSERT INTO "test" VALUES (2);
Should we be wrapping these INSERTs in a transaction? Can we do COPY
with each row being its own transaction?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
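For comparison, getting all-or-nothing behaviour from such a pg_dump script today means wrapping the INSERTs explicitly, e.g.:
BEGIN;
INSERT INTO "test" VALUES (1);
INSERT INTO "test" VALUES (2);
COMMIT;
-- Without the BEGIN/COMMIT each INSERT commits on its own, so a failure
-- part-way through leaves the earlier rows in place.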
Bruce Momjian wrote:
Tom Lane wrote:
How about for TODO:
* Allow COPY to report error lines and continue; requires
nested transactions; optionally allow error codes to be specified
Okay, that seems reasonable.
Good. Now that I think of it, nested transactions don't seem required.
We already allow pg_dump to dump a database using INSERTs, and we don't
put those inserts in a single transaction when we load them:
CREATE TABLE "test" (
"x" integer
);
INSERT INTO "test" VALUES (1);
INSERT INTO "test" VALUES (2);
Should we be wrapping these INSERTs in a transaction? Can we do COPY
with each row being its own transaction?
OK, added to TODO:
o Allow COPY to report error lines and continue; optionally
allow error codes to be specified
Seems nested transactions are not required if we load each COPY line in
its own transaction, like we do with INSERT from pg_dump.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems nested transactions are not required if we load each COPY line in
its own transaction, like we do with INSERT from pg_dump.
I don't think that's an acceptable answer. Consider
BEGIN;
other stuff;
COPY ....;
other stuff;
ROLLBACK;
regards, tom lane
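Spelled out with the earlier copytest table, the problem with the hypothetical per-row-commit behaviour would look roughly like this (a sketch only):
BEGIN;
DELETE FROM copytest;                -- part of the surrounding transaction
COPY copytest FROM '/tmp/copytest';  -- if each line committed itself,
                                     -- the DELETE would be committed too
ROLLBACK;                            -- and this could no longer undo any of it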
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems nested transactions are not required if we load each COPY line in
its own transaction, like we do with INSERT from pg_dump.
I don't think that's an acceptable answer. Consider
Oh, very good point. "Requires nested transactions" added to TODO.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems nested transactions are not required if we load
each COPY line in its own transaction, like we do with
INSERT from pg_dump.
I don't think that's an acceptable answer. Consider
Oh, very good point. "Requires nested transactions" added to TODO.
Also add the performance issue with per-line commits...
Also, there is a more common name for the required feature: savepoints.
Vadim
Mikheev, Vadim wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems nested transactions are not required if we load
each COPY line in its own transaction, like we do with
INSERT from pg_dump.
I don't think that's an acceptable answer. Consider
Oh, very good point. "Requires nested transactions" added to TODO.
Also add the performance issue with per-line commits...
Also, there is a more common name for the required feature: savepoints.
OK, updated TODO to prefer the savepoints term.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian said:
Mikheev, Vadim wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Seems nested transactions are not required if we load
each COPY line in its own transaction, like we do with
INSERT from pg_dump.
I don't think that's an acceptable answer. Consider
Oh, very good point. "Requires nested transactions" added to TODO.
Also add the performance issue with per-line commits...
Also, there is a more common name for the required feature: savepoints.
OK, updated TODO to prefer the savepoints term.
Now, how about the same functionality for
INSERT into table1 SELECT * from table2 ... WITH ERRORS;
This should allow the insert to complete, even if table1 has unique
indexes and we try to insert duplicate rows. It might save LOTS of time
in bulkloading scripts by not having to do single INSERTs.
Guess all this will be available in 7.3?
Daniel
Now, how about the same functionality for
INSERT into table1 SELECT * from table2 ... WITH ERRORS;
This should allow the insert to complete, even if table1 has unique
indexes and we try to insert duplicate rows. It might save LOTS of time
in bulkloading scripts by not having to do single INSERTs.
1. I prefer Oracle's way (and others', I believe): put the statement(s)
in a PL block and define what actions should be taken for which
exceptions (errors), i.e. IGNORE for a NON_UNIQ_KEY error, etc.
2. For an INSERT ... SELECT statement one can put DISTINCT in the
SELECT's target list.
Guess all this will be available in 7.3?
We'll see.
Vadim
"Vadim Mikheev" said:
1. I prefer Oracle's way (and others', I believe): put the statement(s)
in a PL block and define what actions should be taken for which
exceptions (errors), i.e. IGNORE for a NON_UNIQ_KEY error, etc.
Some people prefer 'pure' SQL. Anyway, it can be argued which is worse -
the usage of a non-SQL language, or the usage of an extended SQL
language. I guess the SQL standard does not provide for such
functionality?
2. For an INSERT ... SELECT statement one can put DISTINCT in the
SELECT's target list.
With this construct, you are effectively copying rows from one table to
another - or constructing rows from various sources (constants, other
tables etc) and inserting them into the table. If the target table has
unique indexes (or constraints), and some of the rows returned by the
SELECT violate the restrictions, you are supposed to get errors - and
unfortunately the entire INSERT is aborted. I fail to see how DISTINCT
can help here... Perhaps it is possible to include a check for already
existing tuples in the destination table in the SELECT, but this will
significantly increase the runtime, especially when the destination
table is huge.
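A sketch of the two approaches just mentioned, assuming table1 and table2 share a unique key column k (the column name is made up for illustration):
-- DISTINCT only removes duplicates within the SELECT's own result:
INSERT INTO table1 SELECT DISTINCT * FROM table2;
-- Skipping rows that already exist in the destination needs an extra
-- check against table1, which is the costly part on a huge target:
INSERT INTO table1
SELECT t2.* FROM table2 t2
WHERE NOT EXISTS (SELECT 1 FROM table1 t1 WHERE t1.k = t2.k);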
My idea is to let this INSERT statement insert as many of its rows as
possible, perhaps returning NOTICEs or ignoring the errors (with an
IGNORE ERRORS syntax, for example :)
I believe all this functionality will have to consider the syntax first.
Daniel
Vadim Mikheev wrote:
Now, how about the same functionality for
INSERT into table1 SELECT * from table2 ... WITH ERRORS;
This should allow the insert to complete, even if table1 has unique
indexes and we try to insert duplicate rows. It might save LOTS of time
in bulkloading scripts by not having to do single INSERTs.
1. I prefer Oracle's way (and others', I believe): put the statement(s)
in a PL block and define what actions should be taken for which
exceptions (errors), i.e. IGNORE for a NON_UNIQ_KEY error, etc.
Added to TODO:
* Allow command blocks that can ignore certain types of errors
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
1. I prefer Oracle's way (and others', I believe): put the statement(s)
in a PL block and define what actions should be taken for which
exceptions (errors), i.e. IGNORE for a NON_UNIQ_KEY error, etc.
Some people prefer 'pure' SQL. Anyway, it can be argued which is worse -
the usage of a non-SQL language, or the usage of an extended SQL
language. I guess the SQL standard does not provide for such
functionality?
Yes, there is no such syntax in the standard. And IMHO, when a feature
is not in the standard it's better to implement it the way others do
(for as much compatibility as possible).
2. For an INSERT ... SELECT statement one can put DISTINCT in the
SELECT's target list.
With this construct, you are effectively copying rows from one table to
another - or constructing rows from various sources (constants, other
tables etc) and inserting them into the table. If the target table has
unique indexes (or constraints), and some of the rows returned by the
SELECT violate
Sorry, I didn't consider this case, you're right.
I believe all this functionality will have to consider the syntax first.
All this functionality will have to consider the savepoints
implementation first. As for syntax, we could implement both.
Vadim
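For reference, the per-row recovery that savepoints would make possible looks roughly like this at the SQL level; COPY would have to do the equivalent internally for each input line (a sketch of the semantics only, using the standard SAVEPOINT syntax and the copytest table from earlier):
BEGIN;
SAVEPOINT copy_row;
INSERT INTO copytest VALUES (4, 4);   -- duplicate key, this row fails
ROLLBACK TO SAVEPOINT copy_row;       -- undo only the failed row
INSERT INTO copytest VALUES (5, 5);   -- later rows still load
COMMIT;                               -- the surrounding transaction survives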