Bug in copy
Hello
I ran this command:
copy "Parts" ("Id","Title") from stdin with (format csv, delimiter ",",
on_error ignore)
But I receive this error:
duplicate key value violates unique constraint "PartsUniqueLocaleTitle"
This means that the on_error setting is not working. When I try to insert a
million records, this becomes extremely annoying and counterproductive.
When we set on_error to ignore, any type of error (data type mismatches,
check constraint violations, foreign key violations, and so on) should be
ignored, and Postgres should move on to the next record instead of failing
the entire bulk operation.
Regards
Saeed Nemati
On Fri, Feb 7, 2025 at 10:28 PM me nefcanto <sn.1361@gmail.com> wrote:
While that would be a possible definition to implement, what is actually
implemented is:
"Specifies how to behave when encountering an error converting a column's
input value into its data type."
Thus the observed behavior is not a bug.
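For example (a sketch against a hypothetical table; on_error requires PostgreSQL 17 or later):

```sql
-- Hypothetical demo table
CREATE TABLE parts_demo (id int PRIMARY KEY, title text);

-- A row whose id fails integer conversion (e.g. "abc,Widget")
-- is skipped: that is the error class on_error covers.
COPY parts_demo (id, title) FROM stdin WITH (format csv, on_error ignore);

-- A duplicate key is not a conversion error, so even with
-- on_error ignore this still aborts with a unique violation
-- if a row with the same id already exists in the table.
COPY parts_demo (id, title) FROM stdin WITH (format csv, on_error ignore);
```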
David J.
On Feb 8, 2025 at 13:28 +0800, me nefcanto <sn.1361@gmail.com>, wrote:
Hi,
As my understanding, on_error is designed to handle errors during data type conversions in PostgreSQL, similar to what we do in Greenplum or Cloudberry.
Since these rows are valid, on_error doesn’t raise any concerns.
--
Zhang Mingli
HashData
Hi, thank you for the response. Analyzed semantically, it should have been
named on_type_error or something similar. But what matters is the problem
at hand. Inserting a million records without all-or-nothing failure is a
requirement. What options do we have for that?
On Sat, Feb 8, 2025 at 9:22 AM Zhang Mingli <zmlpostgres@gmail.com> wrote:
On Friday, February 7, 2025, me nefcanto <sn.1361@gmail.com> wrote:
In core, nothing really. As you can see, we are just now adding this kind
of thing piece by piece. INSERT has the "no duplicates" case solved but not
the other issues, while COPY at present only handles malformed data. The
name is general so that future errors to be ignored can be added without
needing a separate option for every single one of them.
You might consider searching the internet for solutions to this
long-standing need.
The one project that comes to mind, but I’ve never used, is pg_bulkload.
https://github.com/ossc-db/pg_bulkload
In any case, this is now a discussion better suited for the -general
mailing list.
David J.
On Sat, 2025-02-08 at 09:31 +0330, me nefcanto wrote:
Inserting a million records not in an all-or-fail is a requirement. What options do we have for that?
Use COPY to load the data into a new (temporary?) table.
Then use INSERT INTO ... SELECT ... ON CONFLICT ... or MERGE to merge
the data from that table to the actual destination.
COPY is not a full-fledged ETL tool.
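A minimal sketch of that approach (column names taken from this thread; the staging table name is made up):

```sql
-- Stage the raw data in a table with no constraints
CREATE TEMP TABLE parts_stage (id int, title text);
COPY parts_stage (id, title) FROM stdin WITH (format csv);

-- Merge into the real table, silently skipping any row that
-- would violate a unique constraint
INSERT INTO "Parts" ("Id", "Title")
SELECT id, title FROM parts_stage
ON CONFLICT DO NOTHING;
```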
Yours,
Laurenz Albe
@laurenz if I use `insert into` or `merge`, would I be able to bypass
records with errors? Or would I fail there too? I mean, there are lots of
ways a record can be rejected: unique indexes, check constraints, foreign
key constraints, etc. What happens in those cases?
And why not fix "on_error ignore" in the first place? Maybe that would be
the simpler way. I don't know the internals of bulk insertion, but if at
some point it has a loop in it, then it should be much simpler to catch
errors in that loop.
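For example, something like this per-row loop with exception handling is what I have in mind (a rough sketch, assuming the data has first been loaded into a hypothetical staging table parts_stage):

```sql
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT * FROM parts_stage LOOP
        BEGIN
            INSERT INTO "Parts" ("Id", "Title") VALUES (r.id, r.title);
        EXCEPTION WHEN OTHERS THEN
            -- any failed row is reported and skipped
            RAISE NOTICE 'skipped row %: %', r.id, SQLERRM;
        END;
    END LOOP;
END;
$$;
```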
Regards
Saeed
On Sun, Feb 9, 2025 at 9:32 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
@David, I saw pg_bulkload. Amazing performance. But that's a command-line
tool. I need to insert bulk data from my Node.js app, via code.
On Sun, Feb 9, 2025 at 4:00 PM me nefcanto <sn.1361@gmail.com> wrote:
On Sunday, February 9, 2025, me nefcanto <sn.1361@gmail.com> wrote:
That seems like an arbitrary limitation to impose on yourself. Given that,
there is no solution that will satisfy you that I am aware of.
David J.
On Sun, 2025-02-09 at 16:00 +0330, me nefcanto wrote:
@laurenz if I use `insert into` or the `merge` would I be able to bypass records
with errors? Or would I fail there too? I mean there are lots of ways a record
can be limited. Unique indexes, check constraints, foreign key constraints, etc.
What happens in those cases?
With INSERT ... ON CONFLICT, you can only handle primary key and unique
constraint violations.
MERGE allows some more freedom, but it also only checks for rows that
match existing rows.
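For example, a MERGE along these lines (a sketch; MERGE requires PostgreSQL 15 or later, and parts_stage is a hypothetical staging table) updates matching rows and inserts the rest, but it cannot trap check or foreign key violations:

```sql
MERGE INTO "Parts" AS p
USING parts_stage AS s
   ON p."Id" = s.id
WHEN MATCHED THEN
    UPDATE SET "Title" = s.title
WHEN NOT MATCHED THEN
    INSERT ("Id", "Title") VALUES (s.id, s.title);
```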
You won't find a command that ignores or handles arbitrary kinds of errors.
You have to figure out what kinds of errors you expect and handle them explicitly
by running queries against the data.
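For instance, anticipated violations can be filtered out with ordinary queries before the insert (a sketch; the staging table and the rules being checked are illustrative):

```sql
INSERT INTO "Parts" ("Id", "Title")
SELECT s.id, s.title
FROM parts_stage AS s
WHERE s.title IS NOT NULL            -- anticipate a NOT NULL / CHECK rule
  AND NOT EXISTS (                   -- anticipate the unique title constraint
        SELECT 1 FROM "Parts" p WHERE p."Title" = s.title);
```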
I don't think a catch-all handler for every kind of error would be very
useful. Normally there are certain errors you want to tolerate, while
others should be considered unrecoverable and abort the load.
Yours,
Laurenz Albe
Sorry for the long delay.
Let's analyze the scenario of fake data insertion. I want to create a
million fake products, sometimes even 100 million (we're on MariaDB now and
plan to migrate to Postgres). My team uses fake data for performance tests
and other use cases. There is literally no way to sanitize those records
up front.
Another scenario is translations. Even in production we have translation
files for more than 20 languages and more than two thousand keys. That
means we need to insert 40 thousand translation records into production.
Another scenario is updating nested-model values for a large hierarchical
table, for example the categories table. Any time a user changes a record
in that table, we need to recalculate the nested model for the entire
category tree and bulk-update the results.
The point is, the database schema is not in our hands. We don't know what
rules exist on each table or when those rules change. It's neither
practical nor feasible to spend resources keeping our bulk-insertion logic
in sync with database changes.
It would be good design for Postgres to add a catch-all handler for each
row and report accordingly. Give it a million records, and it should give
you back a million results.
Is there a problem in implementing this? After all, one expects the most
advanced open-source database to support this real-world requirement.
Regards
Saeed
On Sun, Feb 9, 2025 at 8:09 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Mon, Feb 24, 2025 at 9:09 AM me nefcanto <sn.1361@gmail.com> wrote:
The point is, the database schema is not in our hands. We don't know what
rules exist on each table and what rules change. And it's not practical and
feasible to spend resources on keeping our bulk insertion logic with the
database changes.
That's a company problem, and not one that can be solved by changing the
way COPY works.
Is there a problem in implementing this? After all one expects the most
advanced open source database to support this real-world requirement.
We're not going to change Postgres into an ETL application. This thread has
given you numerous Postgres-specific solutions, but there is also no
shortage of ETL applications you can try out.
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
I didn't understand how adding a catch-all handler turns Postgres into an
ETL application. And about all those numerous Postgres-specific solutions,
can you please enumerate them? Because there are not numerous options, just
a couple, and none of them works. Inserting the data into a table without
constraints and then performing a merge won't work, because MERGE also has
no catch-all handler. What other solutions have been proposed? pg_bulkload?
So basically two solutions have been suggested, MERGE and pg_bulkload. One
does not work (it just handles more kinds of errors and still fails on a
single error), and the other is a command-line utility.
I also did a simple research using AI and came up with this table:
RDBMS System - Row-Level Error Handling for Bulk Copy
Oracle Database - Yes (e.g., SQL*Loader can direct bad records)
Microsoft SQL Server - No (BULK INSERT generally aborts on row errors)
MySQL - Yes (using LOAD DATA with IGNORE options)
PostgreSQL - No (COPY stops on error without workarounds)
IBM DB2 - Yes (LOAD utility supports error logging)
Teradata - Yes (has options for capturing reject rows)
SAP HANA - No (bulk load typically aborts on errors)
MariaDB - Yes (similar to MySQL with error-handling options)
SQLite - No (no dedicated bulk load command with error handling)
Amazon Redshift - Yes (COPY command allows error tolerance settings)
Google BigQuery - Yes (load jobs can be configured to skip bad rows)
Netezza - Yes (supports reject files for bulk loads)
Snowflake - Yes (COPY INTO offers error handling parameters)
Vertica - Yes (COPY command logs errors and continues)
Ten RDBMSs support row-level error handling in bulk loads, and only four
do not. This means that the majority have found it to be a useful feature.
Greg, may I ask what your argument is against having a catch-all error
handler in bulk operations? Because the ETL argument is not a valid one.
As I mentioned previously, none of my real-world requirements were related
to ETL.
Regards
Saeed
On Mon, Feb 24, 2025 at 7:25 PM Greg Sabino Mullane <htamfids@gmail.com> wrote: