Bug in copy

Started by me nefcanto · about 1 year ago · 14 messages · bugs
#1me nefcanto
sn.1361@gmail.com

Hello

I run this command:

copy "Parts" ("Id","Title") from stdin with (format csv, delimiter ",",
on_error ignore)

But I receive this error:

duplicate key value violates unique constraint "PartsUniqueLocaleTitle"

This means that the on_error setting is not working. When I try to insert a
million records, this becomes extremely annoying and counterproductive.

When we specify that on_error should be ignored, any type of error
including data type inconsistency, check constraint inconsistency, foreign
key inconsistency, etc. should be ignored and Postgres should move to the
next record and not fail the entire bulk operation.

Regards
Saeed Nemati

#2David G. Johnston
david.g.johnston@gmail.com
In reply to: me nefcanto (#1)
Re: Bug in copy

While that would be a possible definition to implement, what is actually
implemented is:

"Specifies how to behave when encountering an error converting a column's
input value into its data type."

Thus the observed behavior is not a bug.
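A minimal sketch of the distinction, using a hypothetical table (requires
PostgreSQL 17, where on_error was introduced):

```sql
-- Hypothetical demo table.
CREATE TABLE parts_demo (id int PRIMARY KEY, title text);

-- A value that cannot be converted to the column's type is skipped:
COPY parts_demo (id, title) FROM stdin WITH (FORMAT csv, ON_ERROR ignore);
1,first
oops,not-an-int
2,second
\.
-- PostgreSQL reports a NOTICE that one row was skipped; rows 1 and 2 load.

-- A duplicate key is a constraint violation, not a conversion error,
-- so this COPY still fails as a whole despite on_error ignore:
COPY parts_demo (id, title) FROM stdin WITH (FORMAT csv, ON_ERROR ignore);
1,duplicate
\.
-- ERROR:  duplicate key value violates unique constraint "parts_demo_pkey"
```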

David J.

#3Zhang Mingli
zmlpostgres@gmail.com
In reply to: me nefcanto (#1)
Re: Bug in copy

Hi,

As I understand it, on_error is designed to handle errors during data type conversion in PostgreSQL, similar to what we do in Greenplum or Cloudberry.
Since these rows convert to their target types successfully, on_error does not apply to them.

--
Zhang Mingli
HashData

#4me nefcanto
sn.1361@gmail.com
In reply to: Zhang Mingli (#3)
Re: Bug in copy

Hi, thank you for the response. If we analyze it semantically, it should
have been named on_type_error or something similar. But what matters is the
problem at hand. Inserting a million records in a way that is not
all-or-fail is a requirement. What options do we have for that?

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: me nefcanto (#4)
Re: Bug in copy

On Friday, February 7, 2025, me nefcanto <sn.1361@gmail.com> wrote:

Hi, thank you for the response. If we analyze semantically, it had to be
on_type_error or something. But what matters is the problem at hand.
Inserting a million records not in an all-or-fail is a requirement. What
options do we have for that?

In core, nothing really. As you can see, we are just now adding this kind
of thing piece by piece. INSERT has the "no duplicates" case solved but not
other issues, while COPY currently solves for malformed data. The name is
general so that future errors to be ignored can be added without needing an
option for every single one of them.

You might consider searching the internet for solutions to this
long-standing need.

The one project that comes to mind, but I’ve never used, is pg_bulkload.

https://github.com/ossc-db/pg_bulkload

In any case, this is now a discussion better suited for the -general
mailing list.

David J.

#6Laurenz Albe
laurenz.albe@cybertec.at
In reply to: me nefcanto (#4)
Re: Bug in copy

On Sat, 2025-02-08 at 09:31 +0330, me nefcanto wrote:

Inserting a million records not in an all-or-fail is a requirement. What options do we have for that?

Use COPY to load the data into a new (temporary?) table.
Then use INSERT INTO ... SELECT ... ON CONFLICT ... or MERGE to merge
the data from that table to the actual destination.

COPY is not a full-fledged ETL tool.
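A sketch of that pattern, reusing the table from the original report (the
staging-table name is hypothetical):

```sql
-- 1. COPY into an unconstrained staging table; no unique or foreign
--    key violations are possible here.
CREATE TEMP TABLE parts_staging ("Id" int, "Title" text);
COPY parts_staging ("Id", "Title") FROM stdin WITH (FORMAT csv);

-- 2. Move the staged rows into the real table, silently skipping any
--    row that would collide with an existing unique key.
INSERT INTO "Parts" ("Id", "Title")
SELECT "Id", "Title" FROM parts_staging
ON CONFLICT DO NOTHING;
```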

Yours,
Laurenz Albe

#7me nefcanto
sn.1361@gmail.com
In reply to: Laurenz Albe (#6)
Re: Bug in copy

@laurenz, if I use `insert into` or `merge`, would I be able to bypass
records with errors? Or would I fail there too? I mean, there are lots of
ways a record can be rejected: unique indexes, check constraints, foreign
key constraints, etc. What happens in those cases?

And why not fix "on_error ignore" in the first place? Maybe that would be a
simpler way. I don't know the internals of bulk insertion, but if at some
point there is a loop in it, then it would be much simpler to catch errors
in that loop.

Regards
Saeed

#8me nefcanto
sn.1361@gmail.com
In reply to: me nefcanto (#7)
Re: Bug in copy

@David, I saw that pg_bulkload. Amazing performance. But that's a command
line tool. I need to insert bulk data in my Node.js app, via code.

#9David G. Johnston
david.g.johnston@gmail.com
In reply to: me nefcanto (#8)
Re: Bug in copy

On Sunday, February 9, 2025, me nefcanto <sn.1361@gmail.com> wrote:

@David, I saw that pg_bulkload. Amazing performance. But that's a command
line tool. I need to insert bulk data in my Node.js app, via code.

That seems like an arbitrary limitation to impose on yourself. Given that,
there is no solution that will satisfy you that I am aware of.

David J.

#10Laurenz Albe
laurenz.albe@cybertec.at
In reply to: me nefcanto (#7)
Re: Bug in copy

On Sun, 2025-02-09 at 16:00 +0330, me nefcanto wrote:

@laurenz if I use `insert into` or the `merge` would I be able to bypass records
with errors? Or would I fail there too? I mean there are lots of ways a record
can be limited. Unique indexes, check constraints, foreign key constraints, etc.
What happens in those cases?

With INSERT ... ON CONFLICT, you can only handle primary key and unique
constraint violations.
MERGE allows some more freedom, but it also only checks for rows that match
existing rows.

You won't find a command that ignores or handles arbitrary kinds of errors.
You have to figure out what kinds of errors you expect and handle them explicitly
by running queries against the data.
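For example, against a staging table (the names and the foreign key here
are hypothetical), such explicit checks could look like:

```sql
-- Staged rows that would violate a hypothetical foreign key on
-- category_id; inspect or delete them before the final insert:
SELECT s.*
FROM parts_staging s
LEFT JOIN categories c ON c.id = s.category_id
WHERE c.id IS NULL;

-- Keys duplicated within the staged data itself; at most one row per
-- key can survive a unique constraint:
SELECT "Id", count(*)
FROM parts_staging
GROUP BY "Id"
HAVING count(*) > 1;
```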

I don't think that a catch-it-all handler that handles all errors would be very
useful. Normally, there are certain errors you want to tolerate, while others
should be considered unrecoverable and lead to errors.

Yours,
Laurenz Albe

#11me nefcanto
sn.1361@gmail.com
In reply to: Laurenz Albe (#10)
Re: Bug in copy

Sorry for the long delay.

Let's analyze the scenario of fake data insertion. I want to create a
million fake products, sometimes even 100 million (we're on MariaDB now and
we plan to migrate to Postgres). My team uses fake data for performance
tests and other use cases. There could be literally no way to sanitize
those records.

Another scenario is translations. Even in production we have translation
files for more than 20 languages and more than 2 thousand keys. That means
we need to insert 40 thousand translation records in production.

Another scenario is updating nested-model values for a large hierarchical
table, for example the categories table. Any time a user changes a record
in that table, we need to recalculate the nested model for all categories
and bulk-update the results.

The point is, the database schema is not in our hands. We don't know what
rules exist on each table or which rules change. And it is not practical or
feasible to spend resources on keeping our bulk-insertion logic in sync
with database changes.

It would be good design for Postgres to add a catch-all handler for each
row and report accordingly. Give it 1 million records, and it should give
you back 1 million results.

Is there a problem with implementing this? After all, one expects the most
advanced open-source database to support this real-world requirement.

Regards
Saeed

#12Greg Sabino Mullane
greg@turnstep.com
In reply to: me nefcanto (#11)
Re: Bug in copy

On Mon, Feb 24, 2025 at 9:09 AM me nefcanto <sn.1361@gmail.com> wrote:

The point is, the database schema is not in our hands. We don't know what
rules exist on each table and what rules change. And it's not practical and
feasible to spend resources on keeping our bulk insertion logic with the
database changes.

That's a company problem, and not one that can be solved by changing the
way COPY works.

Is there a problem in implementing this? After all one expects the most
advanced open source database to support this real-world requirement.

We're not going to change Postgres into an ETL application. This thread has
given you numerous Postgres-specific solutions, but there is also no
shortage of ETL applications you can try out.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#13me nefcanto
sn.1361@gmail.com
In reply to: Greg Sabino Mullane (#12)
Re: Bug in copy

I don't understand how adding a catch-all handler turns Postgres into an
ETL application. And about all those numerous Postgres-specific solutions:
can you please enumerate them? There are not numerous options, just a
couple, and none of them work. Inserting the data into a table without
constraints and then performing a merge won't work, because MERGE also does
not have a catch-all handler. What other solutions have been proposed?
pg_bulkload?

So basically two solutions have been suggested, MERGE and pg_bulkload. One
does not work (it just handles more errors and still fails on a single
error), and the other is a command-line utility.

I also did a simple research using AI and came up with this table:

RDBMS System - Row-Level Error Handling for Bulk Copy
Oracle Database - Yes (e.g., SQL*Loader can direct bad records)
Microsoft SQL Server - No (BULK INSERT generally aborts on row errors)
MySQL - Yes (using LOAD DATA with IGNORE options)
PostgreSQL - No (COPY stops on error without workarounds)
IBM DB2 - Yes (LOAD utility supports error logging)
Teradata - Yes (has options for capturing reject rows)
SAP HANA - No (bulk load typically aborts on errors)
MariaDB - Yes (similar to MySQL with error-handling options)
SQLite - No (no dedicated bulk load command with error handling)
Amazon Redshift - Yes (COPY command allows error tolerance settings)
Google BigQuery - Yes (load jobs can be configured to skip bad rows)
Netezza - Yes (supports reject files for bulk loads)
Snowflake - Yes (COPY INTO offers error handling parameters)
Vertica - Yes (COPY command logs errors and continues)

10 of these RDBMS systems support row-level error handling and only 4 do
not. This means that the majority have found it to be a useful feature.

Greg, may I ask what your argument is against having a catch-all error
handler in bulk operations? The ETL argument is not a valid one; as I
mentioned previously, none of my real-world requirements were related to
ETL.

Regards
Saeed

#14Greg Sabino Mullane
greg@turnstep.com
In reply to: me nefcanto (#13)
Re: Bug in copy

Please raise this on pgsql-general; this is a feature request, not a bug. I
am not responding to this thread further.