Please implement a catch-all error handler per row, for COPY

Started by me nefcanto · about 1 year ago · 10 messages · general
#1me nefcanto
sn.1361@gmail.com

Hello

Please consider these scenarios:

- I want to create a million fake products, sometimes even 100 million
(we're on MariaDB now and we plan to migrate to Postgres). My team uses
fake data for performance tests and other use cases.
- Another scenario is translations. Even in production, we have translation
files for more than 20 languages and more than 2 thousand keys. That
means we need to insert 40 thousand translation records in production.
- Another scenario is updating nested model values for a large hierarchical
table. For example, the categories table. Anytime the user changes a record
in that table we need to recalculate the nested model for the entire
categories and bulk update the results.

All of these scenarios are such that data sanitation is difficult, if not
impossible, before doing the bulk operation (copy).
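Even so, a common middle ground is to sanitize the stream on the client before it ever reaches COPY: run each CSV row through a validator, pass clean rows on, and divert rejects to a side list or file for later inspection. A minimal sketch, with a hypothetical rule set for a 3-column products file (id, sku, price):

```python
import csv
import io

def split_rows(src, validate):
    """Split CSV rows into (clean, rejected) using a per-row validator."""
    clean, rejected = [], []
    for row in csv.reader(src):
        try:
            validate(row)
            clean.append(row)
        except ValueError:
            rejected.append(row)
    return clean, rejected

# Hypothetical rules: 3 columns, integer id, non-empty sku, numeric price.
def validate_product(row):
    if len(row) != 3:
        raise ValueError("wrong column count")
    int(row[0])            # id must be an integer
    if not row[1]:
        raise ValueError("empty sku")
    float(row[2])          # price must be numeric

data = io.StringIO("1,ABC,9.99\n2,,3.50\nbad,DEF,1.00\n3,GHI,oops\n4,JKL,2.25\n")
clean, rejected = split_rows(data, validate_product)
# clean rows can now be streamed to COPY; rejected ones go to a side file
```

This keeps COPY itself fast while still giving per-row error handling, at the cost of one extra pass over the data.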

I realized that when we specify `on_error ignore`, it handles only a handful
of errors. I thought this was a bug and sent an email to the pgsql-bugs
mailing list. But they said it's the intended behavior.

Can you please provide a row-level catch-all handler for the copy command?

Regards
Saeed

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: me nefcanto (#1)
Re: Please implement a catch-all error handler per row, for COPY

me nefcanto <sn.1361@gmail.com> writes:

Can you please provide a row-level catch-all handler for the copy command?

Very unlikely to happen. COPY is not intended as a general purpose
ETL mechanism, and we don't want to load it down with features that
would create substantial performance penalties. Which that would.

Even ignoring the performance angle, this request seems remarkably
ill-specified. What is a "row-level handler" for errors that have
to do with identifying row boundaries?

regards, tom lane

#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: me nefcanto (#1)
Re: Please implement a catch-all error handler per row, for COPY

On 3/1/25 07:18, me nefcanto wrote:

Hello

All of these scenarios are such that data sanitation is difficult, if not
impossible, before doing the bulk operation (copy).

I realized that when we specify `on_error ignore`, it handles only a
handful of errors. I thought this was a bug and sent an email to the
pgsql-bugs mailing list. But they said it's the intended behavior.

Take a look at:

http://ossc-db.github.io/pg_bulkload/index.html

and

https://pgloader.io/

They may be able to meet your needs.


--
Adrian Klaver
adrian.klaver@aklaver.com

#4David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#2)
Re: Please implement a catch-all error handler per row, for COPY

On Sat, Mar 1, 2025 at 9:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

me nefcanto <sn.1361@gmail.com> writes:

Can you please provide a row-level catch-all handler for the copy command?

Very unlikely to happen. COPY is not intended as a general purpose
ETL mechanism, and we don't want to load it down with features that
would create substantial performance penalties. Which that would.

Maybe it isn't a general-purpose ETL tool, but there is no reasonable way to
do some things unless COPY can be put into a mode that doesn't have the
same performance requirements it needs to serve as our dump/restore tool of
choice.

I have to imagine such a mode, if not enabled, would have little to no
impact on how COPY behaves compared to today. It's kinda like how the
existence of VACUUM FULL doesn't impact how VACUUM behaves. That seems more
desirable than inventing a whole new SQL command and copy-pasting all of the
COPY code, since it does mostly the same thing.

In short, it's probably worth giving it a try if someone wants to.
Rejecting it without seeing the proposal seems premature.

Now, it's probably challenging enough that if the person requesting the
feature isn't driving its development, the odds of it getting worked on are
fairly low.

David J.

#5Greg Sabino Mullane
greg@turnstep.com
In reply to: David G. Johnston (#4)
Re: Please implement a catch-all error handler per row, for COPY

FYI the -bugs thread in question:

/messages/by-id/CAEHBEOBCweDWGNHDaUk4=10HG0QXXJJAGXbEnFLMB30M+Qw+dg@mail.gmail.com

seems to imply the primary blocker was a unique constraint.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#6me nefcanto
sn.1361@gmail.com
In reply to: Greg Sabino Mullane (#5)
Re: Please implement a catch-all error handler per row, for COPY

Thank you all for responding. With respect, I don't understand why COPY
gets related to ETL. All of the requirements I mentioned above have nothing
to do with ETL. We have a table of categories, a hierarchical table. This
is a multitenant app. Hundreds of thousands of records are in it. We want
to calculate the hierarchical properties (nested set models, node depth,
determining leaves, materialized paths, etc.) and then update the entire
table. What does this have to do with ETL? Or, as I said, we create millions
of records for products, attributes of products, and pricing and media of
products to load test our system. Again, what does that have to do with ETL?

The point is that there is already an `on_error ignore` clause there. This
means that somewhere there is a try/catch per row. If I'm wrong, please let
me know. But if `on_error ignore` catches problem x for each row, then it
can catch all problems for each row without any performance problem.

Let me give you an example in C#:

// Catching one specific exception type per row:
try
{
    // process one row
}
catch (SomeException ex)
{
    // skip the row
}

becomes:

// Catching everything, then inspecting the type:
try
{
    // process one row
}
catch (Exception ex)
{
    if (ex is SomeException)
    {
        // skip the row
    }
}

The last catch clause catches everything. How does it affect performance?
Running a simple if for hundreds of millions of iterations is literally
nothing in time complexity.

As I have specified in the bug thread, 7 of the 11 RDBMSs I surveyed support
this. Thus it's not an uncommon, weird request.

Regards
Saeed

On Sat, Mar 1, 2025 at 8:45 PM Greg Sabino Mullane <htamfids@gmail.com>
wrote:


FYI the -bugs thread in question:

/messages/by-id/CAEHBEOBCweDWGNHDaUk4=10HG0QXXJJAGXbEnFLMB30M+Qw+dg@mail.gmail.com

seems to imply the primary blocker was a unique constraint.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#7Christophe Pettus
xof@thebuild.com
In reply to: me nefcanto (#6)
Re: Please implement a catch-all error handler per row, for COPY

On Mar 2, 2025, at 19:44, me nefcanto <sn.1361@gmail.com> wrote:

As I have specified in the bug thread, from 11 RDBMSs, 7 support this. Thus it's not an uncommon weird request.

If your organization is interested in producing a design and a patch, or in paying a developer or organization to do so, that would be the best way forward. Everyone who works on the core PostgreSQL code is either a volunteer or extremely busy. Those who are paid to work on PostgreSQL usually have their priorities mapped out for months (years?) already. The best way to convince the community to adopt a feature is to come bearing a high-quality patch. That's no guarantee, of course, but the probability is much higher that way.

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: me nefcanto (#6)
Re: Please implement a catch-all error handler per row, for COPY

me nefcanto <sn.1361@gmail.com> writes:

The point is that there is already an `on_error ignore` clause there. This
means that somewhere there is a try/catch per row. If I'm wrong, please let
me know.

You are wrong. See

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=d9f7f5d32

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=9e2d87011

and a ton of related commits.

It's possible that the specific case of unique-index violations
could be handled in a similar style. But "catch any error whatever"
is simply not going to happen here, because a subtransaction per
row is not practical.

regards, tom lane
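Since the server won't run a subtransaction per row, the standard client-side workaround is to insert in batches and, when a batch fails, bisect it to isolate the bad rows, so only failing rows pay the retry cost. A toy sketch with a stand-in insert function (a real version would wrap each attempt in its own transaction or savepoint):

```python
def insert_bisect(rows, try_insert):
    """Insert rows in one attempt; on failure, split the batch and recurse
    so only the genuinely bad rows are skipped. try_insert raises on error."""
    bad = []
    def attempt(chunk):
        if not chunk:
            return
        try:
            try_insert(chunk)          # one transaction per attempt
        except Exception:
            if len(chunk) == 1:
                bad.append(chunk[0])   # isolated a failing row
            else:
                mid = len(chunk) // 2
                attempt(chunk[:mid])
                attempt(chunk[mid:])
    attempt(rows)
    return bad

# Stand-in for a COPY/INSERT call: rejects any batch containing a bad row.
inserted = []
def fake_insert(chunk):
    if any(r < 0 for r in chunk):
        raise ValueError("constraint violation")
    inserted.extend(chunk)

bad = insert_bisect([1, 2, -3, 4, -5, 6], fake_insert)
```

With mostly-clean data this costs close to one round trip per batch; only batches that actually contain a bad row are split, which is why it scales far better than a savepoint per row.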

#9Adrian Klaver
adrian.klaver@aklaver.com
In reply to: me nefcanto (#6)
Re: Please implement a catch-all error handler per row, for COPY

On 3/2/25 19:44, me nefcanto wrote:

Thank you all for responding. With respect, I don't understand why COPY
gets related to ETL. All of the requirements I mentioned above have
nothing to do with ETL. We have a table of categories. A hierarchical
table. This is a multitenant app. Hundreds of thousands of records are
in it. We want to calculate the hierarchical properties (nested set
models, node depth, determining leaves, materialized paths, etc.) and
then update the entire table. What does this have to do with ETL? Or as

Pretty sure that defines ETL: extract data from a source, transform it to
meet the properties, and load it.

At any rate:

1) ON_ERROR IGNORE only just appeared in Postgres 17.

2) Any further changes to it would constitute a feature change that can
only happen in a major version release. The next major version is due this
fall and development is already well under way. The chances of it making
it in, given the time frame, are slim, so that pushes a practical
deadline out to fall of 2026.

3) As I mentioned in a previous post, there are packages already out
there that may get you what you want now.
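For context, the Postgres 17 clause in question looks like this (a sketch against a hypothetical table; per the documentation it only skips rows whose values fail data-type conversion, which is why constraint violations such as unique-index failures still abort the load):

```sql
COPY products (id, sku, price)
FROM '/tmp/products.csv'
WITH (FORMAT csv, ON_ERROR ignore, LOG_VERBOSITY verbose);
```

`LOG_VERBOSITY verbose` makes the server emit a NOTICE for each skipped row, which helps when auditing what was dropped.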

--
Adrian Klaver
adrian.klaver@aklaver.com

#10Laurenz Albe
laurenz.albe@cybertec.at
In reply to: me nefcanto (#6)
Re: Please implement a catch-all error handler per row, for COPY

On Mon, 2025-03-03 at 07:14 +0330, me nefcanto wrote:

The point is that there is already an `on_error ignore` clause there. This means
that somewhere there is a try/catch per row. If I'm wrong, please let me know.

The crucial point that Tom referred to is the "per row". What is a row?

Imagine you have a COPY FROM statement for a table with three columns,
and the data look like this:

1,Smith,John
2,Lewis,Jerry
Lee
3,Prince
4,Albe,Laurenz

We may be able to guess what is meant, but how shall the machine know
where the line boundaries are, which data to ignore and which to process?

Currently, that is no problem, because errors are only identified
after the data have been successfully parsed.
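The ambiguity is easy to demonstrate: a parser only discovers that the column count is wrong after it has already committed to a line boundary, and nothing in the stream says whether the stray data belongs to the previous row or the next one. Feeding the sample above through a plain CSV parser:

```python
import csv

# The sample data from the message above, with three expected columns.
data = """1,Smith,John
2,Lewis,Jerry
Lee
3,Prince
4,Albe,Laurenz
"""

rows = list(csv.reader(data.splitlines()))
widths = [len(r) for r in rows]
# The parser dutifully reports five "rows" with widths [3, 3, 1, 2, 3];
# it cannot tell whether "Lee" completes row 2, belongs to row 3's missing
# column, or is garbage to be discarded.
```

Any "skip the bad row" policy has to pick one of those interpretations blindly, which is exactly why a catch-all at this layer is ill-defined.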

Yours,
Laurenz Albe