Add new error_action COPY ON_ERROR "log"

Started by torikoshiaabout 2 years ago58 messages
Jump to latest
#1torikoshia
torikoshia@oss.nttdata.com

Hi,

As described in 9e2d870119, COPY ON_EEOR is expected to have more
"error_action".
(Note that option name was changed by b725b7eec)

I'd like to have a new option "log", which skips soft errors and logs
information that should have resulted in errors to PostgreSQL log.

I think this option has some advantages like below:

1) We can know which number of line input data was not loaded and
reason.

Example:

=# copy t1 from stdin with (on_error log);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF
signal.

1
2
3
z
\.

LOG: invalid input syntax for type integer: "z"
NOTICE: 1 row was skipped due to data type incompatibility
COPY 3

=# \! tail data/log/postgresql*.log
LOG: 22P02: invalid input syntax for type integer: "z"
CONTEXT: COPY t1, line 4, column i: "z"
LOCATION: pg_strtoint32_safe, numutils.c:620
STATEMENT: copy t1 from stdin with (on_error log);

2) Easier maintenance than storing error information in tables or
proprietary log files.
For example, in case a large number of soft errors occur, some
mechanisms are needed to prevent an infinite increase in the size of the
destination data, but we can left it to PostgreSQL's log rotation.

Attached a patch.
This basically comes from previous discussion[1]/messages/by-id/c0fb57b82b150953f26a5c7e340412e8@oss.nttdata.com which did both "ignore"
and "log" soft error.

As shown in the example above, the log output to the client does not
contain CONTEXT, so I'm a little concerned that client cannot see what
line of the input data had a problem without looking at the server log.

What do you think?

[1]: /messages/by-id/c0fb57b82b150953f26a5c7e340412e8@oss.nttdata.com
/messages/by-id/c0fb57b82b150953f26a5c7e340412e8@oss.nttdata.com

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

Attachments:

v1-0001-Add-new-error_action-log-to-ON_ERROR.patchtext/x-diff; name=v1-0001-Add-new-error_action-log-to-ON_ERROR.patchDownload+48-12
#2jian he
jian.universality@gmail.com
In reply to: torikoshia (#1)
Re: Add new error_action COPY ON_ERROR "log"

On Fri, Jan 26, 2024 at 12:42 AM torikoshia <torikoshia@oss.nttdata.com> wrote:

Hi,

As described in 9e2d870119, COPY ON_EEOR is expected to have more
"error_action".
(Note that option name was changed by b725b7eec)

I'd like to have a new option "log", which skips soft errors and logs
information that should have resulted in errors to PostgreSQL log.

I think this option has some advantages like below:

1) We can know which number of line input data was not loaded and
reason.

Example:

=# copy t1 from stdin with (on_error log);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF
signal.

1
2
3
z
\.

LOG: invalid input syntax for type integer: "z"
NOTICE: 1 row was skipped due to data type incompatibility
COPY 3

=# \! tail data/log/postgresql*.log
LOG: 22P02: invalid input syntax for type integer: "z"
CONTEXT: COPY t1, line 4, column i: "z"
LOCATION: pg_strtoint32_safe, numutils.c:620
STATEMENT: copy t1 from stdin with (on_error log);

2) Easier maintenance than storing error information in tables or
proprietary log files.
For example, in case a large number of soft errors occur, some
mechanisms are needed to prevent an infinite increase in the size of the
destination data, but we can left it to PostgreSQL's log rotation.

Attached a patch.
This basically comes from previous discussion[1] which did both "ignore"
and "log" soft error.

As shown in the example above, the log output to the client does not
contain CONTEXT, so I'm a little concerned that client cannot see what
line of the input data had a problem without looking at the server log.

What do you think?

I doubt the following part:
If the <literal>log</literal> value is specified,
<command>COPY</command> behaves the same as
<literal>ignore</literal>, exept that
it logs information that should have resulted in errors to
PostgreSQL log at
<literal>INFO</literal> level.

I think it does something like:
When an error happens, cstate->escontext->error_data->elevel will be ERROR
you manually change the cstate->escontext->error_data->elevel to LOG,
then you call ThrowErrorData.

but it's not related to `<literal>INFO</literal> level`?
my log_min_messages is default, warning.

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: torikoshia (#1)
Re: Add new error_action COPY ON_ERROR "log"

On Thu, Jan 25, 2024 at 9:42 AM torikoshia <torikoshia@oss.nttdata.com>
wrote:

Hi,

As described in 9e2d870119, COPY ON_EEOR is expected to have more
"error_action".
(Note that option name was changed by b725b7eec)

I'd like to have a new option "log", which skips soft errors and logs
information that should have resulted in errors to PostgreSQL log.

Seems like an easy win but largely unhelpful in the typical case. I
suppose ETL routines using this feature may be running on their machine
under root or "postgres" but in a system where they are not this very
useful information is inaccessible to them. I suppose the DBA could set up
an extractor to send these specific log lines elsewhere but that seems like
enough hassle to disfavor this approach and favor one that can place the
soft error data and feedback into user-specified tables in the same
database. Setting up temporary tables or unlogged tables probably is going
to be a more acceptable methodology than trying to get to the log files.

David J.

#4torikoshia
torikoshia@oss.nttdata.com
In reply to: David G. Johnston (#3)
Re: Add new error_action COPY ON_ERROR "log"

On Fri, Jan 26, 2024 at 10:44 PM jian he <jian.universality@gmail.com>
wrote:

I doubt the following part:
If the <literal>log</literal> value is specified,
<command>COPY</command> behaves the same as
<literal>ignore</literal>, exept that
it logs information that should have resulted in errors to
PostgreSQL log at
<literal>INFO</literal> level.

I think it does something like:
When an error happens, cstate->escontext->error_data->elevel will be
ERROR
you manually change the cstate->escontext->error_data->elevel to LOG,
then you call ThrowErrorData.

but it's not related to `<literal>INFO</literal> level`?
my log_min_messages is default, warning.

Thanks!

Modified them to NOTICE in accordance with the following summary
message:

NOTICE: x row was skipped due to data type incompatibility

On 2024-01-27 00:43, David G. Johnston wrote:

On Thu, Jan 25, 2024 at 9:42 AM torikoshia
<torikoshia@oss.nttdata.com> wrote:

Hi,

As described in 9e2d870119, COPY ON_EEOR is expected to have more
"error_action".
(Note that option name was changed by b725b7eec)

I'd like to have a new option "log", which skips soft errors and
logs
information that should have resulted in errors to PostgreSQL log.

Seems like an easy win but largely unhelpful in the typical case. I
suppose ETL routines using this feature may be running on their
machine under root or "postgres" but in a system where they are not
this very useful information is inaccessible to them. I suppose the
DBA could set up an extractor to send these specific log lines
elsewhere but that seems like enough hassle to disfavor this approach
and favor one that can place the soft error data and feedback into
user-specified tables in the same database. Setting up temporary
tables or unlogged tables probably is going to be a more acceptable
methodology than trying to get to the log files.

David J.

I agree that not a few people would prefer to store error information in
tables and there have already been suggestions[1]/messages/by-id/CACJufxEkkqnozdnvNMGxVAA94KZaCPkYw_Cx4JKG9ueNaZma_A@mail.gmail.com.

OTOH not everyone thinks saving table information is the best idea[2]/messages/by-id/20231109002600.fuihn34bjqqgmbjm@awork3.anarazel.de.

I think it would be desirable for ON_ERROR to be in a form that allows
the user to choose where to store error information from among some
options, such as table, log and file.

"ON_ERROR log" would be useful at least in the case of 'running on their
machine under root or "postgres"' as you pointed out.

[1]: /messages/by-id/CACJufxEkkqnozdnvNMGxVAA94KZaCPkYw_Cx4JKG9ueNaZma_A@mail.gmail.com
/messages/by-id/CACJufxEkkqnozdnvNMGxVAA94KZaCPkYw_Cx4JKG9ueNaZma_A@mail.gmail.com

[2]: /messages/by-id/20231109002600.fuihn34bjqqgmbjm@awork3.anarazel.de
/messages/by-id/20231109002600.fuihn34bjqqgmbjm@awork3.anarazel.de

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

Attachments:

v2-0001-Add-new-error_action-log-to-ON_ERROR-option.patchtext/x-diff; name=v2-0001-Add-new-error_action-log-to-ON_ERROR-option.patchDownload+53-12
#5Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: torikoshia (#4)
Re: Add new error_action COPY ON_ERROR "log"

On Mon, Jan 29, 2024 at 8:41 AM torikoshia <torikoshia@oss.nttdata.com> wrote:

On 2024-01-27 00:43, David G. Johnston wrote:

I'd like to have a new option "log", which skips soft errors and
logs
information that should have resulted in errors to PostgreSQL log.

user-specified tables in the same database. Setting up temporary
tables or unlogged tables probably is going to be a more acceptable
methodology than trying to get to the log files.

OTOH not everyone thinks saving table information is the best idea[2].

The added NOTICE message gives a summary of how many rows were
skipped, but not the line numbers. It's hard for the users to find the
malformed data, especially when they are bulk-inserting from data
files of multiple GBs in size (hard to open such files in any editor
too).

I understand the value of writing the info to server log or tables of
users' choice as being discussed in a couple of other threads.
However, I'd prefer we do the simplest thing first before we go debate
server log or table - let the users know what line numbers are
containing the errors that COPY ignored something like [1]postgres=# CREATE TABLE check_ign_err (n int, m int[], k int); CREATE TABLE postgres=# COPY check_ign_err FROM STDIN WITH (on_error ignore); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. with a
simple change like [2]diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 906756362e..93ab5d4ebb 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -961,7 +961,16 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,. It not only helps users figure out which rows
and attributes were malformed, but also helps them redirect them to
server logs with setting log_min_messages = notice [3]2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 2 for column check_ign_err, COPY n 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 2, column n: "a" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 3 for column check_ign_err, COPY k 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 3, column k: "3333333333" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 4 for column check_ign_err, COPY m 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 4, column m: "{a, 4}" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 5 for column check_ign_err, COPY n 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 5, column n: "" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: 4 rows were skipped due to data type incompatibility. In the worst
case scenario, a problem with this one NOTICE per malformed row is
that it can overload the psql session if all the rows are malformed.
I'm not sure if this is a big problem, but IMO better than a single
summary NOTICE message and simpler than writing to tables of users'
choice.

Thoughts?

FWIW, I presented the new COPY ... ON_ERROR option feature at
Hyderabad PostgreSQL User Group meetup and it was well-received by the
audience. The audience felt it's going to be extremely helpful for
bulk-loading tasks. Thank you all for working on this feature.

[1]: postgres=# CREATE TABLE check_ign_err (n int, m int[], k int); CREATE TABLE postgres=# COPY check_ign_err FROM STDIN WITH (on_error ignore); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal.
postgres=# CREATE TABLE check_ign_err (n int, m int[], k int);
CREATE TABLE
postgres=# COPY check_ign_err FROM STDIN WITH (on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.

1 {1} 1

a {2} 2
3 {3} 3333333333
4 {a, 4} 4

5 {5}>> >> >> >> >> 5
\.>>
NOTICE: detected data type incompatibility at line number 2 for
column check_ign_err, COPY n
NOTICE: detected data type incompatibility at line number 3 for
column check_ign_err, COPY k
NOTICE: detected data type incompatibility at line number 4 for
column check_ign_err, COPY m
NOTICE: detected data type incompatibility at line number 5 for
column check_ign_err, COPY n
NOTICE: 4 rows were skipped due to data type incompatibility
COPY 2

[2]
diff --git a/src/backend/commands/copyfromparse.c
b/src/backend/commands/copyfromparse.c
index 906756362e..93ab5d4ebb 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -961,7 +961,16 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
                 &values[m]))
                        {
                                cstate->num_errors++;
-                               return true;
+
+                               if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
+                               {
+                                       ereport(NOTICE,
+
errmsg("detected data type incompatibility at line number %llu for
column %s, COPY %s",
+
(unsigned long long) cstate->cur_lineno,
+
cstate->cur_relname,
+
cstate->cur_attname));
+                                       return true;
+                               }
                        }

cstate->cur_attname = NULL;

[3]: 2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 2 for column check_ign_err, COPY n 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 2, column n: "a" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 3 for column check_ign_err, COPY k 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 3, column k: "3333333333" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 4 for column check_ign_err, COPY m 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 4, column m: "{a, 4}" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type incompatibility at line number 5 for column check_ign_err, COPY n 2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err, line 5, column n: "" 2024-02-12 06:20:29.363 UTC [427892] NOTICE: 4 rows were skipped due to data type incompatibility
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 2 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 2, column n: "a"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 3 for column check_ign_err, COPY k
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 3, column k: "3333333333"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 4 for column check_ign_err, COPY m
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 4, column m: "{a, 4}"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 5 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 5, column n: ""
2024-02-12 06:20:29.363 UTC [427892] NOTICE: 4 rows were skipped due
to data type incompatibility

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6torikoshia
torikoshia@oss.nttdata.com
In reply to: Bharath Rupireddy (#5)
Re: Add new error_action COPY ON_ERROR "log"

On 2024-02-12 15:46, Bharath Rupireddy wrote:

Thanks for your comments.

On Mon, Jan 29, 2024 at 8:41 AM torikoshia <torikoshia@oss.nttdata.com>
wrote:

On 2024-01-27 00:43, David G. Johnston wrote:

I'd like to have a new option "log", which skips soft errors and
logs
information that should have resulted in errors to PostgreSQL log.

user-specified tables in the same database. Setting up temporary
tables or unlogged tables probably is going to be a more acceptable
methodology than trying to get to the log files.

OTOH not everyone thinks saving table information is the best idea[2].

The added NOTICE message gives a summary of how many rows were
skipped, but not the line numbers. It's hard for the users to find the
malformed data, especially when they are bulk-inserting from data
files of multiple GBs in size (hard to open such files in any editor
too).

I understand the value of writing the info to server log or tables of
users' choice as being discussed in a couple of other threads.
However, I'd prefer we do the simplest thing first before we go debate
server log or table - let the users know what line numbers are
containing the errors that COPY ignored something like [1] with a
simple change like [2].

Agreed.
Unlike my patch, it hides the error information(i.e. 22P02: invalid
input syntax for type integer: ), but I feel that it's usually
sufficient to know the row number and column where the error occurred.

It not only helps users figure out which rows
and attributes were malformed, but also helps them redirect them to
server logs with setting log_min_messages = notice [3]. In the worst
case scenario, a problem with this one NOTICE per malformed row is
that it can overload the psql session if all the rows are malformed.
I'm not sure if this is a big problem, but IMO better than a single
summary NOTICE message and simpler than writing to tables of users'
choice.

Maybe could we do what you suggested for the behavior when 'log' is set
to on_error?

Thoughts?

FWIW, I presented the new COPY ... ON_ERROR option feature at
Hyderabad PostgreSQL User Group meetup and it was well-received by the
audience. The audience felt it's going to be extremely helpful for
bulk-loading tasks. Thank you all for working on this feature.

Thanks for informing it!
I hope it will not be reverted as the audience:)

[1]
postgres=# CREATE TABLE check_ign_err (n int, m int[], k int);
CREATE TABLE
postgres=# COPY check_ign_err FROM STDIN WITH (on_error ignore);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF
signal.

1 {1} 1

a {2} 2
3 {3} 3333333333
4 {a, 4} 4

5 {5}>> >> >> >> >> 5
\.>>
NOTICE: detected data type incompatibility at line number 2 for
column check_ign_err, COPY n
NOTICE: detected data type incompatibility at line number 3 for
column check_ign_err, COPY k
NOTICE: detected data type incompatibility at line number 4 for
column check_ign_err, COPY m
NOTICE: detected data type incompatibility at line number 5 for
column check_ign_err, COPY n
NOTICE: 4 rows were skipped due to data type incompatibility
COPY 2

[2]
diff --git a/src/backend/commands/copyfromparse.c
b/src/backend/commands/copyfromparse.c
index 906756362e..93ab5d4ebb 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -961,7 +961,16 @@ NextCopyFrom(CopyFromState cstate, ExprContext 
*econtext,
&values[m]))
{
cstate->num_errors++;
-                               return true;
+
+                               if (cstate->opts.on_error != 
COPY_ON_ERROR_STOP)
+                               {
+                                       ereport(NOTICE,
+
errmsg("detected data type incompatibility at line number %llu for
column %s, COPY %s",
+
(unsigned long long) cstate->cur_lineno,
+
cstate->cur_relname,
+
cstate->cur_attname));
+                                       return true;
+                               }
}

cstate->cur_attname = NULL;

[3]
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 2 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 2, column n: "a"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 3 for column check_ign_err, COPY k
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 3, column k: "3333333333"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 4 for column check_ign_err, COPY m
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 4, column m: "{a, 4}"
2024-02-12 06:20:29.363 UTC [427892] NOTICE: detected data type
incompatibility at line number 5 for column check_ign_err, COPY n
2024-02-12 06:20:29.363 UTC [427892] CONTEXT: COPY check_ign_err,
line 5, column n: ""
2024-02-12 06:20:29.363 UTC [427892] NOTICE: 4 rows were skipped due
to data type incompatibility

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

#7Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: torikoshia (#6)
Re: Add new error_action COPY ON_ERROR "log"

On Wed, Feb 14, 2024 at 5:04 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

[....] let the users know what line numbers are

containing the errors that COPY ignored something like [1] with a
simple change like [2].

Agreed.
Unlike my patch, it hides the error information(i.e. 22P02: invalid
input syntax for type integer: ), but I feel that it's usually
sufficient to know the row number and column where the error occurred.

Right.

It not only helps users figure out which rows
and attributes were malformed, but also helps them redirect them to
server logs with setting log_min_messages = notice [3]. In the worst
case scenario, a problem with this one NOTICE per malformed row is
that it can overload the psql session if all the rows are malformed.
I'm not sure if this is a big problem, but IMO better than a single
summary NOTICE message and simpler than writing to tables of users'
choice.

Maybe could we do what you suggested for the behavior when 'log' is set
to on_error?

My point is that why someone wants just the summary of failures
without row and column info especially for bulk loading tasks. I'd
suggest doing it independently of 'log' or 'table'. I think we can
keep things simple just like the attached patch, and see how this
feature will be adopted. I'm sure we can come back and do things like
saving to 'log' or 'table' or 'separate_error_file' etc., if we
receive any firsthand feedback.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchapplication/octet-stream; name=v1-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchDownload+15-2
#8torikoshia
torikoshia@oss.nttdata.com
In reply to: Bharath Rupireddy (#7)
Re: Add new error_action COPY ON_ERROR "log"

On 2024-02-16 17:15, Bharath Rupireddy wrote:

On Wed, Feb 14, 2024 at 5:04 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

[....] let the users know what line numbers are

containing the errors that COPY ignored something like [1] with a
simple change like [2].

Agreed.
Unlike my patch, it hides the error information(i.e. 22P02: invalid
input syntax for type integer: ), but I feel that it's usually
sufficient to know the row number and column where the error occurred.

Right.

It not only helps users figure out which rows
and attributes were malformed, but also helps them redirect them to
server logs with setting log_min_messages = notice [3]. In the worst
case scenario, a problem with this one NOTICE per malformed row is
that it can overload the psql session if all the rows are malformed.
I'm not sure if this is a big problem, but IMO better than a single
summary NOTICE message and simpler than writing to tables of users'
choice.

Maybe could we do what you suggested for the behavior when 'log' is
set
to on_error?

My point is that why someone wants just the summary of failures
without row and column info especially for bulk loading tasks. I'd
suggest doing it independently of 'log' or 'table'. I think we can
keep things simple just like the attached patch, and see how this
feature will be adopted. I'm sure we can come back and do things like
saving to 'log' or 'table' or 'separate_error_file' etc., if we
receive any firsthand feedback.

Thoughts?

I may be wrong since I seldom do data loading tasks, but I greed with
you.

I also a little concerned about the case where there are many malformed
data and it causes lots of messages, but the information is usually
valuable and if users don't need it, they can suppress it by changing
client_min_messages.

Currently both summary of failures and individual information is logged
in NOTICE level.
If we should assume that there are cases where only summary information
is required, it'd be useful to set lower log level, i.e. LOG to the
individual information.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

#9Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: torikoshia (#8)
Re: Add new error_action COPY ON_ERROR "log"

On Fri, Feb 16, 2024 at 8:17 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

I may be wrong since I seldom do data loading tasks, but I greed with
you.

I also a little concerned about the case where there are many malformed
data and it causes lots of messages, but the information is usually
valuable and if users don't need it, they can suppress it by changing
client_min_messages.

Currently both summary of failures and individual information is logged
in NOTICE level.
If we should assume that there are cases where only summary information
is required, it'd be useful to set lower log level, i.e. LOG to the
individual information.

How about we emit the summary at INFO level and individual information
at NOTICE level? With this, the summary is given a different priority
than the individual info. With SET client_min_messages = WARNING; one
can still get the summary but not the individual info. Also, to get
all of these into server log, one can SET log_min_messages = INFO; or
SET log_min_messages = NOTICE;.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchapplication/x-patch; name=v2-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchDownload+17-4
#10torikoshia
torikoshia@oss.nttdata.com
In reply to: Bharath Rupireddy (#9)
Re: Add new error_action COPY ON_ERROR "log"

On 2024-02-17 15:00, Bharath Rupireddy wrote:

On Fri, Feb 16, 2024 at 8:17 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:

I may be wrong since I seldom do data loading tasks, but I greed with
you.

I also a little concerned about the case where there are many
malformed
data and it causes lots of messages, but the information is usually
valuable and if users don't need it, they can suppress it by changing
client_min_messages.

Currently both summary of failures and individual information is
logged
in NOTICE level.
If we should assume that there are cases where only summary
information
is required, it'd be useful to set lower log level, i.e. LOG to the
individual information.

How about we emit the summary at INFO level and individual information
at NOTICE level? With this, the summary is given a different priority
than the individual info. With SET client_min_messages = WARNING; one
can still get the summary but not the individual info. Also, to get
all of these into server log, one can SET log_min_messages = INFO; or
SET log_min_messages = NOTICE;.

Thoughts?

It looks good to me.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

#11torikoshia
torikoshia@oss.nttdata.com
In reply to: torikoshia (#10)
Re: Add new error_action COPY ON_ERROR "log"

On 2024-02-20 17:22, torikoshia wrote:

On 2024-02-17 15:00, Bharath Rupireddy wrote:

On Fri, Feb 16, 2024 at 8:17 PM torikoshia
<torikoshia@oss.nttdata.com> wrote:

I may be wrong since I seldom do data loading tasks, but I greed with
you.

I also a little concerned about the case where there are many
malformed
data and it causes lots of messages, but the information is usually
valuable and if users don't need it, they can suppress it by changing
client_min_messages.

Currently both summary of failures and individual information is
logged
in NOTICE level.
If we should assume that there are cases where only summary
information
is required, it'd be useful to set lower log level, i.e. LOG to the
individual information.

How about we emit the summary at INFO level and individual information
at NOTICE level? With this, the summary is given a different priority
than the individual info. With SET client_min_messages = WARNING; one
can still get the summary but not the individual info. Also, to get
all of these into server log, one can SET log_min_messages = INFO; or
SET log_min_messages = NOTICE;.

Thoughts?

It looks good to me.

Here are comments on the v2 patch.

+               if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
+               {
+                   ereport(NOTICE,

I think cstate->opts.on_error is not COPY_ON_ERROR_STOP here, since if
it is COPY_ON_ERROR_STOP, InputFunctionCallSafe() should already have
errored out.

Should it be something like "Assert(cstate->opts.on_error !=
COPY_ON_ERROR_STOP)"?

Should below manual also be updated?

A NOTICE message containing the ignored row count is emitted at the end
of the COPY FROM if at least one row was discarded.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

#12Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: torikoshia (#11)
Re: Add new error_action COPY ON_ERROR "log"

On Mon, Feb 26, 2024 at 5:47 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

It looks good to me.

Here are comments on the v2 patch.

Thanks for looking at it.

+               if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
+               {
+                   ereport(NOTICE,

I think cstate->opts.on_error is not COPY_ON_ERROR_STOP here, since if
it is COPY_ON_ERROR_STOP, InputFunctionCallSafe() should already have
errored out.

Should it be something like "Assert(cstate->opts.on_error !=
COPY_ON_ERROR_STOP)"?

Nice catch. When COPY_ON_ERROR_STOP is specified, we use ereport's
soft error mechanism. An assertion seems a good choice to validate the
state is what we expect. Done that way.

Should below manual also be updated?

A NOTICE message containing the ignored row count is emitted at the end
of the COPY FROM if at least one row was discarded.

Changed.

PSA v3 patch with the above review comments addressed.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchapplication/octet-stream; name=v3-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchDownload+18-4
#13Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#12)
Re: Add new error_action COPY ON_ERROR "log"

On Wed, Feb 28, 2024 at 12:10:00PM +0530, Bharath Rupireddy wrote:

On Mon, Feb 26, 2024 at 5:47 PM torikoshia <torikoshia@oss.nttdata.com> wrote:

+               if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
+               {
+                   ereport(NOTICE,

I think cstate->opts.on_error is not COPY_ON_ERROR_STOP here, since if
it is COPY_ON_ERROR_STOP, InputFunctionCallSafe() should already have
errored out.

Should it be something like "Assert(cstate->opts.on_error !=
COPY_ON_ERROR_STOP)"?

Nice catch. When COPY_ON_ERROR_STOP is specified, we use ereport's
soft error mechanism. An assertion seems a good choice to validate the
state is what we expect. Done that way.

Hmm. I am not really on board with this patch, that would generate
one NOTICE message each time a row is incompatible in the soft error
mode. If you have a couple of billion rows to bulk-load into the
backend and even 0.01% of them are corrupted, you could finish with a
more than 100k log entries, and all systems should be careful about
the log quantity generated, especially if we use the syslogger which
could become easily a bottleneck.

The existing ON_ERROR controls what to do on error. I think that we'd
better control the amount of information reported with a completely
separate option, an option even different than where to redirect
errors (if required, which would be either the logs, the client, a
pipe, a combination of these or even all of them).
--
Michael

#14Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#13)
Re: Add new error_action COPY ON_ERROR "log"

On Fri, Mar 1, 2024 at 10:22 AM Michael Paquier <michael@paquier.xyz> wrote:

Nice catch. When COPY_ON_ERROR_STOP is specified, we use ereport's
soft error mechanism. An assertion seems a good choice to validate the
state is what we expect. Done that way.

Hmm. I am not really on board with this patch, that would generate
one NOTICE message each time a row is incompatible in the soft error
mode. If you have a couple of billion rows to bulk-load into the
backend and even 0.01% of them are corrupted, you could finish with a
more than 100k log entries, and all systems should be careful about
the log quantity generated, especially if we use the syslogger which
could become easily a bottleneck.

Hm. I was having some concerns about it as mentioned upthread. But,
thanks a lot for illustrating it.

The existing ON_ERROR controls what to do on error. I think that we'd
better control the amount of information reported with a completely
separate option, an option even different than where to redirect
errors (if required, which would be either the logs, the client, a
pipe, a combination of these or even all of them).

How about an extra option to error_action ignore-with-verbose which is
similar to ignore but when specified emits one NOTICE per malformed
row? With this, one can say COPY x FROM stdin (ON_ERROR
ignore-with-verbose);.

Alternatively, we can think of adding a new option verbose altogether
which can be used for not only this but for reporting some other COPY
related info/errors etc. With this, one can say COPY x FROM stdin
(VERBOSE on, ON_ERROR ignore);.

There's also another way of having a separate GUC, but -100 from me
for it. Because, it not only increases the total number of GUCs by 1,
but also might set a wrong precedent to have a new GUC for controlling
command level outputs.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#14)
Re: Add new error_action COPY ON_ERROR "log"

On Mon, Mar 04, 2024 at 05:00:00AM +0530, Bharath Rupireddy wrote:

How about an extra option to error_action ignore-with-verbose which is
similar to ignore but when specified emits one NOTICE per malformed
row? With this, one can say COPY x FROM stdin (ON_ERROR
ignore-with-verbose);.

Alternatively, we can think of adding a new option verbose altogether
which can be used for not only this but for reporting some other COPY
related info/errors etc. With this, one can say COPY x FROM stdin
(VERBOSE on, ON_ERROR ignore);.

I would suggest a completely separate option, because that offers more
flexibility as each option has a separate meaning. My main concern in
using one option to control them all is that one may want in the
future to be able to specify more combinations of actions at query
level, especially if more modes are added to the ON_ERROR mode. One
option would prevent that.

Perhaps ERROR_VERBOSE or ERROR_VERBOSITY would be better names, but
I'm never wedded to my naming suggestions. Bad history with the
matter.

There's also another way of having a separate GUC, but -100 from me
for it. Because, it not only increases the total number of GUCs by 1,
but also might set a wrong precedent to have a new GUC for controlling
command level outputs.

What does this have to do with GUCs? The ON_ERROR option does not
have one.
--
Michael

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#15)
Re: Add new error_action COPY ON_ERROR "log"

On Tue, Mar 5, 2024 at 4:48 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Mar 04, 2024 at 05:00:00AM +0530, Bharath Rupireddy wrote:

How about an extra option to error_action ignore-with-verbose which is
similar to ignore but when specified emits one NOTICE per malformed
row? With this, one can say COPY x FROM stdin (ON_ERROR
ignore-with-verbose);.

Alternatively, we can think of adding a new option verbose altogether
which can be used for not only this but for reporting some other COPY
related info/errors etc. With this, one can say COPY x FROM stdin
(VERBOSE on, ON_ERROR ignore);.

I would suggest a completely separate option, because that offers more
flexibility as each option has a separate meaning. My main concern in
using one option to control them all is that one may want in the
future to be able to specify more combinations of actions at query
level, especially if more modes are added to the ON_ERROR mode. One
option would prevent that.

Perhaps ERROR_VERBOSE or ERROR_VERBOSITY would be better names, but
I'm never wedded to my naming suggestions. Bad history with the
matter.

+1 for a separate option and LOG_VERBOSITY seemed a better and generic
naming choice. Because, the ON_ERROR ignore isn't actually an error
per se IMO.

There's also another way of having a separate GUC, but -100 from me
for it. Because, it not only increases the total number of GUCs by 1,
but also might set a wrong precedent to have a new GUC for controlling
command level outputs.

What does this have to do with GUCs? The ON_ERROR option does not
have one.

My thought was to have a separate GUC for deciding log level for COPY
command messages/errors similar to log_replication_commands. But
that's a no-go for sure when compared with a separate option.

Please see the attached v4 patch. If it looks good, I can pull
LOG_VERBOSITY changes out into 0001 and with 0002 containing the
detailed messages for discarded rows.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchapplication/x-patch; name=v4-0001-Add-detailed-info-when-COPY-skips-soft-errors.patchDownload+45-6
#17Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#16)
Re: Add new error_action COPY ON_ERROR "log"

On Wed, Mar 06, 2024 at 07:32:28PM +0530, Bharath Rupireddy wrote:

Please see the attached v4 patch. If it looks good, I can pull
LOG_VERBOSITY changes out into 0001 and with 0002 containing the
detailed messages for discarded rows.

The approach looks sensible seen from here.

+    LOG_VERBOSITY [ <replaceable class="parameter">boolean</replaceable> ]
[...]
+    <term><literal>LOG_VERBOSITY</literal></term>
+    <listitem>
+     <para>
+      Sets the verbosity of logged messages by <command>COPY</command>
+      command. As an example, see its usage for
+      <command>COPY FROM</command> command's <literal>ON_ERROR</literal>
+      clause with <literal>ignore</literal> option.
+     </para>
+    </listitem>

Is a boolean the best interface for the end-user, though? Maybe
something like a "mode" value would speak more than a yes/no from the
start, say a "default" mode to emit only the last LOG and a "verbose"
for the whole set in the case of ON_ERROR? That could use an enum
from the start internally, but that's an implementation detail.

Describing what gets logged in the paragraph of ON_ERROR sounds fine,
especially if in the future more logs are added depending on other
options. That's an assumption at this stage, of course.

I am adding Alexander Korotkov in CC, as the original committer of
9e2d8701194f, as I assume that he may want to chime in this
discussion.

Torikoshi-san or others, if you have any comments about the interface,
feel free.
--
Michael

#18torikoshia
torikoshia@oss.nttdata.com
In reply to: Michael Paquier (#17)
Re: Add new error_action COPY ON_ERROR "log"

On 2024-03-07 13:00, Michael Paquier wrote:

On Wed, Mar 06, 2024 at 07:32:28PM +0530, Bharath Rupireddy wrote:

Please see the attached v4 patch. If it looks good, I can pull
LOG_VERBOSITY changes out into 0001 and with 0002 containing the
detailed messages for discarded rows.

The approach looks sensible seen from here.

+    LOG_VERBOSITY [ <replaceable 
class="parameter">boolean</replaceable> ]
[...]
+    <term><literal>LOG_VERBOSITY</literal></term>
+    <listitem>
+     <para>
+      Sets the verbosity of logged messages by <command>COPY</command>
+      command. As an example, see its usage for
+      <command>COPY FROM</command> command's 
<literal>ON_ERROR</literal>
+      clause with <literal>ignore</literal> option.
+     </para>
+    </listitem>

Is a boolean the best interface for the end-user, though? Maybe
something like a "mode" value would speak more than a yes/no from the
start, say a "default" mode to emit only the last LOG and a "verbose"
for the whole set in the case of ON_ERROR? That could use an enum
from the start internally, but that's an implementation detail.

Describing what gets logged in the paragraph of ON_ERROR sounds fine,
especially if in the future more logs are added depending on other
options. That's an assumption at this stage, of course.

I am adding Alexander Korotkov in CC, as the original committer of
9e2d8701194f, as I assume that he may want to chime in this
discussion.

Torikoshi-san or others, if you have any comments about the interface,
feel free.

Thanks.

Maybe I'm overly concerned, but I'm a little concerned about whether the
contents of this output can really be called verbose, since it does not
output the actual soft error message that occurred, but only the row and
column where the error occurred.

Since the soft error mechanism can at least output the contents of soft
errors in the server log [1]/messages/by-id/20230322175000.qbdctk7bnmifh5an@awork3.anarazel.de, it might be a good idea to use something
like a 'mode' value instead of boolean as Michael-san suggested, so that
the log output contents can be adjusted at multiple levels.

[1]: /messages/by-id/20230322175000.qbdctk7bnmifh5an@awork3.anarazel.de
/messages/by-id/20230322175000.qbdctk7bnmifh5an@awork3.anarazel.de

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

#19Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Michael Paquier (#17)
Re: Add new error_action COPY ON_ERROR "log"

On Thu, Mar 7, 2024 at 1:00 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Mar 06, 2024 at 07:32:28PM +0530, Bharath Rupireddy wrote:

Please see the attached v4 patch. If it looks good, I can pull
LOG_VERBOSITY changes out into 0001 and with 0002 containing the
detailed messages for discarded rows.

The approach looks sensible seen from here.

Looks like a good approach to me.

+    LOG_VERBOSITY [ <replaceable class="parameter">boolean</replaceable> ]
[...]
+    <term><literal>LOG_VERBOSITY</literal></term>
+    <listitem>
+     <para>
+      Sets the verbosity of logged messages by <command>COPY</command>
+      command. As an example, see its usage for
+      <command>COPY FROM</command> command's <literal>ON_ERROR</literal>
+      clause with <literal>ignore</literal> option.
+     </para>
+    </listitem>

Is a boolean the best interface for the end-user, though? Maybe
something like a "mode" value would speak more than a yes/no from the
start, say a "default" mode to emit only the last LOG and a "verbose"
for the whole set in the case of ON_ERROR? That could use an enum
from the start internally, but that's an implementation detail.

+1 for making it an enum, so that we will be able to have multiple
levels for example to get actual soft error contents.

One question I have is; do we want to write multiple NOTICE messages
for one row if the row has malformed data on some columns?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#20Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#19)
Re: Add new error_action COPY ON_ERROR "log"

On Thu, Mar 07, 2024 at 03:52:41PM +0900, Masahiko Sawada wrote:

One question I have is; do we want to write multiple NOTICE messages
for one row if the row has malformed data on some columns?

Good idea. We can do that as the field strings are already parsed.
--
Michael

#21Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#17)
#22Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#20)
#23Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#21)
#24Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#22)
#25Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#23)
#26Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#25)
#27Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#27)
#29Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#28)
#30Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#29)
#31Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#30)
#32Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#31)
#33Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#33)
#35Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#34)
#36Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#35)
#37Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#36)
#38Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#37)
#39Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#35)
#40Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#39)
#41Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#40)
#42Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#41)
#43Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#42)
#44Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#43)
#45Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#44)
#46Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#45)
#47Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#46)
#48Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#47)
#49Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#48)
#50torikoshia
torikoshia@oss.nttdata.com
In reply to: Masahiko Sawada (#45)
#51Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: torikoshia (#50)
#52torikoshia
torikoshia@oss.nttdata.com
In reply to: Bharath Rupireddy (#51)
#53Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#51)
#54torikoshia
torikoshia@oss.nttdata.com
In reply to: torikoshia (#52)
#55Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#53)
#56Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#55)
#57Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#56)
#58Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#57)