Add new COPY option REJECT_LIMIT
Hi,
9e2d870 enabled the COPY command to skip soft error, and I think we can
add another option which specifies the maximum tolerable number of soft
errors.
I remember this was discussed in [1]/messages/by-id/752672.1699474336@sss.pgh.pa.us, and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.
Attached a patch for this.
What do you think?
[1]: /messages/by-id/752672.1699474336@sss.pgh.pa.us
/messages/by-id/752672.1699474336@sss.pgh.pa.us
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments:
v1-0001-Add-new-COPY-option-REJECT_LIMIT.patchtext/x-diff; name=v1-0001-Add-new-COPY-option-REJECT_LIMIT.patchDownload
From 7f111e98e21654c4ca338c93d7cbb4ec9acaabcb Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Fri, 26 Jan 2024 18:32:40 +0900
Subject: [PATCH v1] Add new COPY option REJECT_LIMIT
REJECT_LIMIT specifies the maximum tolerable number of malformed rows.
If input data has more malformed errors than this value, entire COPY fails.
This option must be used with ON_ERROR to be set to other than stop.
---
doc/src/sgml/ref/copy.sgml | 13 +++++++++++++
src/backend/commands/copy.c | 16 ++++++++++++++++
src/backend/commands/copyfrom.c | 6 ++++++
src/include/commands/copy.h | 1 +
src/test/regress/expected/copy2.out | 10 ++++++++++
src/test/regress/sql/copy2.sql | 21 +++++++++++++++++++++
6 files changed, 67 insertions(+)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..8982e8464a 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -393,6 +393,19 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ Specifies the maximum tolerable number of malformed rows.
+ If input data has caused more malformed errors than this value, entire
+ <command>COPY</command> fails.
+ This option must be used with <literal>ON_ERROR</literal> to be set to
+ other than <literal>stop</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>ENCODING</literal></term>
<listitem>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cc0786c6f4..ca5263d588 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -615,6 +615,22 @@ ProcessCopyOptions(ParseState *pstate,
on_error_specified = true;
opts_out->on_error = defGetCopyOnErrorChoice(defel, pstate, is_from);
}
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ int64 reject_limit = defGetInt64(defel);
+
+ if (!opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop")));
+ if (opts_out->reject_limit > 0)
+ errorConflictingDefElem(defel, pstate);
+ if (reject_limit <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT must be greater than zero")));
+ opts_out->reject_limit = reject_limit;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1fe70b9133..15066887ea 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1017,6 +1017,12 @@ CopyFrom(CopyFromState cstate)
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
++skipped);
+ if (cstate->opts.reject_limit > 0 && skipped > cstate->opts.reject_limit)
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("exceeded the number specified by REJECT LIMIT \"%d\"",
+ cstate->opts.reject_limit)));
+
continue;
}
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index b3da3cb0be..8f8dab9524 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -73,6 +73,7 @@ typedef struct CopyFormatOptions
bool *force_null_flags; /* per-column CSV FN flags */
bool convert_selectively; /* do selective binary conversion? */
CopyOnErrorChoice on_error; /* what to do when error happened */
+ int reject_limit; /* tolerable number of malformed rows */
List *convert_select; /* list of column names (can be NIL) */
} CopyFormatOptions;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 25c401ce34..28de7a2685 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -108,6 +108,10 @@ COPY x to stdin (format BINARY, on_error unsupported);
ERROR: COPY ON_ERROR cannot be used with COPY TO
LINE 1: COPY x to stdin (format BINARY, on_error unsupported);
^
+COPY x from stdin with (reject_limit 3);
+ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop
+COPY x from stdin with (on_error ignore, reject_limit 0);
+ERROR: REJECT_LIMIT must be greater than zero
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -751,6 +755,12 @@ CONTEXT: COPY check_ign_err, line 1: "1 {1}"
COPY check_ign_err FROM STDIN WITH (on_error ignore);
ERROR: extra data after last expected column
CONTEXT: COPY check_ign_err, line 1: "1 {1} 3 abc"
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+ERROR: exceeded the number specified by REJECT LIMIT "3"
+CONTEXT: COPY check_ign_err, line 5, column n: ""
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index b5e549e856..e5299b1308 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -80,6 +80,8 @@ COPY x to stdin (format CSV, force_not_null(a));
COPY x to stdout (format TEXT, force_null(a));
COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
+COPY x from stdin with (reject_limit 3);
+COPY x from stdin with (on_error ignore, reject_limit 0);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -534,6 +536,25 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore);
1 {1} 3 abc
\.
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
base-commit: f2743a7d70e7b2891277632121bb51e739743a47
--
2.39.2
On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com>
wrote:
Hi,
9e2d870 enabled the COPY command to skip soft error, and I think we can
add another option which specifies the maximum tolerable number of soft
errors.I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.Attached a patch for this.
What do you think?
I'm opposed to adding this particular feature.
When implementing this kind of business rule I'd need the option to specify
a percentage, not just an absolute value.
I would focus on trying to put the data required to make this kind of
determination into a place where applications implementing such business
rules and monitoring can readily get at it. The "ERRORS TO" and maybe a
corresponding "STATS TO" option where a table can be specified for the
system to place the problematic data and stats about the copy itself.
David J.
On 2024-01-27 00:20, David G. Johnston wrote:
Thanks for your comments!
On Fri, Jan 26, 2024 at 2:49 AM torikoshia
<torikoshia@oss.nttdata.com> wrote:Hi,
9e2d870 enabled the COPY command to skip soft error, and I think we
can
add another option which specifies the maximum tolerable number of
soft
errors.I remember this was discussed in [1], and feel it would be useful
when
loading 'dirty' data but there is a limit to how dirty it can be.Attached a patch for this.
What do you think?
I'm opposed to adding this particular feature.
When implementing this kind of business rule I'd need the option to
specify a percentage, not just an absolute value.
Yeah, it seems useful for some cases.
Actually, Greenplum enables to specify not only the max number of bad
rows but also its percentage[1]https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html.
I may be wrong, but considering some dataloaders support something like
reject_limit(Redshift supports MAXERROR[2]https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html, pg_bulkload supports
PARSE_ERRORS[3]https://ossc-db.github.io/pg_bulkload/pg_bulkload.html), specifying the "number" of the bad row might also be
useful.
I think we can implement reject_limit specified by percentage simply
calculating the ratio of skipped and processed at the end of CopyFrom()
like this:
if (cstate->opts.reject_limit > 0 && (double) skipped / (processed +
skipped) > cstate->opts.reject_limit_percent)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
errmsg("exceeded the ratio specified
by ..
I would focus on trying to put the data required to make this kind of
determination into a place where applications implementing such
business rules and monitoring can readily get at it. The "ERRORS TO"
and maybe a corresponding "STATS TO" option where a table can be
specified for the system to place the problematic data and stats about
the copy itself.
It'd be nice to have such informative tables, but I believe the benefit
of reject_limit is it fails the entire loading when the threshold is
exceeded.
I imagine when we just have error and stats information tables for COPY,
users have to delete the rows when they confirmed too many errors in
these tables.
[1]: https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html
[2]: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[3]: https://ossc-db.github.io/pg_bulkload/pg_bulkload.html
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
On 2024/01/26 18:49, torikoshia wrote:
Hi,
9e2d870 enabled the COPY command to skip soft error, and I think we can add another option which specifies the maximum tolerable number of soft errors.
I remember this was discussed in [1], and feel it would be useful when loading 'dirty' data but there is a limit to how dirty it can be.
Attached a patch for this.
What do you think?
The patch no longer applies cleanly to HEAD. Could you update it?
I think the REJECT_LIMIT feature is useful. Allowing it to be set as either the absolute number of skipped rows or a percentage of the total input rows is a good idea.
However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary. REJECT_LIMIT seems to cover the same cases. For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and REJECT_LIMIT=0 can act like ON_ERROR=stop.
Therefore, having both ON_ERROR and REJECT_LIMIT might be confusing.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2024-07-03 02:07, Fujii Masao wrote:
Thanks for your comments!
On 2024/01/26 18:49, torikoshia wrote:
Hi,
9e2d870 enabled the COPY command to skip soft error, and I think we
can add another option which specifies the maximum tolerable number of
soft errors.I remember this was discussed in [1], and feel it would be useful when
loading 'dirty' data but there is a limit to how dirty it can be.Attached a patch for this.
What do you think?
The patch no longer applies cleanly to HEAD. Could you update it?
I'm going to update it after discussing the option format as described
below.
I think the REJECT_LIMIT feature is useful. Allowing it to be set as
either the absolute number of skipped rows or a percentage of the
total input rows is a good idea.However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR
option is still necessary. REJECT_LIMIT seems to cover the same cases.
For instance, REJECT_LIMIT=infinity can act like ON_ERROR=ignore, and
REJECT_LIMIT=0 can act like ON_ERROR=stop.
I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
I also think it's easy to understand that REJECT_LIMIT=0 is
ON_ERROR=stop.
However, expressing REJECT_LIMIT='infinity' needs some definition like
"setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I
think this might not so intuitive.
Also, since it seems Snowflake and Redshift have both options equivalent
to REJECT_LIMIT and ON_ERROR, having both of them in PostgreSQL COPY
might not be surprising:
- Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num |
'SKIP_FILE_num%' | ABORT_STATEMENT"[1]https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#copy-options-copyoptions
- Redshift has MAXERROR and IGNOREALLERRORS options[2]https://docs.aws.amazon.com/en_en/redshift/latest/dg/copy-parameters-data-load.html
BTW after seeing Snowflake makes SKIP_FILE_num one of the options of
ON_ERROR, I'm a bit wondering whether REJECT_LIMIT also should be the
same.
[1]: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#copy-options-copyoptions
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table#copy-options-copyoptions
[2]: https://docs.aws.amazon.com/en_en/redshift/latest/dg/copy-parameters-data-load.html
https://docs.aws.amazon.com/en_en/redshift/latest/dg/copy-parameters-data-load.html
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
On 2024/07/04 12:05, torikoshia wrote:
I'm going to update it after discussing the option format as described below.
Thanks!
I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
I also think it's easy to understand that REJECT_LIMIT=0 is ON_ERROR=stop.
However, expressing REJECT_LIMIT='infinity' needs some definition like "setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I think this might not so intuitive.
How about allowing REJECT_LIMIT to accept the keywords "infinity", "unlimited",
or "all" in addition to a number? This way, users can specify one of these
keywords instead of -1 to ignore all errors. The server code would then
internally set the REJECT_LIMIT to -1 or another appropriate value when
these keywords are used, but users wouldn't need to worry about this detail.
If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS
might be more intuitive and easier to understand than REJECT_LIMIT.
Also, since it seems Snowflake and Redshift have both options equivalent to REJECT_LIMIT and ON_ERROR, having both of them in PostgreSQL COPY might not be surprising:
- Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num | 'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
- Redshift has MAXERROR and IGNOREALLERRORS options[2]
Ok, so here's a summary of the options and their behaviors:
To ignore all errors and continue to the end:
- Snowflake: ON_ERROR=CONTINUE
- Redshift: IGNOREALLERRORS
- Postgres (with your patch): ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=all
To fail as soon as an error is found:
- Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE
- Redshift: MAXERROR=0 (default)
- Postgres (with your patch): ON_ERROR=stop (default)
- Postgres (with my idea): IGNORE_ERRORS=0 (default)
To fail when NNN or more errors are found:
- Snowflake: ON_ERROR=SKIP_FILE_NNN
- Redshift: MAXERROR=NNN
- Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=NNN
This makes me think it might be better to treat REJECT_LIMIT as
an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
if we adopt your patch. Since ON_ERROR=stop is the default,
users could set the maximum number of allowed errors by specifying
only REJECT_LIMIT. Otherwise, they would need to specify both
ON_ERROR=ignore and REJECT_LIMIT.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2024-07-05 12:59, Fujii Masao wrote:
On 2024/07/04 12:05, torikoshia wrote:
I'm going to update it after discussing the option format as described
below.Thanks!
I agree that it's possible to use only REJECT_LIMIT without ON_ERROR.
I also think it's easy to understand that REJECT_LIMIT=0 is
ON_ERROR=stop.
However, expressing REJECT_LIMIT='infinity' needs some definition like
"setting REJECT_LIMIT to -1 means 'infinity'", doesn't it? If so, I
think this might not so intuitive.How about allowing REJECT_LIMIT to accept the keywords "infinity",
"unlimited",
or "all" in addition to a number? This way, users can specify one of
these
keywords instead of -1 to ignore all errors. The server code would then
internally set the REJECT_LIMIT to -1 or another appropriate value when
these keywords are used, but users wouldn't need to worry about this
detail.
Agreed.
If we choose "all" as the keyword, renaming the option to IGNORE_ERRORS
might be more intuitive and easier to understand than REJECT_LIMIT.
I feel that 'infinite' and 'unlimited' are unfamiliar values for
PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would
be a better parameter name as your suggestion.
Also, since it seems Snowflake and Redshift have both options
equivalent to REJECT_LIMIT and ON_ERROR, having both of them in
PostgreSQL COPY might not be surprising:
- Snowflake's ON_ERROR accepts "CONTINUE | SKIP_FILE | SKIP_FILE_num |
'SKIP_FILE_num%' | ABORT_STATEMENT"[1]
- Redshift has MAXERROR and IGNOREALLERRORS options[2]Ok, so here's a summary of the options and their behaviors:
To ignore all errors and continue to the end:
- Snowflake: ON_ERROR=CONTINUE
- Redshift: IGNOREALLERRORS
- Postgres (with your patch): ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=allTo fail as soon as an error is found:
- Snowflake: ON_ERROR=ABORT_STATEMENT (default) / SKIP_FILE
- Redshift: MAXERROR=0 (default)
- Postgres (with your patch): ON_ERROR=stop (default)
- Postgres (with my idea): IGNORE_ERRORS=0 (default)To fail when NNN or more errors are found:
- Snowflake: ON_ERROR=SKIP_FILE_NNN
- Redshift: MAXERROR=NNN
- Postgres (with your patch): REJECT_LIMIT=NNN-1 and ON_ERROR=ignore
- Postgres (with my idea): IGNORE_ERRORS=NNN
Thanks for the summary.
This makes me think it might be better to treat REJECT_LIMIT as
an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
if we adopt your patch. Since ON_ERROR=stop is the default,
users could set the maximum number of allowed errors by specifying
only REJECT_LIMIT. Otherwise, they would need to specify both
ON_ERROR=ignore and REJECT_LIMIT.
That makes sense.
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
On 2024-07-03 02:07, Fujii Masao wrote:
However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR
option is still necessary.
I remembered another reason for the necessity of ON_ERROR.
ON_ERROR defines how to behave when encountering an error and it just
accepts 'ignore' and 'stop' currently, but is expected to support other
options such as saving details of errors to a table[1]/messages/by-id/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com.
ON_ERROR=stop is a synonym for REJECT_LIMIT=infinity, but I imagine
REJECT_LIMIT would not replace future options of ON_ERROR.
Considering this and the option we want to add this time is to specify
an upper limit on the number or ratio of errors, the name of this option
like "reject_limit" seems better than "ignore_errors".
On Fri, Jul 5, 2024 at 4:13 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:
On 2024-07-05 12:59, Fujii Masao wrote:
On 2024/07/04 12:05, torikoshia wrote:
I'm going to update it after discussing the option format as
described
below.
Updated the patch.
0001 sets limit by the absolute number of error rows and 0002 sets limit
by ratio of the error.
If we choose "all" as the keyword, renaming the option to
IGNORE_ERRORS
might be more intuitive and easier to understand than REJECT_LIMIT.
I feel that 'infinite' and 'unlimited' are unfamiliar values for
PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would
be a better parameter name as your suggestion.
As described above, attached patch adopts REJECT_LIMIT, so it uses
"infinity".
This makes me think it might be better to treat REJECT_LIMIT as
an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
if we adopt your patch. Since ON_ERROR=stop is the default,
users could set the maximum number of allowed errors by specifying
only REJECT_LIMIT. Otherwise, they would need to specify both
ON_ERROR=ignore and REJECT_LIMIT.
That makes sense.
On my second thought, whatever value ON_ERROR is specified(e.g. ignore,
stop, table), it seems fine to use REJECT_LIMIT.
I feel REJECT_LIMIT has both "ignore" and "stop" characteristics,
meaning it ignores errors until it reaches REJECT_LIMIT and stops when
it exceeds the REJECT_LIMIT.
And REJECT_LIMIT seems orthogonal to 'table', which specifies where to
save error details.
Attached patch allows using REJECT_LIMIT regardless of the ON_ERROR
option value.
[1]: /messages/by-id/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com
/messages/by-id/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments:
v2-0001-Add-new-COPY-option-REJECT_LIMIT_number.patchtext/x-diff; charset=us-ascii; name=v2-0001-Add-new-COPY-option-REJECT_LIMIT_number.patchDownload
From d00e4a404fc9c6ccc8aec3885210909a21c55005 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Wed, 17 Jul 2024 21:47:52 +0900
Subject: [PATCH v2] Add new COPY option REJECT_LIMIT number
---
doc/src/sgml/ref/copy.sgml | 20 +++++++++++++
src/backend/commands/copy.c | 46 +++++++++++++++++++++++++++++
src/backend/commands/copyfrom.c | 6 ++++
src/include/commands/copy.h | 10 +++++++
src/test/regress/expected/copy2.out | 14 +++++++++
src/test/regress/sql/copy2.sql | 30 +++++++++++++++++++
6 files changed, 126 insertions(+)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..ee84d9661f 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
+ REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> | INFINITY }
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -411,6 +412,25 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ When a positive integer value is specified, <command>COPY</command> limits
+ the maximum tolerable number of errors while converting a column's input
+ value into its data type.
+ If input data caused more errors than the specified value, entire
+ <command>COPY</command> fails.
+ Otherwise, <command>COPY</command> discards the input row and continues
+ with the next one.
+ </para>
+ <para>
+ When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all
+ the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>ENCODING</literal></term>
<listitem>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index df7a4a21c9..f298614043 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -415,6 +415,43 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
return COPY_ON_ERROR_STOP; /* keep compiler quiet */
}
+/*
+ * Extract a CopyRejectLimits values from a DefElem.
+ */
+static CopyRejectLimits
+defGetCopyRejectLimitOptions(DefElem *def)
+{
+ CopyRejectLimits limits;
+ int64 num_err;
+
+ switch(nodeTag(def->arg))
+ {
+ case T_Integer:
+ num_err = defGetInt64(def);
+ if (num_err <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number for REJECT_LIMIT must be greater than zero")));
+ break;
+ case T_String:
+ if (pg_strcasecmp(defGetString(def), "INFINITY") == 0)
+ /* when set to 0, it is treated as no limit */
+ num_err = 0;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("string for REJECT_LIMIT must be 'INFINITY'")));
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("value for REJECT_LIMIT must be positive integer or 'INFINITY'")));
+ }
+ limits.num_err = num_err;
+
+ return limits;
+}
+
/*
* Extract a CopyLogVerbosityChoice value from a DefElem.
*/
@@ -466,6 +503,7 @@ ProcessCopyOptions(ParseState *pstate,
bool header_specified = false;
bool on_error_specified = false;
bool log_verbosity_specified = false;
+ bool reject_limit_specified = false;
ListCell *option;
/* Support external use for option sanity checking */
@@ -632,6 +670,14 @@ ProcessCopyOptions(ParseState *pstate,
log_verbosity_specified = true;
opts_out->log_verbosity = defGetCopyLogVerbosityChoice(defel, pstate);
}
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ if (reject_limit_specified)
+ errorConflictingDefElem(defel, pstate);
+ reject_limit_specified = true;
+ opts_out->on_error = COPY_ON_ERROR_IGNORE;
+ opts_out->reject_limits = defGetCopyRejectLimitOptions(defel);
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index ce4d62e707..65d950ef07 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1012,6 +1012,12 @@ CopyFrom(CopyFromState cstate)
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
++skipped);
+ if (cstate->opts.reject_limits.num_err &&
+ skipped > cstate->opts.reject_limits.num_err)
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"",
+ (long long) cstate->opts.reject_limits.num_err)));
continue;
}
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 141fd48dc1..86ba1cbe16 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -49,6 +49,15 @@ typedef enum CopyLogVerbosityChoice
COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */
} CopyLogVerbosityChoice;
+/*
+ * A struct to hold reject_limit options, in a parsed form.
+ * More values to be added in another patch.
+ */
+typedef struct CopyRejectLimits
+{
+ int64 num_err; /* maximum tolerable number of errors */
+} CopyRejectLimits;
+
/*
* A struct to hold COPY options, in a parsed form. All of these are related
* to formatting, except for 'freeze', which doesn't really belong here, but
@@ -83,6 +92,7 @@ typedef struct CopyFormatOptions
bool convert_selectively; /* do selective binary conversion? */
CopyOnErrorChoice on_error; /* what to do when error happened */
CopyLogVerbosityChoice log_verbosity; /* verbosity of logged messages */
+ CopyRejectLimits reject_limits; /* thresholds of reject_limit */
List *convert_select; /* list of column names (can be NIL) */
} CopyFormatOptions;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 931542f268..be04bac6b1 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -85,6 +85,10 @@ COPY x from stdin (log_verbosity default, log_verbosity verbose);
ERROR: conflicting or redundant options
LINE 1: COPY x from stdin (log_verbosity default, log_verbosity verb...
^
+COPY x from stdin (reject_limit 'INFINITY', reject_limit 3);
+ERROR: conflicting or redundant options
+LINE 1: COPY x from stdin (reject_limit 'INFINITY', reject_limit 3);
+ ^
-- incorrect options
COPY x to stdin (format BINARY, delimiter ',');
ERROR: cannot specify DELIMITER in BINARY mode
@@ -116,6 +120,8 @@ COPY x to stdout (log_verbosity unsupported);
ERROR: COPY LOG_VERBOSITY "unsupported" not recognized
LINE 1: COPY x to stdout (log_verbosity unsupported);
^
+COPY x from stdin with (reject_limit 0);
+ERROR: number for REJECT_LIMIT must be greater than zero
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -789,6 +795,14 @@ CONTEXT: COPY check_ign_err, line 1: "1 {1}"
COPY check_ign_err FROM STDIN WITH (on_error ignore);
ERROR: extra data after last expected column
CONTEXT: COPY check_ign_err, line 1: "1 {1} 3 abc"
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (reject_limit 3);
+ERROR: exceeded the number specified by REJECT_LIMIT "3"
+CONTEXT: COPY check_ign_err, line 5, column n: ""
+COPY check_ign_err FROM STDIN WITH (reject_limit 4);
+NOTICE: 4 rows were skipped due to data type incompatibility
+COPY check_ign_err FROM STDIN WITH (reject_limit 'INFINITY');
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 8b14962194..2ddc50ce8d 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -68,6 +68,7 @@ COPY x from stdin (convert_selectively (a), convert_selectively (b));
COPY x from stdin (encoding 'sql_ascii', encoding 'sql_ascii');
COPY x from stdin (on_error ignore, on_error ignore);
COPY x from stdin (log_verbosity default, log_verbosity verbose);
+COPY x from stdin (reject_limit 'INFINITY', reject_limit 3);
-- incorrect options
COPY x to stdin (format BINARY, delimiter ',');
@@ -82,6 +83,7 @@ COPY x to stdout (format TEXT, force_null(a));
COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
+COPY x from stdin with (reject_limit 0);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -557,6 +559,34 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore);
1 {1} 3 abc
\.
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (reject_limit 3);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (reject_limit 4);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (reject_limit 'INFINITY');
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
base-commit: 86d33987e8b0364b468c9b40c5f2a0a1aed87ef1
--
2.39.2
v2-0002-Add-new-COPY-option-REJECT_LIMIT_ratio.patchtext/x-diff; charset=us-ascii; name=v2-0002-Add-new-COPY-option-REJECT_LIMIT_ratio.patchDownload
From 3ba99a391d96cda49c1d45ed8c965bc65cb675c3 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Wed, 17 Jul 2024 21:55:27 +0900
Subject: [PATCH v2] Add new COPY option REJECT_LIMIT ratio
---
doc/src/sgml/ref/copy.sgml | 11 ++++++++++-
src/backend/commands/copy.c | 18 +++++++++++++-----
src/backend/commands/copyfrom.c | 10 ++++++++++
src/include/commands/copy.h | 1 +
src/test/regress/expected/copy2.out | 7 +++++++
src/test/regress/sql/copy2.sql | 19 +++++++++++++++++++
6 files changed, 60 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index ee84d9661f..bdeed92e8e 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,7 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
- REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> | INFINITY }
+ REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> | <replaceable class="parameter">floating point</replaceable> | INFINITY }
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -424,6 +424,15 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
Otherwise, <command>COPY</command> discards the input row and continues
with the next one.
</para>
+ <para>
+ When a positive floating point value is specified, <command>COPY</command>
+ limits the maximum ratio of errors while converting a column's input
+ value into its data type.
+ If input data caused an error ratio greater than the specified value,
+ entire <command>COPY</command> fails.
+ Otherwise, <command>COPY</command> discards only the input rows where
+ errors occured and copies all the other rows.
+ </para>
<para>
When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all
the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f298614043..0930826d70 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -422,7 +422,8 @@ static CopyRejectLimits
defGetCopyRejectLimitOptions(DefElem *def)
{
CopyRejectLimits limits;
- int64 num_err;
+ uint64 num_err = 0;
+ float ratio_err = 0;
switch(nodeTag(def->arg))
{
@@ -433,14 +434,20 @@ defGetCopyRejectLimitOptions(DefElem *def)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("number for REJECT_LIMIT must be greater than zero")));
break;
+ case T_Float:
+ ratio_err = defGetNumeric(def);
+ if (ratio_err <= 0 || ratio_err >= 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("float for REJECT_LIMIT must be greater than zero and smaller than 1")));
+ break;
case T_String:
- if (pg_strcasecmp(defGetString(def), "INFINITY") == 0)
- /* when set to 0, it is treated as no limit */
- num_err = 0;
- else
+ if (pg_strcasecmp(defGetString(def), "INFINITY") != 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("string for REJECT_LIMIT must be 'INFINITY'")));
+
+ /* when set to 0, it is treated as no limit */
break;
default:
ereport(ERROR,
@@ -448,6 +455,7 @@ defGetCopyRejectLimitOptions(DefElem *def)
errmsg("value for REJECT_LIMIT must be positive integer or 'INFINITY'")));
}
limits.num_err = num_err;
+ limits.ratio_err = ratio_err;
return limits;
}
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 65d950ef07..69485a25a1 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -646,6 +646,7 @@ CopyFrom(CopyFromState cstate)
int64 processed = 0;
int64 excluded = 0;
int64 skipped = 0;
+ double ratio_err = 0;
bool has_before_insert_row_trig;
bool has_instead_insert_row_trig;
bool leafpart_use_multi_insert = false;
@@ -1310,6 +1311,15 @@ CopyFrom(CopyFromState cstate)
CopyMultiInsertInfoFlush(&multiInsertInfo, NULL, &processed);
}
+ ratio_err = (double) skipped / (processed + skipped);
+ if (cstate->opts.reject_limits.ratio_err &&
+ cstate->opts.reject_limits.ratio_err < ratio_err)
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("exceeded the ratio specified by REJECT_LIMIT \"%f\", the error ratio was: \"%f\"",
+ cstate->opts.reject_limits.ratio_err,
+ ratio_err)));
+
/* Done, clean up */
error_context_stack = errcallback.previous;
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 86ba1cbe16..ad2606763d 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -56,6 +56,7 @@ typedef enum CopyLogVerbosityChoice
typedef struct CopyRejectLimits
{
int64 num_err; /* maximum tolerable number of errors */
+ float ratio_err; /* maximum tolerable ratio of errors */
} CopyRejectLimits;
/*
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index be04bac6b1..584abac15c 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -122,6 +122,8 @@ LINE 1: COPY x to stdout (log_verbosity unsupported);
^
COPY x from stdin with (reject_limit 0);
ERROR: number for REJECT_LIMIT must be greater than zero
+COPY x from stdin with (reject_limit 1.1);
+ERROR: float for REJECT_LIMIT must be greater than zero and smaller than 1
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -803,6 +805,11 @@ COPY check_ign_err FROM STDIN WITH (reject_limit 4);
NOTICE: 4 rows were skipped due to data type incompatibility
COPY check_ign_err FROM STDIN WITH (reject_limit 'INFINITY');
NOTICE: 4 rows were skipped due to data type incompatibility
+COPY check_ign_err FROM STDIN WITH (reject_limit 0.6);
+ERROR: exceeded the ratio specified by REJECT_LIMIT "0.600000", the error ratio was: "0.666667"
+CONTEXT: COPY check_ign_err, line 7: ""
+COPY check_ign_err FROM STDIN WITH (reject_limit 0.7);
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 2ddc50ce8d..39f44d941f 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -84,6 +84,7 @@ COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
COPY x from stdin with (reject_limit 0);
+COPY x from stdin with (reject_limit 1.1);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -587,6 +588,24 @@ a {7} 7
10 {10} 10
\.
+COPY check_ign_err FROM STDIN WITH (reject_limit 0.6);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (reject_limit 0.7);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
base-commit: 86d33987e8b0364b468c9b40c5f2a0a1aed87ef1
--
2.39.2
On 2024/07/17 22:21, torikoshia wrote:
On 2024-07-03 02:07, Fujii Masao wrote:
However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary.
I remembered another reason for the necessity of ON_ERROR.
ON_ERROR defines how to behave when encountering an error and it just accepts 'ignore' and 'stop' currently, but is expected to support other options such as saving details of errors to a table[1].
Wouldn't it be better to separate the option specifying where
error details are output from the ON_ERROR option
(which determines behavior when encountering errors)?
"table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi Torikoshia,
On Wed, Jul 17, 2024 at 9:21 PM torikoshia <torikoshia@oss.nttdata.com> wrote:
On 2024-07-03 02:07, Fujii Masao wrote:
However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR
option is still necessary.I remembered another reason for the necessity of ON_ERROR.
ON_ERROR defines how to behave when encountering an error and it just
accepts 'ignore' and 'stop' currently, but is expected to support other
options such as saving details of errors to a table[1].
ON_ERROR=stop is a synonym for REJECT_LIMIT=infinity, but I imagine
REJECT_LIMIT would not replace future options of ON_ERROR.Considering this and the option we want to add this time is to specify
an upper limit on the number or ratio of errors, the name of this option
like "reject_limit" seems better than "ignore_errors".On Fri, Jul 5, 2024 at 4:13 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:On 2024-07-05 12:59, Fujii Masao wrote:
On 2024/07/04 12:05, torikoshia wrote:
I'm going to update it after discussing the option format as
described
below.Updated the patch.
0001 sets limit by the absolute number of error rows and 0002 sets limit
by ratio of the error.
In patch 0002, the ratio is calculated by the already skipped/processed
rows, but what if a user wants to copy 1000 rows, and he/she can tolerate
10 error rows, so he/she might set *reject_limit 0.01*, but one bad row in the
first 100 rows will fail the entire command, this might surprise the user.
This case can be resolved by 0001 *reject_limit 10*, so I think the *by ratio*
is less useful.
If we choose "all" as the keyword, renaming the option to
IGNORE_ERRORS
might be more intuitive and easier to understand than REJECT_LIMIT.I feel that 'infinite' and 'unlimited' are unfamiliar values for
PostgreSQL parameters, so 'all' might be better and IGNORE_ERRORS would
be a better parameter name as your suggestion.As described above, attached patch adopts REJECT_LIMIT, so it uses
"infinity".This makes me think it might be better to treat REJECT_LIMIT as
an additional option for ON_ERROR=stop instead of ON_ERROR=ignore
if we adopt your patch. Since ON_ERROR=stop is the default,
users could set the maximum number of allowed errors by specifying
only REJECT_LIMIT. Otherwise, they would need to specify both
ON_ERROR=ignore and REJECT_LIMIT.That makes sense.
On my second thought, whatever value ON_ERROR is specified(e.g. ignore,
stop, table), it seems fine to use REJECT_LIMIT.
I feel REJECT_LIMIT has both "ignore" and "stop" characteristics,
meaning it ignores errors until it reaches REJECT_LIMIT and stops when
it exceeds the REJECT_LIMIT.
And REJECT_LIMIT seems orthogonal to 'table', which specifies where to
save error details.Attached patch allows using REJECT_LIMIT regardless of the ON_ERROR
option value.[1]
/messages/by-id/CACJufxH_OJpVra=0c4ow8fbxHj7heMcVaTNEPa5vAurSeNA-6Q@mail.gmail.com--
Regards,--
Atsushi Torikoshi
NTT DATA Group Corporation
--
Regards
Junwang Zhao
On 2024/07/19 22:03, Fujii Masao wrote:
On 2024/07/17 22:21, torikoshia wrote:
On 2024-07-03 02:07, Fujii Masao wrote:
However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR option is still necessary.
I remembered another reason for the necessity of ON_ERROR.
ON_ERROR defines how to behave when encountering an error and it just accepts 'ignore' and 'stop' currently, but is expected to support other options such as saving details of errors to a table[1].
Wouldn't it be better to separate the option specifying where
error details are output from the ON_ERROR option
(which determines behavior when encountering errors)?
"table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop.
I still find it odd to accept "table" as a value for ON_ERROR. However,
"set_to_null" or "replace-column" proposed in [1]/messages/by-id/CAKFQuwawy1e6YR4S=j+y7pXqg_Dw1WBVrgvf=BP3d1_aSfe_+Q@mail.gmail.com seem valid for
ON_ERROR. So, I'm okay with keeping the ON_ERROR option.
On my second thought, whatever value ON_ERROR is specified(e.g. ignore, stop, table), it seems fine to use REJECT_LIMIT.
I feel REJECT_LIMIT has both "ignore" and "stop" characteristics, meaning it ignores errors until it reaches REJECT_LIMIT and stops when it exceeds the REJECT_LIMIT.
ON_ERROR specifies how to handle errors, and "stop" means to fail
the command. So, if ON_ERROR=stop, REJECT_LIMIT should have no effect,
and the command should fail immediately upon encountering an error.
As in your original proposal, I now think REJECT_LIMIT should only
apply when ON_ERROR=ignore. The command would ignore errors and
continue processing, but if the number of errors exceeds REJECT_LIMIT,
the command should fail. Thought?
BTW if "set_to_null" is supported someday, REJECT_LIMIT can also
apply. The command would cinsert NULL into the target table upon
encountering errors and continue, but fail if the number of errors
exceed REJECT_LIMIT.
Regards,
[1]: /messages/by-id/CAKFQuwawy1e6YR4S=j+y7pXqg_Dw1WBVrgvf=BP3d1_aSfe_+Q@mail.gmail.com
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
Thanks for the comment.
In patch 0002, the ratio is calculated by the already skipped/processed
rows, but what if a user wants to copy 1000 rows, and he/she can
tolerate
10 error rows, so he/she might set *reject_limit 0.01*, but one bad row
in the
first 100 rows will fail the entire command, this might surprise the
user.
Since the ratio is calculated after all data is processed, the case "one
bad row in the first 100 rows will fail the entire command" doesn't
happen:
=# \! wc -l 1000rows-with-10err.data
1000 1000rows-with-10err.data
=# COPY t1 from '1000rows-with-10err.data' with (log_verbosity
verbose, reject_limit 0.01);
NOTICE: skipping row due to data type incompatibility at line 10 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 11 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 12 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 13 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 14 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 15 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 16 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 17 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 18 for
column i: "a"
NOTICE: skipping row due to data type incompatibility at line 19 for
column i: "a"
NOTICE: 10 rows were skipped due to data type incompatibility
COPY 990
On 2024-07-20 02:08, Fujii Masao wrote:
On 2024/07/19 22:03, Fujii Masao wrote:
On 2024/07/17 22:21, torikoshia wrote:
On 2024-07-03 02:07, Fujii Masao wrote:
However, if we support REJECT_LIMIT, I'm not sure if the ON_ERROR
option is still necessary.I remembered another reason for the necessity of ON_ERROR.
ON_ERROR defines how to behave when encountering an error and it just
accepts 'ignore' and 'stop' currently, but is expected to support
other options such as saving details of errors to a table[1].Wouldn't it be better to separate the option specifying where
error details are output from the ON_ERROR option
(which determines behavior when encountering errors)?
"table" seems valid for both ON_ERROR=ignore and ON_ERROR=stop.I still find it odd to accept "table" as a value for ON_ERROR. However,
"set_to_null" or "replace-column" proposed in [1] seem valid for
ON_ERROR. So, I'm okay with keeping the ON_ERROR option.
Agreed.
On my second thought, whatever value ON_ERROR is specified(e.g.
ignore, stop, table), it seems fine to use REJECT_LIMIT.
I feel REJECT_LIMIT has both "ignore" and "stop" characteristics,
meaning it ignores errors until it reaches REJECT_LIMIT and stops when
it exceeds the REJECT_LIMIT.ON_ERROR specifies how to handle errors, and "stop" means to fail
the command. So, if ON_ERROR=stop, REJECT_LIMIT should have no effect,
and the command should fail immediately upon encountering an error.As in your original proposal, I now think REJECT_LIMIT should only
apply when ON_ERROR=ignore. The command would ignore errors and
continue processing, but if the number of errors exceeds REJECT_LIMIT,
the command should fail. Thought?
Makes sense.
Updated the patch.
BTW if "set_to_null" is supported someday, REJECT_LIMIT can also
apply. The command would cinsert NULL into the target table upon
encountering errors and continue, but fail if the number of errors
exceed REJECT_LIMIT.
Agreed.
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments:
v3-0001-Add-new-COPY-option-REJECT_LIMIT-number.patchtext/x-diff; name=v3-0001-Add-new-COPY-option-REJECT_LIMIT-number.patchDownload
From 594e578bf1633b8a24f6378436dc349664de6ba4 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 22 Jul 2024 21:19:38 +0900
Subject: [PATCH v3] Add new COPY option REJECT_LIMIT number
---
doc/src/sgml/ref/copy.sgml | 22 +++++++++++++
src/backend/commands/copy.c | 49 +++++++++++++++++++++++++++++
src/backend/commands/copyfrom.c | 6 ++++
src/include/commands/copy.h | 10 ++++++
src/test/regress/expected/copy2.out | 14 +++++++++
src/test/regress/sql/copy2.sql | 30 ++++++++++++++++++
6 files changed, 131 insertions(+)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..beb455372d 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
+ REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> | INFINITY }
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -411,6 +412,27 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ When a positive integer value is specified, <command>COPY</command> limits
+ the maximum tolerable number of errors while converting a column's input
+ value into its data type.
+ If input data caused more errors than the specified value, entire
+ <command>COPY</command> fails.
+ Otherwise, <command>COPY</command> discards the input row and continues
+ with the next one.
+ This option must be used with <literal>ON_ERROR</literal> to be set to
+ other than <literal>stop</literal>.
+ </para>
+ <para>
+ When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all
+ the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>ENCODING</literal></term>
<listitem>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index df7a4a21c9..8fbab9336f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -415,6 +415,43 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
return COPY_ON_ERROR_STOP; /* keep compiler quiet */
}
+/*
+ * Extract a CopyRejectLimits values from a DefElem.
+ */
+static CopyRejectLimits
+defGetCopyRejectLimitOptions(DefElem *def)
+{
+ CopyRejectLimits limits;
+ int64 num_err;
+
+ switch(nodeTag(def->arg))
+ {
+ case T_Integer:
+ num_err = defGetInt64(def);
+ if (num_err <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number for REJECT_LIMIT must be greater than zero")));
+ break;
+ case T_String:
+ if (pg_strcasecmp(defGetString(def), "INFINITY") == 0)
+ /* when set to 0, it is treated as no limit */
+ num_err = 0;
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("string for REJECT_LIMIT must be 'INFINITY'")));
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("value for REJECT_LIMIT must be positive integer or 'INFINITY'")));
+ }
+ limits.num_err = num_err;
+
+ return limits;
+}
+
/*
* Extract a CopyLogVerbosityChoice value from a DefElem.
*/
@@ -466,6 +503,7 @@ ProcessCopyOptions(ParseState *pstate,
bool header_specified = false;
bool on_error_specified = false;
bool log_verbosity_specified = false;
+ bool reject_limit_specified = false;
ListCell *option;
/* Support external use for option sanity checking */
@@ -632,6 +670,17 @@ ProcessCopyOptions(ParseState *pstate,
log_verbosity_specified = true;
opts_out->log_verbosity = defGetCopyLogVerbosityChoice(defel, pstate);
}
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ if (reject_limit_specified)
+ errorConflictingDefElem(defel, pstate);
+ if (!opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop")));
+ reject_limit_specified = true;
+ opts_out->reject_limits = defGetCopyRejectLimitOptions(defel);
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index ce4d62e707..65d950ef07 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1012,6 +1012,12 @@ CopyFrom(CopyFromState cstate)
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
++skipped);
+ if (cstate->opts.reject_limits.num_err &&
+ skipped > cstate->opts.reject_limits.num_err)
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"",
+ (long long) cstate->opts.reject_limits.num_err)));
continue;
}
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 141fd48dc1..86ba1cbe16 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -49,6 +49,15 @@ typedef enum CopyLogVerbosityChoice
COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */
} CopyLogVerbosityChoice;
+/*
+ * A struct to hold reject_limit options, in a parsed form.
+ * More values to be added in another patch.
+ */
+typedef struct CopyRejectLimits
+{
+ int64 num_err; /* maximum tolerable number of errors */
+} CopyRejectLimits;
+
/*
* A struct to hold COPY options, in a parsed form. All of these are related
* to formatting, except for 'freeze', which doesn't really belong here, but
@@ -83,6 +92,7 @@ typedef struct CopyFormatOptions
bool convert_selectively; /* do selective binary conversion? */
CopyOnErrorChoice on_error; /* what to do when error happened */
CopyLogVerbosityChoice log_verbosity; /* verbosity of logged messages */
+ CopyRejectLimits reject_limits; /* thresholds of reject_limit */
List *convert_select; /* list of column names (can be NIL) */
} CopyFormatOptions;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 931542f268..8ce12f5b6f 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -85,6 +85,10 @@ COPY x from stdin (log_verbosity default, log_verbosity verbose);
ERROR: conflicting or redundant options
LINE 1: COPY x from stdin (log_verbosity default, log_verbosity verb...
^
+COPY x from stdin (on_error ignore, reject_limit 'INFINITY', reject_limit 3);
+ERROR: conflicting or redundant options
+LINE 1: ... stdin (on_error ignore, reject_limit 'INFINITY', reject_lim...
+ ^
-- incorrect options
COPY x to stdin (format BINARY, delimiter ',');
ERROR: cannot specify DELIMITER in BINARY mode
@@ -116,6 +120,8 @@ COPY x to stdout (log_verbosity unsupported);
ERROR: COPY LOG_VERBOSITY "unsupported" not recognized
LINE 1: COPY x to stdout (log_verbosity unsupported);
^
+COPY x from stdin with (on_error ignore, reject_limit 0);
+ERROR: number for REJECT_LIMIT must be greater than zero
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -789,6 +795,14 @@ CONTEXT: COPY check_ign_err, line 1: "1 {1}"
COPY check_ign_err FROM STDIN WITH (on_error ignore);
ERROR: extra data after last expected column
CONTEXT: COPY check_ign_err, line 1: "1 {1} 3 abc"
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+ERROR: exceeded the number specified by REJECT_LIMIT "3"
+CONTEXT: COPY check_ign_err, line 5, column n: ""
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+NOTICE: 4 rows were skipped due to data type incompatibility
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 'INFINITY');
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 8b14962194..8cc82990f4 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -68,6 +68,7 @@ COPY x from stdin (convert_selectively (a), convert_selectively (b));
COPY x from stdin (encoding 'sql_ascii', encoding 'sql_ascii');
COPY x from stdin (on_error ignore, on_error ignore);
COPY x from stdin (log_verbosity default, log_verbosity verbose);
+COPY x from stdin (on_error ignore, reject_limit 'INFINITY', reject_limit 3);
-- incorrect options
COPY x to stdin (format BINARY, delimiter ',');
@@ -82,6 +83,7 @@ COPY x to stdout (format TEXT, force_null(a));
COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
+COPY x from stdin with (on_error ignore, reject_limit 0);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -557,6 +559,34 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore);
1 {1} 3 abc
\.
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 'INFINITY');
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
--
2.39.2
v3-0002-Add-new-COPY-option-REJECT_LIMIT-ratio.patchtext/x-diff; name=v3-0002-Add-new-COPY-option-REJECT_LIMIT-ratio.patchDownload
From 87894c42dfa222a273170f4783ef8f3fae9537f5 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 22 Jul 2024 21:23:42 +0900
Subject: [PATCH v3] Add new COPY option REJECT_LIMIT ratio
---
doc/src/sgml/ref/copy.sgml | 11 ++++++++++-
src/backend/commands/copy.c | 18 +++++++++++++-----
src/backend/commands/copyfrom.c | 10 ++++++++++
src/include/commands/copy.h | 1 +
src/test/regress/expected/copy2.out | 7 +++++++
src/test/regress/sql/copy2.sql | 19 +++++++++++++++++++
6 files changed, 60 insertions(+), 6 deletions(-)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index beb455372d..e2e0cc9665 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,7 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
- REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> | INFINITY }
+ REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> | <replaceable class="parameter">floating point</replaceable> | INFINITY }
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -426,6 +426,15 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
This option must be used with <literal>ON_ERROR</literal> to be set to
other than <literal>stop</literal>.
</para>
+ <para>
+ When a positive floating point value is specified, <command>COPY</command>
+ limits the maximum ratio of errors while converting a column's input
+ value into its data type.
+ If input data caused an error ratio greater than the specified value,
+ entire <command>COPY</command> fails.
+ Otherwise, <command>COPY</command> discards only the input rows where
+ errors occured and copies all the other rows.
+ </para>
<para>
When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all
the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 8fbab9336f..b3cc63d44f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -422,7 +422,8 @@ static CopyRejectLimits
defGetCopyRejectLimitOptions(DefElem *def)
{
CopyRejectLimits limits;
- int64 num_err;
+ uint64 num_err = 0;
+ double ratio_err = 0;
switch(nodeTag(def->arg))
{
@@ -433,14 +434,20 @@ defGetCopyRejectLimitOptions(DefElem *def)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("number for REJECT_LIMIT must be greater than zero")));
break;
+ case T_Float:
+ ratio_err = defGetNumeric(def);
+ if (ratio_err <= 0 || ratio_err >= 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("float for REJECT_LIMIT must be greater than zero and smaller than 1")));
+ break;
case T_String:
- if (pg_strcasecmp(defGetString(def), "INFINITY") == 0)
- /* when set to 0, it is treated as no limit */
- num_err = 0;
- else
+ if (pg_strcasecmp(defGetString(def), "INFINITY") != 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("string for REJECT_LIMIT must be 'INFINITY'")));
+
+ /* when set to 0, it is treated as no limit */
break;
default:
ereport(ERROR,
@@ -448,6 +455,7 @@ defGetCopyRejectLimitOptions(DefElem *def)
errmsg("value for REJECT_LIMIT must be positive integer or 'INFINITY'")));
}
limits.num_err = num_err;
+ limits.ratio_err = ratio_err;
return limits;
}
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 65d950ef07..69485a25a1 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -646,6 +646,7 @@ CopyFrom(CopyFromState cstate)
int64 processed = 0;
int64 excluded = 0;
int64 skipped = 0;
+ double ratio_err = 0;
bool has_before_insert_row_trig;
bool has_instead_insert_row_trig;
bool leafpart_use_multi_insert = false;
@@ -1310,6 +1311,15 @@ CopyFrom(CopyFromState cstate)
CopyMultiInsertInfoFlush(&multiInsertInfo, NULL, &processed);
}
+ ratio_err = (double) skipped / (processed + skipped);
+ if (cstate->opts.reject_limits.ratio_err &&
+ cstate->opts.reject_limits.ratio_err < ratio_err)
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("exceeded the ratio specified by REJECT_LIMIT \"%f\", the error ratio was: \"%f\"",
+ cstate->opts.reject_limits.ratio_err,
+ ratio_err)));
+
/* Done, clean up */
error_context_stack = errcallback.previous;
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 86ba1cbe16..7ea88da894 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -56,6 +56,7 @@ typedef enum CopyLogVerbosityChoice
typedef struct CopyRejectLimits
{
int64 num_err; /* maximum tolerable number of errors */
+ double ratio_err; /* maximum tolerable ratio of errors */
} CopyRejectLimits;
/*
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 8ce12f5b6f..173bd7170e 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -122,6 +122,8 @@ LINE 1: COPY x to stdout (log_verbosity unsupported);
^
COPY x from stdin with (on_error ignore, reject_limit 0);
ERROR: number for REJECT_LIMIT must be greater than zero
+COPY x from stdin with (on_error ignore, reject_limit 1.1);
+ERROR: float for REJECT_LIMIT must be greater than zero and smaller than 1
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -803,6 +805,11 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
NOTICE: 4 rows were skipped due to data type incompatibility
COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 'INFINITY');
NOTICE: 4 rows were skipped due to data type incompatibility
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 0.6);
+ERROR: exceeded the ratio specified by REJECT_LIMIT "0.600000", the error ratio was: "0.666667"
+CONTEXT: COPY check_ign_err, line 7: ""
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 0.7);
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 8cc82990f4..1b90ba0b35 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -84,6 +84,7 @@ COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
COPY x from stdin with (on_error ignore, reject_limit 0);
+COPY x from stdin with (on_error ignore, reject_limit 1.1);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -587,6 +588,24 @@ a {7} 7
10 {10} 10
\.
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 0.6);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 0.7);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
--
2.39.2
Hi! Nice feature.
Few comments:
+ When a positive integer value is specified, <command>COPY</command> limits + the maximum tolerable number of errors while converting a column's input + value into its data type.
If nothing is specified, then the maximum tolerable number of errors
is one, right? Should we state this explicitly in the documentation?
+COPY x from stdin with (on_error ignore, reject_limit 0);
How about a test where reject_limit is a string, but not
(case-intensively) 'infinity'?
+ CopyRejectLimits reject_limits; /* thresholds of reject_limit */
Why are there multiple thresholds? Can we have only one?
Other places look good to me.
On 2024/07/22 21:37, torikoshia wrote:
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com> wrote:
Thanks for the comment.In patch 0002, the ratio is calculated by the already skipped/processed
rows, but what if a user wants to copy 1000 rows, and he/she can tolerate
10 error rows, so he/she might set *reject_limit 0.01*, but one bad row in the
first 100 rows will fail the entire command, this might surprise the user.Since the ratio is calculated after all data is processed, the case "one bad row in the first 100 rows will fail the entire command" doesn't happen:
Yes, but is this the desired behavior when using the ratio threshold?
I was thinking that COPY should fail once the error ratio (errors vs.
total rows in the input file) reaches the threshold. Since it's
difficult to know the total number of rows in the input file,
implementing this seems not easy, though.
Updated the patch.
Thanks for updating the patch!
+ This option must be used with <literal>ON_ERROR</literal> to be set to
+ other than <literal>stop</literal>.
Regarding the ON_ERROR option, now it only has two values.
Instead of saying "other than stop," we should simply use "ignore"
for clarity and intuition?
+ When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all
+ the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.
For the INFINITY value, the description is a bit unclear.
As I understand it, INFINITY is the default for REJECT_LIMIT.
So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is used.
In line with my previous suggestion, if we support only REJECT_LIMIT
without ON_ERROR, having INFINITY makes sense. However,
with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
Users can just set ON_ERROR=ignore to ignore all errors,
making INFINITY unnecessary for REJECT_LIMIT. This is open for
discussion, but I believe it's better to remove INFINITY from
the REJECT_LIMIT options.
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ if (reject_limit_specified)
+ errorConflictingDefElem(defel, pstate);
+ if (!opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop")));
Using "ignore" instead of "other than stop" in the error message
is clearer and more intuitive.
Checking if ON_ERROR and REJECT_LIMIT are specified should be
done after the foreach() processing of the options. Otherwise,
if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can occur.
---------------
=# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop
---------------
+ ereport(ERROR,
+ (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+ errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"",
+ (long long) cstate->opts.reject_limits.num_err)));
The error message isn't clear about what exceeded REJECT_LIMIT.
How about saying "skipped more than REJECT_LIMIT rows" or something instead?
+/*
+ * A struct to hold reject_limit options, in a parsed form.
+ * More values to be added in another patch.
+ */
The latter comment doesn't seem necessary or helpful.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
Thanks for your review.
On 2024/07/22 21:37, torikoshia wrote:
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com>
wrote:
Thanks for the comment.In patch 0002, the ratio is calculated by the already
skipped/processed
rows, but what if a user wants to copy 1000 rows, and he/she can
tolerate
10 error rows, so he/she might set *reject_limit 0.01*, but one bad
row in the
first 100 rows will fail the entire command, this might surprise the
user.Since the ratio is calculated after all data is processed, the case
"one bad row in the first 100 rows will fail the entire command"
doesn't happen:Yes, but is this the desired behavior when using the ratio threshold?
I was thinking that COPY should fail once the error ratio (errors vs.
total rows in the input file) reaches the threshold.
Users might expect like you.
However, implementing it would need a row number counting feature as you
pointed out, and it seems an overkill.
Describing the difference between ratio and number in the manual might
help, but
it might be better to make REJECT_LIMIT support only the number of
errors and leave it to the user to calculate the number from the ratio.
I'd like to hear if anyone has an opinion on the need for supporting
ratio.
I remember David prefers ratio[1]/messages/by-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_=zuWjkz1+25Nd8bpsrDkx5Q@mail.gmail.com.
+ This option must be used with <literal>ON_ERROR</literal> to be set to + other than <literal>stop</literal>.Regarding the ON_ERROR option, now it only has two values.
Instead of saying "other than stop," we should simply use "ignore"
for clarity and intuition?
I'll Modify it.
The reason for the roundabout wording was the expectation that on_error
would support values other than these in the future, but as you point
out, there are currently only two.
+ When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all + the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.For the INFINITY value, the description is a bit unclear.
As I understand it, INFINITY is the default for REJECT_LIMIT.
So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is
used.In line with my previous suggestion, if we support only REJECT_LIMIT
without ON_ERROR, having INFINITY makes sense. However,
with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
Users can just set ON_ERROR=ignore to ignore all errors,
making INFINITY unnecessary for REJECT_LIMIT. This is open for
discussion, but I believe it's better to remove INFINITY from
the REJECT_LIMIT options.
Agreed.
Unless there are opposing opinions, I'm going to remove 'INFINITY' in
the next patch.
+ else if (strcmp(defel->defname, "reject_limit") == 0) + { + if (reject_limit_specified) + errorConflictingDefElem(defel, pstate); + if (!opts_out->on_error) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop")));Using "ignore" instead of "other than stop" in the error message
is clearer and more intuitive.
Agreed.
Checking if ON_ERROR and REJECT_LIMIT are specified should be
done after the foreach() processing of the options. Otherwise,
if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can
occur.---------------
=# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop
---------------
Ugh, I'll modify it.
+ ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"", + (long long) cstate->opts.reject_limits.num_err)));The error message isn't clear about what exceeded REJECT_LIMIT.
How about saying "skipped more than REJECT_LIMIT rows" or something
instead?
Agreed.
+/* + * A struct to hold reject_limit options, in a parsed form. + * More values to be added in another patch. + */The latter comment doesn't seem necessary or helpful.
Agreed.
[1]: /messages/by-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_=zuWjkz1+25Nd8bpsrDkx5Q@mail.gmail.com
/messages/by-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_=zuWjkz1+25Nd8bpsrDkx5Q@mail.gmail.com
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
On 2024-07-23 02:06, Kirill Reshke wrote:
Thanks for your review.
Few comments:
+ When a positive integer value is specified, <command>COPY</command> limits + the maximum tolerable number of errors while converting a column's input + value into its data type.If nothing is specified, then the maximum tolerable number of errors
is one, right? Should we state this explicitly in the documentation?
REJECT_LIMIT now can be used wonly when on_error=ignore, I think the
default(when only on_error=ignore is specified) is unlimited.
Anyway, I'm going to add a description about the default.
+COPY x from stdin with (on_error ignore, reject_limit 0);
How about a test where reject_limit is a string, but not
(case-intensively) 'infinity'?
Considering the discussion in[1]/messages/by-id/5f807fcf3a36df7ba41464ab40b5c37d@oss.nttdata.com, I'm now going to remove 'infinity'.
+ CopyRejectLimits reject_limits; /* thresholds of reject_limit */
Why are there multiple thresholds? Can we have only one?
This is because I thought it'd be more convenient to support both the
number and the ratio of error, but I'm also beginning to think that it
might be better to support only the number of cases, as discussed in
[1]: /messages/by-id/5f807fcf3a36df7ba41464ab40b5c37d@oss.nttdata.com
[1]: /messages/by-id/5f807fcf3a36df7ba41464ab40b5c37d@oss.nttdata.com
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Updated the patch.
On 2024-07-23 23:06, torikoshia wrote:
On Tue, Jul 23, 2024 at 1:35 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:Thanks for your review.
On 2024/07/22 21:37, torikoshia wrote:
On Fri, Jul 19, 2024 at 11:48 PM Junwang Zhao <zhjwpku@gmail.com>
wrote:
Thanks for the comment.In patch 0002, the ratio is calculated by the already
skipped/processed
rows, but what if a user wants to copy 1000 rows, and he/she can
tolerate
10 error rows, so he/she might set *reject_limit 0.01*, but one bad
row in the
first 100 rows will fail the entire command, this might surprise the
user.Since the ratio is calculated after all data is processed, the case
"one bad row in the first 100 rows will fail the entire command"
doesn't happen:Yes, but is this the desired behavior when using the ratio threshold?
I was thinking that COPY should fail once the error ratio (errors vs.
total rows in the input file) reaches the threshold.Users might expect like you.
However, implementing it would need a row number counting feature as
you pointed out, and it seems an overkill.Describing the difference between ratio and number in the manual might
help, but
it might be better to make REJECT_LIMIT support only the number of
errors and leave it to the user to calculate the number from the
ratio.I'd like to hear if anyone has an opinion on the need for supporting
ratio.
Since there was no opinion about it, attached a patch only for
REJECT_LIMIT specifying number.
+ This option must be used with <literal>ON_ERROR</literal> to be set to + other than <literal>stop</literal>.Regarding the ON_ERROR option, now it only has two values.
Instead of saying "other than stop," we should simply use "ignore"
for clarity and intuition?I'll Modify it.
The reason for the roundabout wording was the expectation that
on_error would support values other than these in the future, but as
you point out, there are currently only two.+ When specified <literal>INFINITY</literal>, <command>COPY</command> ignores all + the errors. This is a synonym for <literal>ON_ERROR</literal> <literal>ignore</literal>.For the INFINITY value, the description is a bit unclear.
As I understand it, INFINITY is the default for REJECT_LIMIT.
So, if ON_ERROR=ignore is set without specifying REJECT_LIMIT,
COPY will ignore all errors, similar to when REJECT_LIMIT=INFINITY is
used.In line with my previous suggestion, if we support only REJECT_LIMIT
without ON_ERROR, having INFINITY makes sense. However,
with ON_ERROR supported, REJECT_LIMIT=INFINITY seems redundant.
Users can just set ON_ERROR=ignore to ignore all errors,
making INFINITY unnecessary for REJECT_LIMIT. This is open for
discussion, but I believe it's better to remove INFINITY from
the REJECT_LIMIT options.Agreed.
Unless there are opposing opinions, I'm going to remove 'INFINITY' in
the next patch.
As there are no opposing opinions, removed 'INFINITY' as well.
+ else if (strcmp(defel->defname, "reject_limit") == 0) + { + if (reject_limit_specified) + errorConflictingDefElem(defel, pstate); + if (!opts_out->on_error) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("REJECT_LIMIT requires ON_ERROR to be set to other than stop")));Using "ignore" instead of "other than stop" in the error message
is clearer and more intuitive.Agreed.
Checking if ON_ERROR and REJECT_LIMIT are specified should be
done after the foreach() processing of the options. Otherwise,
if REJECT_LIMIT is specified before ON_ERROR, unexpected errors can
occur.---------------
=# copy test from '...' WITH (REJECT_LIMIT 7, ON_ERROR 'ignore');
ERROR: REJECT_LIMIT requires ON_ERROR to be set to other than stop
---------------Ugh, I'll modify it.
+ ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("exceeded the number specified by REJECT_LIMIT \"%lld\"", + (long long) cstate->opts.reject_limits.num_err)));The error message isn't clear about what exceeded REJECT_LIMIT.
How about saying "skipped more than REJECT_LIMIT rows" or something
instead?Agreed.
+/* + * A struct to hold reject_limit options, in a parsed form. + * More values to be added in another patch. + */The latter comment doesn't seem necessary or helpful.
Agreed.
[1]
/messages/by-id/CAKFQuwYP91_G6tktYFTZq_CmkZ_=zuWjkz1+25Nd8bpsrDkx5Q@mail.gmail.com
On Tue, Jul 23, 2024 at 11:10 PM torikoshia <torikoshia@oss.nttdata.com>
wrote:
On 2024-07-23 02:06, Kirill Reshke wrote:
If nothing is specified, then the maximum tolerable number of errors
is one, right? Should we state this explicitly in the documentation?
REJECT_LIMIT now can be used wonly when on_error=ignore, I think the
default(when only on_error=ignore is specified) is unlimited.
Anyway, I'm going to add a description about the default.
Added the following description:
+ Just setting <literal>ON_ERROR</literal> to
<literal>ignore</literal>
+ tolerates unlimited number of errors.
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments:
v4-0001-Add-new-COPY-option-REJECT_LIMIT-number.patchtext/x-diff; name=v4-0001-Add-new-COPY-option-REJECT_LIMIT-number.patchDownload
From 0e92a90bd9f944b7f14845e942147d11f81e3b6f Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Tue, 24 Sep 2024 14:11:22 +0900
Subject: [PATCH v4] Add new COPY option REJECT_LIMIT number
9e2d870 enabled the COPY FROM command to skip soft errors, but there
was no limitation and skipped all the soft errors.
In some cases, there would be a limitation how many errors are
tolerable. This patch introduces REJECT_LIMIT to specify how many
errors can be skipped.
---
doc/src/sgml/ref/copy.sgml | 20 ++++++++++++++
src/backend/commands/copy.c | 41 +++++++++++++++++++++++++++++
src/backend/commands/copyfrom.c | 5 ++++
src/include/commands/copy.h | 1 +
src/test/regress/expected/copy2.out | 10 +++++++
src/test/regress/sql/copy2.sql | 21 +++++++++++++++
6 files changed, 98 insertions(+)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..3244384b1a 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
+ REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> }
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -411,6 +412,25 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ When a positive integer value is specified, <command>COPY</command> limits
+ the maximum tolerable number of errors while converting a column's input
+ value into its data type.
+ If input data caused more errors than the specified value, entire
+ <command>COPY</command> fails.
+ Otherwise, <command>COPY</command> discards the input row and continues
+ with the next one.
+ This option must be used with <literal>ON_ERROR</literal> to be set to
+ <literal>ignore</literal>.
+ Just setting <literal>ON_ERROR</literal> to <literal>ignore</literal>
+ tolerates unlimited number of errors.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>ENCODING</literal></term>
<listitem>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3bb579a3a4..91e01dbc9f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -418,6 +418,30 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
return COPY_ON_ERROR_STOP; /* keep compiler quiet */
}
+/*
+ * Extract REJECT_LIMIT value from a DefElem.
+ */
+static int64
+defGetCopyRejectLimitOptions(DefElem *def)
+{
+ int64 reject_limit;
+
+ if (nodeTag(def->arg) == T_Integer)
+ {
+ reject_limit = defGetInt64(def);
+ if (reject_limit <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number for REJECT_LIMIT must be greater than zero")));
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("value for REJECT_LIMIT must be positive integer")));
+
+ return reject_limit;
+}
+
/*
* Extract a CopyLogVerbosityChoice value from a DefElem.
*/
@@ -470,6 +494,7 @@ ProcessCopyOptions(ParseState *pstate,
bool header_specified = false;
bool on_error_specified = false;
bool log_verbosity_specified = false;
+ bool reject_limit_specified = false;
ListCell *option;
/* Support external use for option sanity checking */
@@ -636,6 +661,13 @@ ProcessCopyOptions(ParseState *pstate,
log_verbosity_specified = true;
opts_out->log_verbosity = defGetCopyLogVerbosityChoice(defel, pstate);
}
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ if (reject_limit_specified)
+ errorConflictingDefElem(defel, pstate);
+ reject_limit_specified = true;
+ opts_out->reject_limit = defGetCopyRejectLimitOptions(defel);
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -669,6 +701,15 @@ ProcessCopyOptions(ParseState *pstate,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("only ON_ERROR STOP is allowed in BINARY mode")));
+ if (opts_out->reject_limit && !opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ /*- translator: first and second %s are the names of COPY
+ * option, e.g. ON_ERROR, thrid is the value of the COPY option,
+ * e.g. IGNORE */
+ errmsg("COPY %s requires %s to be set to %s",
+ "REJECT_LIMIT", "ON_ERROR", "IGNORE")));
+
/* Set defaults for omitted options */
if (!opts_out->delim)
opts_out->delim = opts_out->csv_mode ? "," : "\t";
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 2d3462913e..c17fbf71bc 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1024,6 +1024,11 @@ CopyFrom(CopyFromState cstate)
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
++skipped);
+ if (cstate->opts.reject_limit && skipped > cstate->opts.reject_limit)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("skipped more than REJECT_LIMIT rows: \"%lld\",",
+ (long long) cstate->opts.reject_limit)));
continue;
}
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 141fd48dc1..c3a7613778 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -83,6 +83,7 @@ typedef struct CopyFormatOptions
bool convert_selectively; /* do selective binary conversion? */
CopyOnErrorChoice on_error; /* what to do when error happened */
CopyLogVerbosityChoice log_verbosity; /* verbosity of logged messages */
+ int64 reject_limit; /* maximum tolerable number of errors */
List *convert_select; /* list of column names (can be NIL) */
} CopyFormatOptions;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 61a19cdc4c..f63228f5bc 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -116,6 +116,10 @@ COPY x to stdout (log_verbosity unsupported);
ERROR: COPY LOG_VERBOSITY "unsupported" not recognized
LINE 1: COPY x to stdout (log_verbosity unsupported);
^
+COPY x from stdin with (reject_limit 1);
+ERROR: COPY REJECT_LIMIT requires ON_ERROR to be set to IGNORE
+COPY x from stdin with (on_error ignore, reject_limit 0);
+ERROR: number for REJECT_LIMIT must be greater than zero
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -789,6 +793,12 @@ CONTEXT: COPY check_ign_err, line 1: "1 {1}"
COPY check_ign_err FROM STDIN WITH (on_error ignore);
ERROR: extra data after last expected column
CONTEXT: COPY check_ign_err, line 1: "1 {1} 3 abc"
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+ERROR: skipped more than REJECT_LIMIT rows: "3",
+CONTEXT: COPY check_ign_err, line 5, column n: ""
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 8b14962194..2d775d9c97 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -82,6 +82,8 @@ COPY x to stdout (format TEXT, force_null(a));
COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
+COPY x from stdin with (reject_limit 1);
+COPY x from stdin with (on_error ignore, reject_limit 0);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -557,6 +559,25 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore);
1 {1} 3 abc
\.
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
base-commit: bbba59e69a56e1622e270f5e47b402c3a904cefc
--
2.39.2
On 2024/09/24 14:25, torikoshia wrote:
Updated the patch.
Thanks for updating the patch!
+ REJECT_LIMIT { <replaceable class="parameter">integer</replaceable> }
The curly braces {} seem unnecessary here.
+ When a positive integer value is specified, <command>COPY</command> limits
+ the maximum tolerable number of errors while converting a column's input
+ value into its data type.
+ If input data caused more errors than the specified value, entire
+ <command>COPY</command> fails.
+ Otherwise, <command>COPY</command> discards the input row and continues
+ with the next one.
+ This option must be used with <literal>ON_ERROR</literal> to be set to
+ <literal>ignore</literal>.
+ Just setting <literal>ON_ERROR</literal> to <literal>ignore</literal>
+ tolerates unlimited number of errors.
Regarding the description of REJECT_LIMIT, how about clarifying what the option specifies up front, like this:
----------------
Specifies the maximum number of errors tolerated while converting a column's
input value to its data type, when on_error is set to "ignore." If the input
causes more errors than the specified value, the COPY command fails,
even with on_error set to "ignore." This value must be positive and used with
on_error="ignore". If not specified, on_error="ignore" allows an unlimited
number of errors, meaning COPY will skip all erroneous data.
----------------
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number for REJECT_LIMIT must be greater than zero")));
+ }
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("value for REJECT_LIMIT must be positive integer")));
The phrases "number for" and "value for" seem unnecessary in the error messages.
Also, "positive number" should be "a positive number." Would it be better to display
the actual specified value in the error message, like: errmsg("REJECT_LIMIT (%s) must
be a positive number", value)?
Lastly, during my testing, if nothing is specified for REJECT_LIMIT
(e.g., COPY ... WITH (..., REJECT_LIMIT)), the COPY command caused a segmentation fault.
I'd like to hear if anyone has an opinion on the need for supporting ratio.
Since there was no opinion about it, attached a patch only for REJECT_LIMIT specifying number.
+1
As there are no opposing opinions, removed 'INFINITY' as well.
+1
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2024-09-25 02:22, Fujii Masao wrote:
Thanks for your review!
Attached v5 patch.
On 2024/09/24 14:25, torikoshia wrote:
Updated the patch.
Thanks for updating the patch!
+ REJECT_LIMIT { <replaceable
class="parameter">integer</replaceable> }The curly braces {} seem unnecessary here.
+ When a positive integer value is specified, <command>COPY</command> limits + the maximum tolerable number of errors while converting a column's input + value into its data type. + If input data caused more errors than the specified value, entire + <command>COPY</command> fails. + Otherwise, <command>COPY</command> discards the input row and continues + with the next one. + This option must be used with <literal>ON_ERROR</literal> to be set to + <literal>ignore</literal>. + Just setting <literal>ON_ERROR</literal> to <literal>ignore</literal> + tolerates unlimited number of errors.Regarding the description of REJECT_LIMIT, how about clarifying what
the option specifies up front, like this:----------------
Specifies the maximum number of errors tolerated while converting a
column's
input value to its data type, when on_error is set to "ignore." If the
input
causes more errors than the specified value, the COPY command fails,
even with on_error set to "ignore." This value must be positive and
used with
on_error="ignore". If not specified, on_error="ignore" allows an
unlimited
number of errors, meaning COPY will skip all erroneous data.
----------------
Thanks for writing it. It seems better.
+ ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("number for REJECT_LIMIT must be greater than zero"))); + } + else + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("value for REJECT_LIMIT must be positive integer")));The phrases "number for" and "value for" seem unnecessary in the error
messages.
Also, "positive number" should be "a positive number." Would it be
better to display
the actual specified value in the error message, like:
errmsg("REJECT_LIMIT (%s) must
be a positive number", value)?
Agreed.
Lastly, during my testing, if nothing is specified for REJECT_LIMIT
(e.g., COPY ... WITH (..., REJECT_LIMIT)), the COPY command caused a
segmentation fault.
Oh, thanks for finding it. Fixed.
I'd like to hear if anyone has an opinion on the need for supporting
ratio.Since there was no opinion about it, attached a patch only for
REJECT_LIMIT specifying number.+1
As there are no opposing opinions, removed 'INFINITY' as well.
+1
Regards,
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments:
v5-0001-Add-new-COPY-option-REJECT_LIMIT.patchtext/x-diff; name=v5-0001-Add-new-COPY-option-REJECT_LIMIT.patchDownload
From e40e378ac6eb8583b8cfbac07ec459acc03fc4b1 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Thu, 26 Sep 2024 22:05:28 +0900
Subject: [PATCH v5] Add new COPY option REJECT_LIMIT
9e2d870 enabled the COPY FROM command to skip soft errors, but there
was no limitation about the number of errors and skipped all the soft
errors.
In some cases there would be a limitation how many errors are
tolerable. This patch introduces REJECT_LIMIT to specify how many
errors can be skipped.
---
doc/src/sgml/ref/copy.sgml | 18 ++++++++++
src/backend/commands/copy.c | 51 +++++++++++++++++++++++++++++
src/backend/commands/copyfrom.c | 5 +++
src/include/commands/copy.h | 1 +
src/test/regress/expected/copy2.out | 10 ++++++
src/test/regress/sql/copy2.sql | 21 ++++++++++++
6 files changed, 106 insertions(+)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..08139116d4 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
+ REJECT_LIMIT <replaceable class="parameter">integer</replaceable>
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -411,6 +412,23 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ Specifies the maximum number of errors tolerated while converting a
+ column's input value to its data type, when <literal>ON_ERROR</literal> is
+ set to <literal>ignore</literal>.
+ If the input causes more errors than the specified value, the <command>COPY</command>
+ command fails, even with <literal>ON_ERROR</literal> set to <literal>ignore</literal>.
+ This value must be positive and used with <literal>ON_ERROR</literal>=<literal>ignore</literal>.
+ If not specified, <literal>ON_ERROR</literal>=<literal>ignore</literal>
+ allows an unlimited number of errors, meaning <command>COPY</command> will
+ skip all erroneous data.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>ENCODING</literal></term>
<listitem>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3bb579a3a4..398d1a2add 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -418,6 +418,40 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
return COPY_ON_ERROR_STOP; /* keep compiler quiet */
}
+/*
+ * Extract REJECT_LIMIT value from a DefElem.
+ */
+static int64
+defGetCopyRejectLimitOptions(DefElem *def)
+{
+ int64 reject_limit;
+
+ if (def->arg == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("REJECT_LIMIT requires a positive integer")));
+
+ if (nodeTag(def->arg) == T_Integer)
+ {
+ reject_limit = defGetInt64(def);
+ if (reject_limit <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%lld) must be greater than zero",
+ (long long) reject_limit)));
+ }
+ else
+ {
+ char *sval = defGetString(def);
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%s) must be a positive integer",
+ sval)));
+ }
+
+ return reject_limit;
+}
+
/*
* Extract a CopyLogVerbosityChoice value from a DefElem.
*/
@@ -470,6 +504,7 @@ ProcessCopyOptions(ParseState *pstate,
bool header_specified = false;
bool on_error_specified = false;
bool log_verbosity_specified = false;
+ bool reject_limit_specified = false;
ListCell *option;
/* Support external use for option sanity checking */
@@ -636,6 +671,13 @@ ProcessCopyOptions(ParseState *pstate,
log_verbosity_specified = true;
opts_out->log_verbosity = defGetCopyLogVerbosityChoice(defel, pstate);
}
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ if (reject_limit_specified)
+ errorConflictingDefElem(defel, pstate);
+ reject_limit_specified = true;
+ opts_out->reject_limit = defGetCopyRejectLimitOptions(defel);
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -669,6 +711,15 @@ ProcessCopyOptions(ParseState *pstate,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("only ON_ERROR STOP is allowed in BINARY mode")));
+ if (opts_out->reject_limit && !opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ /*- translator: first and second %s are the names of COPY
+ * option, e.g. ON_ERROR, third is the value of the COPY option,
+ * e.g. IGNORE */
+ errmsg("COPY %s requires %s to be set to %s",
+ "REJECT_LIMIT", "ON_ERROR", "IGNORE")));
+
/* Set defaults for omitted options */
if (!opts_out->delim)
opts_out->delim = opts_out->csv_mode ? "," : "\t";
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 2d3462913e..c17fbf71bc 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1024,6 +1024,11 @@ CopyFrom(CopyFromState cstate)
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
++skipped);
+ if (cstate->opts.reject_limit && skipped > cstate->opts.reject_limit)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("skipped more than REJECT_LIMIT rows: \"%lld\",",
+ (long long) cstate->opts.reject_limit)));
continue;
}
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 141fd48dc1..c3a7613778 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -83,6 +83,7 @@ typedef struct CopyFormatOptions
bool convert_selectively; /* do selective binary conversion? */
CopyOnErrorChoice on_error; /* what to do when error happened */
CopyLogVerbosityChoice log_verbosity; /* verbosity of logged messages */
+ int64 reject_limit; /* maximum tolerable number of errors */
List *convert_select; /* list of column names (can be NIL) */
} CopyFormatOptions;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 61a19cdc4c..8fb27d1d5f 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -116,6 +116,10 @@ COPY x to stdout (log_verbosity unsupported);
ERROR: COPY LOG_VERBOSITY "unsupported" not recognized
LINE 1: COPY x to stdout (log_verbosity unsupported);
^
+COPY x from stdin with (reject_limit 1);
+ERROR: COPY REJECT_LIMIT requires ON_ERROR to be set to IGNORE
+COPY x from stdin with (on_error ignore, reject_limit 0);
+ERROR: REJECT_LIMIT (0) must be greater than zero
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -789,6 +793,12 @@ CONTEXT: COPY check_ign_err, line 1: "1 {1}"
COPY check_ign_err FROM STDIN WITH (on_error ignore);
ERROR: extra data after last expected column
CONTEXT: COPY check_ign_err, line 1: "1 {1} 3 abc"
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+ERROR: skipped more than REJECT_LIMIT rows: "3",
+CONTEXT: COPY check_ign_err, line 5, column n: ""
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 8b14962194..2d775d9c97 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -82,6 +82,8 @@ COPY x to stdout (format TEXT, force_null(a));
COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
+COPY x from stdin with (reject_limit 1);
+COPY x from stdin with (on_error ignore, reject_limit 0);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -557,6 +559,25 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore);
1 {1} 3 abc
\.
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
base-commit: d66572d9fedb632af5df65ce513d04dc2a1682a3
--
2.39.2
+/*
+ * Extract REJECT_LIMIT value from a DefElem.
+ */
+static int64
+defGetCopyRejectLimitOptions(DefElem *def)
+{
+ int64 reject_limit;
+
+ if (def->arg == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("REJECT_LIMIT requires a positive integer")));
+
+ if (nodeTag(def->arg) == T_Integer)
+ {
+ reject_limit = defGetInt64(def);
+ if (reject_limit <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%lld) must be greater than zero",
+ (long long) reject_limit)));
+ }
+ else
+ {
+ char *sval = defGetString(def);
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%s) must be a positive integer",
+ sval)));
+ }
+
+ return reject_limit;
+}
there will be an integer overflow issue?
Can you try the following?
static int64
defGetCopyRejectLimitOptions(DefElem *def)
{
int64 reject_limit;
reject_limit = defGetInt64(def);
if (reject_limit <= 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("REJECT_LIMIT (%lld) must be greater than zero",
(long long) reject_limit)));
}
REJECT_LIMIT <replaceable class="parameter">integer</replaceable>
i think, you want REJECT_LIMIT be bigint?
so here it should be
REJECT_LIMIT <replaceable class="parameter">bigint</replaceable>\
?
On 2024-09-28 10:48, Jian he wrote:
Thanks for your comments!
+/* + * Extract REJECT_LIMIT value from a DefElem. + */ +static int64 +defGetCopyRejectLimitOptions(DefElem *def) +{ + int64 reject_limit; + + if (def->arg == NULL) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("REJECT_LIMIT requires a positive integer"))); + + if (nodeTag(def->arg) == T_Integer) + { + reject_limit = defGetInt64(def); + if (reject_limit <= 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("REJECT_LIMIT (%lld) must be greater than zero", + (long long) reject_limit))); + } + else + { + char *sval = defGetString(def); + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("REJECT_LIMIT (%s) must be a positive integer", + sval))); + } + + return reject_limit; +} there will be an integer overflow issue?Can you try the following?
Since defGetInt64() checks not only whether the input value exceeds the
range of bigint but also input value is specified, attached v6 patch
checks them by directly calling defGetInt64().
static int64
defGetCopyRejectLimitOptions(DefElem *def)
{
int64 reject_limit;
reject_limit = defGetInt64(def);
if (reject_limit <= 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("REJECT_LIMIT (%lld) must be greater than zero",
(long long) reject_limit)));
}REJECT_LIMIT <replaceable class="parameter">integer</replaceable>
i think, you want REJECT_LIMIT be bigint?
so here it should be
REJECT_LIMIT <replaceable class="parameter">bigint</replaceable>\
?
Hmm, bigint and integer include numbers less than 0, but REJECT_LIMIT
only accepts numbers greater than 0. I now feel both may not be
appropriate.
Referring to the manual of CREATE SEQUENCE[1]https://www.postgresql.org/docs/devel/sql-createsequence.html, here we may be able to
use not only the data types, such as bigint and integer but something
like minvalue, maxvalue.
I'm wondering if we can use the wording maxerror as in the attached
patch.
[1]: https://www.postgresql.org/docs/devel/sql-createsequence.html
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments:
v6-0001-Add-new-COPY-option-REJECT_LIMIT.patchtext/x-diff; name=v6-0001-Add-new-COPY-option-REJECT_LIMIT.patchDownload
From 55a99fc186c263cdd7741a38a9c684c9cb8ac1d1 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 30 Sep 2024 20:33:26 +0900
Subject: [PATCH v6] Add new COPY option REJECT_LIMIT
9e2d870 enabled the COPY FROM command to skip soft errors, but there
was no limitation about the number of errors and skipped all the soft
errors.
In some cases there would be a limitation how many errors are
tolerable. This patch introduces REJECT_LIMIT to specify how many
errors can be skipped.
---
doc/src/sgml/ref/copy.sgml | 19 ++++++++++++++++
src/backend/commands/copy.c | 34 +++++++++++++++++++++++++++++
src/backend/commands/copyfrom.c | 5 +++++
src/include/commands/copy.h | 1 +
src/test/regress/expected/copy2.out | 10 +++++++++
src/test/regress/sql/copy2.sql | 21 ++++++++++++++++++
6 files changed, 90 insertions(+)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 1518af8a04..2d645ed255 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
+ REJECT_LIMIT <replaceable class="parameter">maxerror</replaceable>
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -411,6 +412,24 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ Specifies the maximum number of errors tolerated while converting a
+ column's input value to its data type, when <literal>ON_ERROR</literal> is
+ set to <literal>ignore</literal>.
+ If the input causes more errors than the specified value, the <command>COPY</command>
+ command fails, even with <literal>ON_ERROR</literal> set to <literal>ignore</literal>.
+ This clause must be used with <literal>ON_ERROR</literal>=<literal>ignore</literal>
+ and <replaceable class="parameter">maxerror</replaceable> must be positive.
+ If not specified, <literal>ON_ERROR</literal>=<literal>ignore</literal>
+ allows an unlimited number of errors, meaning <command>COPY</command> will
+ skip all erroneous data.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>ENCODING</literal></term>
<listitem>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3bb579a3a4..f3edc01605 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -418,6 +418,23 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
return COPY_ON_ERROR_STOP; /* keep compiler quiet */
}
+/*
+ * Extract REJECT_LIMIT value from a DefElem.
+ */
+static int64
+defGetCopyRejectLimitOption(DefElem *def)
+{
+ int64 reject_limit = defGetInt64(def);
+
+ if (reject_limit <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%lld) must be greater than zero",
+ (long long) reject_limit)));
+
+ return reject_limit;
+}
+
/*
* Extract a CopyLogVerbosityChoice value from a DefElem.
*/
@@ -470,6 +487,7 @@ ProcessCopyOptions(ParseState *pstate,
bool header_specified = false;
bool on_error_specified = false;
bool log_verbosity_specified = false;
+ bool reject_limit_specified = false;
ListCell *option;
/* Support external use for option sanity checking */
@@ -636,6 +654,13 @@ ProcessCopyOptions(ParseState *pstate,
log_verbosity_specified = true;
opts_out->log_verbosity = defGetCopyLogVerbosityChoice(defel, pstate);
}
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ if (reject_limit_specified)
+ errorConflictingDefElem(defel, pstate);
+ reject_limit_specified = true;
+ opts_out->reject_limit = defGetCopyRejectLimitOption(defel);
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -669,6 +694,15 @@ ProcessCopyOptions(ParseState *pstate,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("only ON_ERROR STOP is allowed in BINARY mode")));
+ if (opts_out->reject_limit && !opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ /*- translator: first and second %s are the names of COPY
+ * option, e.g. ON_ERROR, third is the value of the COPY option,
+ * e.g. IGNORE */
+ errmsg("COPY %s requires %s to be set to %s",
+ "REJECT_LIMIT", "ON_ERROR", "IGNORE")));
+
/* Set defaults for omitted options */
if (!opts_out->delim)
opts_out->delim = opts_out->csv_mode ? "," : "\t";
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 2d3462913e..c17fbf71bc 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1024,6 +1024,11 @@ CopyFrom(CopyFromState cstate)
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
++skipped);
+ if (cstate->opts.reject_limit && skipped > cstate->opts.reject_limit)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("skipped more than REJECT_LIMIT rows: \"%lld\",",
+ (long long) cstate->opts.reject_limit)));
continue;
}
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 141fd48dc1..c3a7613778 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -83,6 +83,7 @@ typedef struct CopyFormatOptions
bool convert_selectively; /* do selective binary conversion? */
CopyOnErrorChoice on_error; /* what to do when error happened */
CopyLogVerbosityChoice log_verbosity; /* verbosity of logged messages */
+ int64 reject_limit; /* maximum tolerable number of errors */
List *convert_select; /* list of column names (can be NIL) */
} CopyFormatOptions;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 61a19cdc4c..8fb27d1d5f 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -116,6 +116,10 @@ COPY x to stdout (log_verbosity unsupported);
ERROR: COPY LOG_VERBOSITY "unsupported" not recognized
LINE 1: COPY x to stdout (log_verbosity unsupported);
^
+COPY x from stdin with (reject_limit 1);
+ERROR: COPY REJECT_LIMIT requires ON_ERROR to be set to IGNORE
+COPY x from stdin with (on_error ignore, reject_limit 0);
+ERROR: REJECT_LIMIT (0) must be greater than zero
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -789,6 +793,12 @@ CONTEXT: COPY check_ign_err, line 1: "1 {1}"
COPY check_ign_err FROM STDIN WITH (on_error ignore);
ERROR: extra data after last expected column
CONTEXT: COPY check_ign_err, line 1: "1 {1} 3 abc"
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+ERROR: skipped more than REJECT_LIMIT rows: "3",
+CONTEXT: COPY check_ign_err, line 5, column n: ""
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index 8b14962194..2d775d9c97 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -82,6 +82,8 @@ COPY x to stdout (format TEXT, force_null(a));
COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
+COPY x from stdin with (reject_limit 1);
+COPY x from stdin with (on_error ignore, reject_limit 0);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -557,6 +559,25 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore);
1 {1} 3 abc
\.
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
base-commit: 4c7cd07aa62abc29e6885e95b5307e5e08bb3bf7
--
2.39.2
I'm wondering if we can use the wording maxerror as in the attached
patch.
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ Specifies the maximum number of errors tolerated while converting a
+ column's input value to its data type, when
<literal>ON_ERROR</literal> is
+ set to <literal>ignore</literal>.
+ If the input causes more errors than the specified value, the
<command>COPY</command>
+ command fails, even with <literal>ON_ERROR</literal> set to
<literal>ignore</literal>.
+ This clause must be used with
<literal>ON_ERROR</literal>=<literal>ignore</literal>
+ and <replaceable class="parameter">maxerror</replaceable> must
be positive.
+ If not specified, <literal>ON_ERROR</literal>=<literal>ignore</literal>
+ allows an unlimited number of errors, meaning
<command>COPY</command> will
+ skip all erroneous data.
+ </para>
+ </listitem>
+ </varlistentry>
mentioning <replaceable class="parameter">maxerror</replaceable> is a
bigint type
or explicitly mentioning the maximum allowed value of "maxerror" would be great.
other than that, it looks good to me.
On 2024/09/30 21:08, torikoshia wrote:
Since defGetInt64() checks not only whether the input value exceeds the range of bigint but also input value is specified, attached v6 patch checks them by directly calling defGetInt64().
Thanks for updating the patch! But the patch no longer applied cleanly
to the master branch. Could you rebase it?
+ if (opts_out->reject_limit && !opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ /*- translator: first and second %s are the names of COPY
+ * option, e.g. ON_ERROR, third is the value of the COPY option,
+ * e.g. IGNORE */
+ errmsg("COPY %s requires %s to be set to %s",
+ "REJECT_LIMIT", "ON_ERROR", "IGNORE")));
This check might be better moved to other place, for example at
the bottom of ProcessCopyOptions(). I noticed a comment in the code:
"Check for incompatible options (must do these two before inserting defaults)."
You added your condition after this comment and before inserting the defaults,
but it's not related to that process. So, moving this check to other place
seems more appropriate.
While reviewing, I also noticed that the check for
"opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP"
is similarly placed before setting the defaults, which might not
be correct. This check should probably be moved as well.
Additionally, the comment mentioning "must do these two" should be
updated to "must do these three." These changes should be handled
in a separate patch.
+ if (cstate->opts.reject_limit &&
Wouldn't it be clearer to use cstate->opts.reject_limit > 0 for better readability?
+ cstate->num_errors > cstate->opts.reject_limit)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("skipped more than REJECT_LIMIT rows: \"%lld\",",
+ (long long) cstate->opts.reject_limit)));
To make the error message clearer, similar to the NOTICE about
skipped rows, how about rewording it to:
"skipped more than REJECT_LIMIT (%lld) rows due to data type incompatibility"?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Thanks for your review!
On Thu, Oct 3, 2024 at 4:27 PM jian he <jian.universality@gmail.com>
wrote:
mentioning <replaceable class="parameter">maxerror</replaceable> is a
bigint type
or explicitly mentioning the maximum allowed value of "maxerror" would
be great.
Added a description that it allows positive bigint.
On Thu, Oct 3, 2024 at 11:43 PM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
+ if (opts_out->reject_limit && !opts_out->on_error) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + /*- translator: first and second %s are the names of COPY + * option, e.g. ON_ERROR, third is the value of the COPY option, + * e.g. IGNORE */ + errmsg("COPY %s requires %s to be set to %s", + "REJECT_LIMIT", "ON_ERROR", "IGNORE")));This check might be better moved to other place, for example at
the bottom of ProcessCopyOptions(). I noticed a comment in the code:
"Check for incompatible options (must do these two before inserting
defaults)."
You added your condition after this comment and before inserting the
defaults,
but it's not related to that process. So, moving this check to other
place
seems more appropriate.
Agreed. Moved it to the bottom of ProcessCopyOptions().
While reviewing, I also noticed that the check for
"opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP"
is similarly placed before setting the defaults, which might not
be correct. This check should probably be moved as well.
Additionally, the comment mentioning "must do these two" should be
updated to "must do these three." These changes should be handled
in a separate patch.
Agreed and attached 0002 patch.
+ if (cstate->opts.reject_limit &&
Wouldn't it be clearer to use cstate->opts.reject_limit > 0 for better
readability?
Agreed.
+ cstate->num_errors > cstate->opts.reject_limit) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + errmsg("skipped more than REJECT_LIMIT rows: \"%lld\",", + (long long) cstate->opts.reject_limit)));To make the error message clearer, similar to the NOTICE about
skipped rows, how about rewording it to:
"skipped more than REJECT_LIMIT (%lld) rows due to data type
incompatibility"?
Agreed.
Also considering when REJECT_LIMIT is specified to 1, attached patch
uses errmsg_plural() instead of errmsg.
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments:
v7-0001-Subject-Add-REJECT_LIMIT-option-to-COPY-command.patchtext/x-diff; name=v7-0001-Subject-Add-REJECT_LIMIT-option-to-COPY-command.patchDownload
From 360e170b8aee378c0f498d8899d9f3b5d41d0310 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 7 Oct 2024 19:57:51 +0900
Subject: [PATCH v7 1/2] Subject: Add REJECT_LIMIT option to COPY command.
9e2d870 enabled the COPY FROM command to skip soft errors, but there
was no limitation about the number of errors and skipped all the soft
errors.
In some cases there would be a limitation how many errors are
tolerable. This patch introduces REJECT_LIMIT to specify how many
errors can be skipped.
---
doc/src/sgml/ref/copy.sgml | 19 +++++++++++++++++
src/backend/commands/copy.c | 33 +++++++++++++++++++++++++++++
src/backend/commands/copyfrom.c | 9 ++++++++
src/include/commands/copy.h | 1 +
src/test/regress/expected/copy2.out | 10 +++++++++
src/test/regress/sql/copy2.sql | 21 ++++++++++++++++++
6 files changed, 93 insertions(+)
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index b9413d4892..59a17d7793 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
ON_ERROR <replaceable class="parameter">error_action</replaceable>
+ REJECT_LIMIT <replaceable class="parameter">maxerror</replaceable>
ENCODING '<replaceable class="parameter">encoding_name</replaceable>'
LOG_VERBOSITY <replaceable class="parameter">verbosity</replaceable>
</synopsis>
@@ -413,6 +414,24 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>REJECT_LIMIT</literal></term>
+ <listitem>
+ <para>
+ Specifies the maximum number of errors tolerated while converting a
+ column's input value to its data type, when <literal>ON_ERROR</literal> is
+ set to <literal>ignore</literal>.
+ If the input causes more errors than the specified value, the <command>COPY</command>
+ command fails, even with <literal>ON_ERROR</literal> set to <literal>ignore</literal>.
+ This clause must be used with <literal>ON_ERROR</literal>=<literal>ignore</literal>
+ and <replaceable class="parameter">maxerror</replaceable> must be positive <type>bigint</type>.
+ If not specified, <literal>ON_ERROR</literal>=<literal>ignore</literal>
+ allows an unlimited number of errors, meaning <command>COPY</command> will
+ skip all erroneous data.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>ENCODING</literal></term>
<listitem>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 03eb7a4eba..befab92074 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -418,6 +418,23 @@ defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
return COPY_ON_ERROR_STOP; /* keep compiler quiet */
}
+/*
+ * Extract REJECT_LIMIT value from a DefElem.
+ */
+static int64
+defGetCopyRejectLimitOption(DefElem *def)
+{
+ int64 reject_limit = defGetInt64(def);
+
+ if (reject_limit <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("REJECT_LIMIT (%lld) must be greater than zero",
+ (long long) reject_limit)));
+
+ return reject_limit;
+}
+
/*
* Extract a CopyLogVerbosityChoice value from a DefElem.
*/
@@ -472,6 +489,7 @@ ProcessCopyOptions(ParseState *pstate,
bool header_specified = false;
bool on_error_specified = false;
bool log_verbosity_specified = false;
+ bool reject_limit_specified = false;
ListCell *option;
/* Support external use for option sanity checking */
@@ -638,6 +656,13 @@ ProcessCopyOptions(ParseState *pstate,
log_verbosity_specified = true;
opts_out->log_verbosity = defGetCopyLogVerbosityChoice(defel, pstate);
}
+ else if (strcmp(defel->defname, "reject_limit") == 0)
+ {
+ if (reject_limit_specified)
+ errorConflictingDefElem(defel, pstate);
+ reject_limit_specified = true;
+ opts_out->reject_limit = defGetCopyRejectLimitOption(defel);
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -874,6 +899,14 @@ ProcessCopyOptions(ParseState *pstate,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("NULL specification and DEFAULT specification cannot be the same")));
}
+ /* Check on_error */
+ if (opts_out->reject_limit && !opts_out->on_error)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ /*- translator: first and second %s are the names of COPY option, e.g.
+ * ON_ERROR, third is the value of the COPY option, e.g. IGNORE */
+ errmsg("COPY %s requires %s to be set to %s",
+ "REJECT_LIMIT", "ON_ERROR", "IGNORE")));
}
/*
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 9139a40785..b04e770cfa 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1018,6 +1018,15 @@ CopyFrom(CopyFromState cstate)
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
cstate->num_errors);
+ if (cstate->opts.reject_limit > 0 && \
+ cstate->num_errors > cstate->opts.reject_limit)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg_plural("skipped more than REJECT_LIMIT (%lld) row due to data type incompatibility",
+ "skipped more than REJECT_LIMIT (%lld) rows due to data type incompatibility",
+ (long long) cstate->opts.reject_limit,
+ (long long) cstate->opts.reject_limit)));
+
/* Repeat NextCopyFrom() until no soft error occurs */
continue;
}
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 6f64d97fdd..4002a7f538 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -85,6 +85,7 @@ typedef struct CopyFormatOptions
bool convert_selectively; /* do selective binary conversion? */
CopyOnErrorChoice on_error; /* what to do when error happened */
CopyLogVerbosityChoice log_verbosity; /* verbosity of logged messages */
+ int64 reject_limit; /* maximum tolerable number of errors */
List *convert_select; /* list of column names (can be NIL) */
} CopyFormatOptions;
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 4e752977b5..d7d975522f 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -116,6 +116,10 @@ COPY x to stdout (log_verbosity unsupported);
ERROR: COPY LOG_VERBOSITY "unsupported" not recognized
LINE 1: COPY x to stdout (log_verbosity unsupported);
^
+COPY x from stdin with (reject_limit 1);
+ERROR: COPY REJECT_LIMIT requires ON_ERROR to be set to IGNORE
+COPY x from stdin with (on_error ignore, reject_limit 0);
+ERROR: REJECT_LIMIT (0) must be greater than zero
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
ERROR: column "d" specified more than once
@@ -791,6 +795,12 @@ CONTEXT: COPY check_ign_err, line 1: "1 {1}"
COPY check_ign_err FROM STDIN WITH (on_error ignore);
ERROR: extra data after last expected column
CONTEXT: COPY check_ign_err, line 1: "1 {1} 3 abc"
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+ERROR: skipped more than REJECT_LIMIT (3) rows due to data type incompatibility
+CONTEXT: COPY check_ign_err, line 5, column n: ""
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+NOTICE: 4 rows were skipped due to data type incompatibility
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index fa6aa17344..1aa0e41b68 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -82,6 +82,8 @@ COPY x to stdout (format TEXT, force_null(a));
COPY x to stdin (format CSV, force_null(a));
COPY x to stdin (format BINARY, on_error unsupported);
COPY x to stdout (log_verbosity unsupported);
+COPY x from stdin with (reject_limit 1);
+COPY x from stdin with (on_error ignore, reject_limit 0);
-- too many columns in column list: should fail
COPY x (a, b, c, d, e, d, c) from stdin;
@@ -561,6 +563,25 @@ COPY check_ign_err FROM STDIN WITH (on_error ignore);
1 {1} 3 abc
\.
+-- tests for reject_limit option
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 3);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
+COPY check_ign_err FROM STDIN WITH (on_error ignore, reject_limit 4);
+6 {6} 6
+a {7} 7
+8 {8} 8888888888
+9 {a, 9} 9
+
+10 {10} 10
+\.
+
-- clean up
DROP TABLE forcetest;
DROP TABLE vistest;
--
2.39.2
v7-0002-Modify-the-place-and-comments-of-ProcessCopyOptio.patchtext/x-diff; name=v7-0002-Modify-the-place-and-comments-of-ProcessCopyOptio.patchDownload
From cf0ce6d2e9362d3c36755a306b913c9cc89b0007 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 7 Oct 2024 20:10:22 +0900
Subject: [PATCH v7 2/2] Move check for binary mode and on_error option to appropriate location
The current location for checking for binary mode and on_error option
is not appropriate, as this is where checks should be performed before
inserting default values.
Additionally, the comment mentions two checks, but there are now three.
This patch moves the check for incompatible options to the end of
ProcessCopyOptions() and updates the comment accordingly.
---
src/backend/commands/copy.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index befab92074..0b093dbb2a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -672,7 +672,7 @@ ProcessCopyOptions(ParseState *pstate,
}
/*
- * Check for incompatible options (must do these two before inserting
+ * Check for incompatible options (must do these three before inserting
* defaults)
*/
if (opts_out->binary && opts_out->delim)
@@ -691,11 +691,6 @@ ProcessCopyOptions(ParseState *pstate,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("cannot specify %s in BINARY mode", "DEFAULT")));
- if (opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP)
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("only ON_ERROR STOP is allowed in BINARY mode")));
-
/* Set defaults for omitted options */
if (!opts_out->delim)
opts_out->delim = opts_out->csv_mode ? "," : "\t";
@@ -900,6 +895,11 @@ ProcessCopyOptions(ParseState *pstate,
errmsg("NULL specification and DEFAULT specification cannot be the same")));
}
/* Check on_error */
+ if (opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("only ON_ERROR STOP is allowed in BINARY mode")));
+
if (opts_out->reject_limit && !opts_out->on_error)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
--
2.39.2
On 2024/10/07 21:51, torikoshia wrote:
While reviewing, I also noticed that the check for
"opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP"
is similarly placed before setting the defaults, which might not
be correct. This check should probably be moved as well.
Additionally, the comment mentioning "must do these two" should be
updated to "must do these three." These changes should be handled
in a separate patch.Agreed and attached 0002 patch.
Thanks for updating the 0001 patch and creating the 0002 patch! I've pushed both.
Also considering when REJECT_LIMIT is specified to 1, attached patch uses errmsg_plural() instead of errmsg.
I don't think errmsg_plural() is needed here since, when 1 is specified,
"rows" should follow "more than REJECT_LIMIT (1)". No?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On 2024-10-08 18:39, Fujii Masao wrote:
On 2024/10/07 21:51, torikoshia wrote:
While reviewing, I also noticed that the check for
"opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP"
is similarly placed before setting the defaults, which might not
be correct. This check should probably be moved as well.
Additionally, the comment mentioning "must do these two" should be
updated to "must do these three." These changes should be handled
in a separate patch.Agreed and attached 0002 patch.
Thanks for updating the 0001 patch and creating the 0002 patch! I've
pushed both.
Thanks a lot!
Also considering when REJECT_LIMIT is specified to 1, attached patch
uses errmsg_plural() instead of errmsg.I don't think errmsg_plural() is needed here since, when 1 is
specified,
"rows" should follow "more than REJECT_LIMIT (1)". No?
You are right.
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation