Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row
Hi,
The option choice of "ignore" in the COPY ON_ERROR clause seems overly
generic. There would seem to be two relevant ways to ignore bad column
input data - drop the entire row or just set the column value to null. I
can see us wanting to provide the set to null option and in any case having
the option name be explicit that it ignores the row seems like a good idea.
David J.
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
Hi,
The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic. There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null. I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.
two issue I found out while playing around with it;
create table x1(a int not null, b int not null );
you can only do:
COPY x1 from stdin (on_error 'null');
but you cannot do
COPY x1 from stdin (on_error null);
we need to hack the gram.y to escape the "null". I don't know how to
make it work.
related post I found:
https://stackoverflow.com/questions/31786611/how-to-escape-flex-keyword
another issue:
COPY x1 from stdin (on_error null);
when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.
On Sun, Jan 28, 2024 at 4:51 PM jian he <jian.universality@gmail.com> wrote:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:Hi,
The option choice of "ignore" in the COPY ON_ERROR clause seems overly
generic. There would seem to be two relevant ways to ignore bad column
input data - drop the entire row or just set the column value to null. I
can see us wanting to provide the set to null option and in any case having
the option name be explicit that it ignores the row seems like a good idea.two issue I found out while playing around with it;
create table x1(a int not null, b int not null );another issue:
COPY x1 from stdin (on_error null);when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.
You should not error immediately since whether or not there is a problem is
table and data dependent. I would not check for the case of all columns
being defined not null and just let the mismatch happen.
That said, maybe with this being a string we can accept something like:
'null, ignore'
And so if attempting to place any one null fails, assuming we can make that
a soft error too, we would then ignore the entire row.
David J.
On Fri, 26 Jan 2024 08:08:29 -0700
"David G. Johnston" <david.g.johnston@gmail.com> wrote:
Hi,
The option choice of "ignore" in the COPY ON_ERROR clause seems overly
generic. There would seem to be two relevant ways to ignore bad column
input data - drop the entire row or just set the column value to null. I
can see us wanting to provide the set to null option and in any case having
the option name be explicit that it ignores the row seems like a good idea.
I am not in favour of renaming the option name "ignore", instead we can
use another style of name for the option to set the column value to NULL,
for example, "set_to_null".
(Maybe, we can make a more generic option "set_to (col, val)" that can set
the value of column specified by "col" value to the specified value "val"
(e.g. 'N/A') on a soft error, although the syntax would be a bit complex...)
IMO, it is more simple to define "ignore" as to skip the entire row rather
than having variety of "ignore". Once defined it so, the option to set the
column value to NULL should not be called "ignore" because values in other
columns will be inserted.
Regards,
Yugo Nagata
David J.
--
Yugo NAGATA <nagata@sraoss.co.jp>
The idea of on_error is to tolerate errors, I think.
if a column has a not null constraint, let it cannot be used with
(on_error 'null')
Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the keyword NULL should be single quoted.
demo:
COPY check_ign_err FROM STDIN WITH (on_error 'null');
1 {1} a
2 {2} 1
3 {3} 2
4 {4} b
a {5} c
\.
\pset null NULL
SELECT * FROM check_ign_err;
n | m | k
------+-----+------
1 | {1} | NULL
2 | {2} | 1
3 | {3} | 2
4 | {4} | NULL
NULL | {5} | NULL
Attachments:
v1-0001-introduce-copy-on_error-null-option.patchtext/x-patch; charset=US-ASCII; name=v1-0001-introduce-copy-on_error-null-option.patchDownload+107-7
Hi,
On 2024-02-03 15:22, jian he wrote:
The idea of on_error is to tolerate errors, I think.
if a column has a not null constraint, let it cannot be used with
(on_error 'null')
+ /* + * we can specify on_error 'null', but it can only apply to columns + * don't have not null constraint. + */ + if (att->attnotnull && cstate->opts.on_error == COPY_ON_ERROR_NULL) + ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("copy on_error 'null' cannot be used with not null constraint column")));
This means we cannot use ON_ERROR 'null' even when there is one column
which have NOT NULL constraint, i.e. primary key, right?
IMHO this is strong constraint and will decrease the opportunity to use
this feature.
It might be better to allow error_action 'null' for tables which have
NOT NULL constraint columns, and when facing soft errors for those rows,
skip that row or stop COPY.
Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the keyword NULL should be single quoted.
As you mentioned, single quotation seems a little odd..
I'm not sure what is the best name and syntax for this feature, but
since current error_action are verbs('stop' and 'ignore'), I feel 'null'
might not be appropriate.
demo:
COPY check_ign_err FROM STDIN WITH (on_error 'null');
1 {1} a
2 {2} 1
3 {3} 2
4 {4} b
a {5} c
\.\pset null NULL
SELECT * FROM check_ign_err;
n | m | k
------+-----+------
1 | {1} | NULL
2 | {2} | 1
3 | {3} | 2
4 | {4} | NULL
NULL | {5} | NULL
Since we notice the number of ignored rows when ON_ERROR is 'ignore',
users may want to know the number of rows which was changed to NULL when
using ON_ERROR 'null'.
--
Regards,
--
Atsushi Torikoshi
NTT DATA Group Corporation
On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,
On 2024-02-03 15:22, jian he wrote:
The idea of on_error is to tolerate errors, I think.
if a column has a not null constraint, let it cannot be used with
(on_error 'null')+ /* + * we can specify on_error 'null', but it can only apply to columns + * don't have not null constraint. + */ + if (att->attnotnull && cstate->opts.on_error == COPY_ON_ERROR_NULL) + ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("copy on_error 'null' cannot be used with not null constraint column")));This means we cannot use ON_ERROR 'null' even when there is one column
which have NOT NULL constraint, i.e. primary key, right?
IMHO this is strong constraint and will decrease the opportunity to use
this feature.
I don't want to fail in the middle of bulk inserts,
so I thought immediately erroring out would be a great idea.
Let's see what other people think.
On Mon, 05 Feb 2024 11:28:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:
Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the keyword NULL should be single quoted.As you mentioned, single quotation seems a little odd..
I'm not sure what is the best name and syntax for this feature, but
since current error_action are verbs('stop' and 'ignore'), I feel 'null'
might not be appropriate.
I am not in favour of using 'null' either, so I suggested to use
"set_to_null" or more generic syntax like "set_to (col, val)" in my
previous post[1]/messages/by-id/20240129172858.ccb6c77c3be95a295e2b2b44@sraoss.co.jp, although I'm not convinced what is the best either.
[1]: /messages/by-id/20240129172858.ccb6c77c3be95a295e2b2b44@sraoss.co.jp
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
On Mon, 05 Feb 2024 11:28:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the keyword NULL should be single quoted.As you mentioned, single quotation seems a little odd..
I'm not sure what is the best name and syntax for this feature, but
since current error_action are verbs('stop' and 'ignore'), I feel 'null'
might not be appropriate.I am not in favour of using 'null' either, so I suggested to use
"set_to_null" or more generic syntax like "set_to (col, val)" in my
previous post[1], although I'm not convinced what is the best either.[1] /messages/by-id/20240129172858.ccb6c77c3be95a295e2b2b44@sraoss.co.jp
Tom sugggested using a separate option, and I agree with the
suggestion. Taking this into consideration, I imagined something like
the following, for example. Although I'm not sure we are actually
going to do whole-tuple replacement, the action name in this example
has the suffix '-column'.
COPY (on_error 'replace-colomn', replacement 'null') ..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
On Mon, 05 Feb 2024 11:28:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the keyword NULL should be single quoted.As you mentioned, single quotation seems a little odd..
I'm not sure what is the best name and syntax for this feature, but
since current error_action are verbs('stop' and 'ignore'), I feel 'null'
might not be appropriate.I am not in favour of using 'null' either, so I suggested to use
"set_to_null" or more generic syntax like "set_to (col, val)" in my
previous post[1], although I'm not convinced what is the best either.[1] /messages/by-id/20240129172858.ccb6c77c3be95a295e2b2b44@sraoss.co.jp
Tom sugggested using a separate option, and I agree with the
suggestion. Taking this into consideration, I imagined something like
the following, for example. Although I'm not sure we are actually
going to do whole-tuple replacement, the action name in this example
has the suffix '-column'.COPY (on_error 'replace-colomn', replacement 'null') ..
Thank you for your information. I've found a post[1]/messages/by-id/2070915.1705527477@sss.pgh.pa.us you mentioned,
where adding a separate option for error log destination was suggested.
Considering consistency with other options, adding a separate option
would be better if we want to specify a value to replace the invalid
value, without introducing a complex syntax that allows options with
more than one parameters. Maybe, if we allow to use values for the
replacement other than NULL, we have to also add a option to specify
a column (or a type) for each replacement value. Or, we may add a
option to specify a list of replacement values as many as the number of
columns, each of whose default is NULL.
Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
value.
[1]: /messages/by-id/2070915.1705527477@sss.pgh.pa.us
Regards,
Yugo Nagata
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
--
Yugo NAGATA <nagata@sraoss.co.jp>
On Tue, Feb 6, 2024 at 3:46 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote:
On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
On Mon, 05 Feb 2024 11:28:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the keyword NULL should be single quoted.As you mentioned, single quotation seems a little odd..
I'm not sure what is the best name and syntax for this feature, but
since current error_action are verbs('stop' and 'ignore'), I feel 'null'
might not be appropriate.I am not in favour of using 'null' either, so I suggested to use
"set_to_null" or more generic syntax like "set_to (col, val)" in my
previous post[1], although I'm not convinced what is the best either.[1] /messages/by-id/20240129172858.ccb6c77c3be95a295e2b2b44@sraoss.co.jp
Tom sugggested using a separate option, and I agree with the
suggestion. Taking this into consideration, I imagined something like
the following, for example. Although I'm not sure we are actually
going to do whole-tuple replacement, the action name in this example
has the suffix '-column'.COPY (on_error 'replace-colomn', replacement 'null') ..
Thank you for your information. I've found a post[1] you mentioned,
where adding a separate option for error log destination was suggested.Considering consistency with other options, adding a separate option
would be better if we want to specify a value to replace the invalid
value, without introducing a complex syntax that allows options with
more than one parameters. Maybe, if we allow to use values for the
replacement other than NULL, we have to also add a option to specify
a column (or a type) for each replacement value. Or, we may add a
option to specify a list of replacement values as many as the number of
columns, each of whose default is NULL.Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
value.
Let's say tabe t column (a,b,c)
if we support set_to_null(a,b), what should we do if column c has an error.
should we ignore this row or error out immediately?
also I am not sure it's doable to just extract columnList from the
function defGetCopyOnErrorChoice.
to make `COPY x from stdin (on_error set_to_null(a,b);` work,
we may need to refactor to gram.y, in a similar way we do force null
i am ok with
COPY x from stdin (on_error set_to_null);
On Mon, 5 Feb 2024 14:26:46 +0800
jian he <jian.universality@gmail.com> wrote:
On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,
On 2024-02-03 15:22, jian he wrote:
The idea of on_error is to tolerate errors, I think.
if a column has a not null constraint, let it cannot be used with
(on_error 'null')+ /* + * we can specify on_error 'null', but it can only apply to columns + * don't have not null constraint. + */ + if (att->attnotnull && cstate->opts.on_error == COPY_ON_ERROR_NULL) + ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("copy on_error 'null' cannot be used with not null constraint column")));This means we cannot use ON_ERROR 'null' even when there is one column
which have NOT NULL constraint, i.e. primary key, right?
IMHO this is strong constraint and will decrease the opportunity to use
this feature.I don't want to fail in the middle of bulk inserts,
so I thought immediately erroring out would be a great idea.
Let's see what other people think.
I also think this restriction is too strong because it is very
common that a table has a primary key, unless there is some way
to specify columns that can be set to NULL. Even when ON_ERROR
is specified, any constraint violation errors cannot be generally
ignored, so we cannot elimiate the posibility COPY FROM fails in
the middle due to invalid data, anyway.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
attached v2.
syntax: `on_error set_to_null`
based on upthread discussion, now if you specified `on_error
set_to_null` and your column has `not
null` constraint, we convert the error field to null, so it may error
while bulk inserting for violating NOT NULL constraint.
Attachments:
v2-0001-on_error-set_to_null.patchtext/x-patch; charset=US-ASCII; name=v2-0001-on_error-set_to_null.patchDownload+99-9
Hi!
On 12.02.24 01:00, jian he wrote:
attached v2.
syntax: `on_error set_to_null`
based on upthread discussion, now if you specified `on_error
set_to_null` and your column has `not
null` constraint, we convert the error field to null, so it may error
while bulk inserting for violating NOT NULL constraint.
That's a very nice feature. Thanks for implementing it!
v2 applies cleanly and works as described.
\pset null '(NULL)'
CREATE TEMPORARY TABLE t1 (a int, b int);
COPY t1 (a,b) FROM STDIN;
1 a
2 1
3 2
4 b
a c
\.
SELECT * FROM t1;
CONTEXT: COPY t1, line 1, column b: "a"
a | b
---+---
(0 rows)
CREATE TEMPORARY TABLE t2 (a int, b int);
COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null);
1 a
2 1
3 2
4 b
a c
\.
SELECT * FROM t2;
psql:test-copy-on_error-2.sql:12: NOTICE: some columns of 3 rows, value
were converted to NULL due to data type incompatibility
COPY 5
a | b
--------+--------
1 | (NULL)
2 | 1
3 | 2
4 | (NULL)
(NULL) | (NULL)
(5 rows)
I have one question though:
In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all? See t2 example above.
I'm not sure if these records would be of any use in the table. What do
you think?
Since the parameter is already called "set_to_null", maybe it is not
necessary to mention in the NOTICE message that the values have been set
to null.
Perhaps something like "XX records were only partially copied due to
data type incompatibility"
--
Jim
On Fri, Feb 16, 2024 at 1:16 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all?
Yes. In particular not all columns in the table need be specified in the
copy command so while the parsed input data is all nulls the record itself
may not be.
The system should allow the user to exclude rows with incomplete data by
ignoring a not null constraint violation.
In short we shouldn't judge non-usefulness and instead give tools to the
user to decide for themselves.
David J.
On 16.02.24 21:31, David G. Johnston wrote:
Yes. In particular not all columns in the table need be specified in
the copy command so while the parsed input data is all nulls the
record itself may not be.
Yeah, you have a point there.
I guess if users want to avoid it to happen they can rely on NOT NULL
constraints.
Thanks
--
Jim
hi all.
patch updated.
simplified the code a lot.
idea is same:
COPY t_on_error_null FROM STDIN WITH (on_error set_to_null);
If the STDIN number of columns is the same as the target table, then
InputFunctionCallSafe
call failure will make that column values be null.
If the STDIN number of columns is not the same as the target table, then error
ERROR: missing data for column \"%s\"
ERROR: extra data after last expected column
which is status quo.
Attachments:
v3-0001-on_error-set_to_null.patchtext/x-patch; charset=US-ASCII; name=v3-0001-on_error-set_to_null.patchDownload+100-4
Hi there
On 26.08.24 02:00, jian he wrote:
hi all.
patch updated.
simplified the code a lot.idea is same:
COPY t_on_error_null FROM STDIN WITH (on_error set_to_null);If the STDIN number of columns is the same as the target table, then
InputFunctionCallSafe
call failure will make that column values be null.If the STDIN number of columns is not the same as the target table, then error
ERROR: missing data for column \"%s\"
ERROR: extra data after last expected column
which is status quo.
I wanted to give it another try, but the patch does not apply ...
$ git apply ~/patches/copy_on_error/v3-0001-on_error-set_to_null.patch -v
Checking patch doc/src/sgml/ref/copy.sgml...
Checking patch src/backend/commands/copy.c...
Checking patch src/backend/commands/copyfrom.c...
Checking patch src/backend/commands/copyfromparse.c...
Checking patch src/include/commands/copy.h...
Checking patch src/test/regress/expected/copy2.out...
error: while searching for:
NOTICE: skipping row due to data type incompatibility at line 8 for
column k: "a"
CONTEXT: COPY check_ign_err
NOTICE: 6 rows were skipped due to data type incompatibility
-- tests for on_error option with log_verbosity and null constraint via
domain
CREATE DOMAIN dcheck_ign_err2 varchar(15) NOT NULL;
CREATE TABLE check_ign_err2 (n int, m int[], k int, l dcheck_ign_err2);
error: patch failed: src/test/regress/expected/copy2.out:753
error: src/test/regress/expected/copy2.out: patch does not apply
Checking patch src/test/regress/sql/copy2.sql...
--
Jim
On Mon, Sep 9, 2024 at 10:34 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
Hi there
On 26.08.24 02:00, jian he wrote:
hi all.
patch updated.
simplified the code a lot.idea is same:
COPY t_on_error_null FROM STDIN WITH (on_error set_to_null);If the STDIN number of columns is the same as the target table, then
InputFunctionCallSafe
call failure will make that column values be null.If the STDIN number of columns is not the same as the target table, then error
ERROR: missing data for column \"%s\"
ERROR: extra data after last expected column
which is status quo.I wanted to give it another try, but the patch does not apply ...
here we are.
please check the attached file.
Attachments:
v4-0001-on_error-set_to_null.patchtext/x-patch; charset=US-ASCII; name=v4-0001-on_error-set_to_null.patchDownload+114-6
On 12.09.24 12:13, jian he wrote:
please check the attached file.
v4 applies cleanly, it works as expected, and all tests pass.
postgres=# \pset null '(NULL)'
Null display is "(NULL)".
postgres=# CREATE TEMPORARY TABLE t2 (a int, b int);
CREATE TABLE
postgres=# COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null, format csv);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
1,a
2,1
3,2
4,b
a,c
\.
COPY 5
postgres=# SELECT * FROM t2;
a | b
--------+--------
1 | (NULL)
2 | 1
3 | 2
4 | (NULL)
(NULL) | (NULL)
(5 rows)
Perhaps small changes in the docs:
<literal>set_to_null</literal> means the input value will set to
<literal>null</literal> and continue with the next one.
"will set" -> "will be set"
"and continue with" -> "and will continue with"
Other than that, LGTM.
Thanks!
--
Jim