Is it possible to set end-of-data marker for COPY statement.
Hi hackers,
As described in the doc https://www.postgresql.org/docs/current/sql-copy.html, the TEXT format recognizes
backslash-period (\.) as end-of-data marker.
The example below will raise an error for the line contains `\.`.
CREATE TABLE test (
id int,
name text,
dep text
)
Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadad
Then execute
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.
This requires users to escape the end-of-data marker manually in their data.
Why we don't have a mechanism to define other characters as end-of-data marker?
Or there are other ways to avoid escape the end-of-data in data?
Regards,
Junfeng
[redirected from -hackers]
On Tue, 2020-09-01 at 06:14 +0000, Junfeng Yang wrote:
As described in the doc https://www.postgresql.org/docs/current/sql-copy.html, the TEXT format recognizes
backslash-period (\.) as end-of-data marker.The example below will raise an error for the line contains `\.`.
CREATE TABLE test (
id int,
name text,
dep text
)Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadad
Then execute
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.This requires users to escape the end-of-data marker manually in their data.
Why we don't have a mechanism to define other characters as end-of-data marker?
Or there are other ways to avoid escape the end-of-data in data?
Your problem is that the file contains bad data.
You are using the default TEXT format of copy, and backslashes must
be escaped there.
Everything will work as you want if you write the first line correctly like
122,as\\.d,adad
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
Thanks, Laurenz!
I understand `\\.` should work. But this requires users to modify huge data.
I'm wondering is it possible to change the default end-of-data marker or
could we implement a method to let users choose their own marker?
________________________________
发件人: Laurenz Albe <laurenz.albe@cybertec.at>
发送时间: 2020年9月1日 17:06
收件人: Junfeng Yang <yjerome@vmware.com>; pgsql-general@postgresql.org <pgsql-general@postgresql.org>
主题: Re: Is it possible to set end-of-data marker for COPY statement.
[redirected from -hackers]
On Tue, 2020-09-01 at 06:14 +0000, Junfeng Yang wrote:
As described in the doc https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.postgresql.org%2Fdocs%2Fcurrent%2Fsql-copy.html&amp;data=02%7C01%7Cyjerome%40vmware.com%7Cd75fda5803f54dcd9dc408d84e5640d7%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637345479659888992&amp;sdata=CyY%2FAwPTB%2Bl3qMZbcxts5zKrW1QOiqL5%2Ft8MSEuaQk4%3D&amp;reserved=0, the TEXT format recognizes
backslash-period (\.) as end-of-data marker.The example below will raise an error for the line contains `\.`.
CREATE TABLE test (
id int,
name text,
dep text
)Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadad
Then execute
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.This requires users to escape the end-of-data marker manually in their data.
Why we don't have a mechanism to define other characters as end-of-data marker?
Or there are other ways to avoid escape the end-of-data in data?
Your problem is that the file contains bad data.
You are using the default TEXT format of copy, and backslashes must
be escaped there.
Everything will work as you want if you write the first line correctly like
122,as\\.d,adad
Please, don't top-post on these lists.
On Tue, 2020-09-01 at 09:20 +0000, Junfeng Yang wrote:
发件人: Laurenz Albe <laurenz.albe@cybertec.at>
On Tue, 2020-09-01 at 06:14 +0000, Junfeng Yang wrote:
As described in the doc , the TEXT format recognizes
backslash-period (\.) as end-of-data marker.The example below will raise an error for the line contains `\.`.
CREATE TABLE test (
id int,
name text,
dep text
)Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadad
Then execute
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.This requires users to escape the end-of-data marker manually in their data.
Why we don't have a mechanism to define other characters as end-of-data marker?
Or there are other ways to avoid escape the end-of-data in data?Your problem is that the file contains bad data.
You are using the default TEXT format of copy, and backslashes must
be escaped there.Everything will work as you want if you write the first line correctly like
122,as\\.d,adad
I understand `\\.` should work. But this requires users to modify huge data.
I'm wondering is it possible to change the default end-of-data marker or
could we implement a method to let users choose their own marker?
I don't see the problem.
If you use the CSV format, your data will work fine the way they are:
COPY test FROM '/tmp/data' (FORMAT 'csv');
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
On Tue, Sep 1, 2020 at 06:14:45AM +0000, Junfeng Yang wrote:
Hi hackers,
As described in the doc https://www.postgresql.org/docs/current/sql-copy.html,
the TEXT format recognizes
backslash-period (\.) as end-of-data marker.The example below will raise an error for the line contains `\.`.
CREATE TABLE test (
id int,
name text,
dep text
)Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadadThen execute
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.
This requires users to escape the end-of-data marker manually in their data.
Why we don't have a mechanism to define other characters as end-of-data marker?
Or there are other ways to avoid escape the end-of-data in data?
This is the first I am hearing of this. The problem is that the system
can't decide if \. is escaping a delimiter, or the end-of-copy marker.
I think we need to just disable period as a delimiter. I don't think
there is enough demand to allow the end-of-data marker to be
configurable.
Interestingly, you can use period as s delimiter if you are copying from
a file that doesn't need an end-of-data marker and you never need to
escape the delimiter, but that seems like too rare a use case to allow
period to be supported as a delimiter.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
On Tue, Sep 1, 2020 at 9:05 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Sep 1, 2020 at 06:14:45AM +0000, Junfeng Yang wrote:
Hi hackers,
Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadadThen execute
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.
This is the first I am hearing of this. The problem is that the system
can't decide if \. is escaping a delimiter, or the end-of-copy marker.
I think we need to just disable period as a delimiter. I don't think
there is enough demand to allow the end-of-data marker to be
configurable.Interestingly, you can use period as s delimiter if you are copying from
a file that doesn't need an end-of-data marker and you never need to
escape the delimiter, but that seems like too rare a use case to allow
period to be supported as a delimiter.
Something isn't right here because the rules for end-of-copy are explicit
that the \. must appear on a line all by itself. That isn't the case with
the shown test data.
The system should do one of two things with that input (it seems option 2
is the one we've chosen):
One, see that the character following the backslash is not an action
character and just treat the backslash as data.
Two, complain that the character following the backslash is not a valid
action character.
The system is reporting an error, it's just trying to be helpful and seeing
the period it incorrectly reports that the error has something to do with
the end-of-copy marker when in reality all that can be said is that "a
period in this location is not valid" (unless the command uses DELIMITER
'<period>' at least, in which case the period is now valid and \. means a
literal period since its not alone on a line.)
The only limitation our definition of end-of-copy imposes is that a single
column input file cannot contain a record that is only a period. It does
not impose a limitation on which delimiters are valid.
David J.
On Tue, Sep 1, 2020 at 12:05:02PM -0400, Bruce Momjian wrote:
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.
This requires users to escape the end-of-data marker manually in their data.
Why we don't have a mechanism to define other characters as end-of-data marker?
Or there are other ways to avoid escape the end-of-data in data?This is the first I am hearing of this. The problem is that the system
can't decide if \. is escaping a delimiter, or the end-of-copy marker.
I think we need to just disable period as a delimiter. I don't think
there is enough demand to allow the end-of-data marker to be
configurable.Interestingly, you can use period as s delimiter if you are copying from
a file that doesn't need an end-of-data marker and you never need to
escape the delimiter, but that seems like too rare a use case to allow
period to be supported as a delimiter.
I am sorry I mis-read this email. The example uses _comma_ for the
delimiter, rather than period. Let me reply again. We already disallow
period for delimiters:
COPY test TO '/u/postgres/tmp/x' WITH (DELIMITER '.');
ERROR: COPY delimiter cannot be "."
COPY test TO STDOUT WITH (DELIMITER '.');
ERROR: COPY delimiter cannot be "."
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
On Tue, Sep 1, 2020 at 06:14:45AM +0000, Junfeng Yang wrote:
Hi hackers,
As described in the doc https://www.postgresql.org/docs/current/sql-copy.html,
the TEXT format recognizes
backslash-period (\.) as end-of-data marker.The example below will raise an error for the line contains `\.`.
CREATE TABLE test (
id int,
name text,
dep text
)Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadadThen execute
copy test from '/tmp/data' DELIMITER ',';
An end-of-copy marker corrupt error will be raised.
This requires users to escape the end-of-data marker manually in their data.
Why we don't have a mechanism to define other characters as end-of-data marker?
Or there are other ways to avoid escape the end-of-data in data?
So, you are using comma as the delimiter, but have \. (backslash-period)
as a data value. You need to double-up backslashes in your input data,
no matter what is after the backslash. You just happen to hit backslash
period, but other things like \N could cause problems --- literal
backslashes have to be doubled.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
Bruce Momjian <bruce@momjian.us> writes:
On Tue, Sep 1, 2020 at 06:14:45AM +0000, Junfeng Yang wrote:
Data in file "/tmp/data".
122,as\.d,adad
133,sa dad,adadad
So, you are using comma as the delimiter, but have \. (backslash-period)
as a data value. You need to double-up backslashes in your input data,
no matter what is after the backslash. You just happen to hit backslash
period, but other things like \N could cause problems --- literal
backslashes have to be doubled.
As mentioned upthread, using CSV format might work better. In CSV
you aren't necessarily stuck with backslash being treated as an
escape character.
regards, tom lane