encoding confusion with \copy command

Started by Martin Waiteover 11 years ago8 messagesgeneral
Jump to latest
#1Martin Waite
waite.134@gmail.com

Hi,

I have a postgresql 7.4 server and client on Centos 6.4. The database
server is using UTF-8 encoding.

I have been exploring the use of the \copy command for importing CSV data
generated by SQL Server 2008. SQL Server 2008 export tool does not escape
quotes that are in the content of fields, and so it is useful to be able to
specify obscure characters in the quote option in the \copy command to work
around this issue.

When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is mentioned
in the error message if characters outside the range are chosen:

\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252' QUOTE
as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff

I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character as
the quote character. Instead, I am limited to the range of characters
that can be encoded as a single byte in UTF-8. Actually, 0x00 is not
accepted either, so the range is 0x01 - 0x7F.

Is this a bug or expected behaviour ?

Is it the case that the server does the actual CSV parsing, and that given
that my server is in UTF8, I am therefore limited to single-byte UTF8
characters ?

regards,
Martin

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Martin Waite (#1)
Re: encoding confusion with \copy command

On 09/17/2014 03:03 AM, Martin Waite wrote:

Hi,

I have a postgresql 7.4 server and client on Centos 6.4. The database
server is using UTF-8 encoding.

First I think we need to establish what version of Postgres you using.
Are you really using 7.4?

I have been exploring the use of the \copy command for importing CSV
data generated by SQL Server 2008. SQL Server 2008 export tool does not
escape quotes that are in the content of fields, and so it is useful to
be able to specify obscure characters in the quote option in the \copy
command to work around this issue.

When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is
mentioned in the error message if characters outside the range are chosen:

\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252'
QUOTE as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff

If you are actually on Postgres 7.4 the above would not be a viable command.

I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character
as the quote character. Instead, I am limited to the range of
characters that can be encoded as a single byte in UTF-8. Actually, 0x00
is not accepted either, so the range is 0x01 - 0x7F.

Is this a bug or expected behaviour ?

Is it the case that the server does the actual CSV parsing, and that
given that my server is in UTF8, I am therefore limited to single-byte
UTF8 characters ?

Actually depending on version you may be limited to ASCII.

regards,
Martin

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Martin Waite
waite.134@gmail.com
In reply to: Adrian Klaver (#2)
Re: encoding confusion with \copy command

Hi Adrian,

I apologise - I meant 9.4

regards,
Martin

On 17 September 2014 14:35, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Show quoted text

On 09/17/2014 03:03 AM, Martin Waite wrote:

Hi,

I have a postgresql 7.4 server and client on Centos 6.4. The database
server is using UTF-8 encoding.

First I think we need to establish what version of Postgres you using. Are
you really using 7.4?

I have been exploring the use of the \copy command for importing CSV
data generated by SQL Server 2008. SQL Server 2008 export tool does not
escape quotes that are in the content of fields, and so it is useful to
be able to specify obscure characters in the quote option in the \copy
command to work around this issue.

When I run the following commands in psql, I am surprised that QUOTE is
limited to characters in the range 0x01 - 0x7f, and that UTF8 is
mentioned in the error message if characters outside the range are chosen:

\encoding WIN1252
\copy yuml from '/tmp/yuml.csv' WITH CSV HEADER ENCODING 'WIN1252'
QUOTE as E'\xff';
ERROR: invalid byte sequence for encoding "UTF8": 0xff

If you are actually on Postgres 7.4 the above would not be a viable
command.

I thought that if the client (psql) is WIN1252, and the CSV file is
specified as WIN1252, then I could specify any valid WIN1252 character
as the quote character. Instead, I am limited to the range of
characters that can be encoded as a single byte in UTF-8. Actually, 0x00
is not accepted either, so the range is 0x01 - 0x7F.

Is this a bug or expected behaviour ?

Is it the case that the server does the actual CSV parsing, and that
given that my server is in UTF8, I am therefore limited to single-byte
UTF8 characters ?

Actually depending on version you may be limited to ASCII.

regards,
Martin

--
Adrian Klaver
adrian.klaver@aklaver.com

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Martin Waite (#3)
Re: encoding confusion with \copy command

On 09/17/2014 06:48 AM, Martin Waite wrote:

Hi Adrian,

I apologise - I meant 9.4

Looks like you will need an intermediate step. A quick search found the
below which might offer a solution:

http://www.excel-sql-server.com/sql-server-export-to-excel-using-bcp-sqlcmd-csv.htm

FYI the good stuff is at the bottom.

There is also this SO answer that refers to SQL Server Management
Studio(not sure if you have):

http://stackoverflow.com/questions/6115054/how-to-get-export-output-in-real-csv-format-in-sql-server-managment-studio

regards,
Martin

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Martin Waite
waite.134@gmail.com
In reply to: Adrian Klaver (#4)
Re: encoding confusion with \copy command

You are right that I need an intermediate step. I will probably use a CSV
parser that is liberal in what it accepts, but writes out strict CSV data
suitable for postgres.

Thanks for the help.

On 17 September 2014 15:40, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Show quoted text

On 09/17/2014 06:48 AM, Martin Waite wrote:

Hi Adrian,

I apologise - I meant 9.4

Looks like you will need an intermediate step. A quick search found the
below which might offer a solution:

http://www.excel-sql-server.com/sql-server-export-to-
excel-using-bcp-sqlcmd-csv.htm

FYI the good stuff is at the bottom.

There is also this SO answer that refers to SQL Server Management
Studio(not sure if you have):

http://stackoverflow.com/questions/6115054/how-to-get-
export-output-in-real-csv-format-in-sql-server-managment-studio

regards,
Martin

--
Adrian Klaver
adrian.klaver@aklaver.com

#6John DeSoi
desoi@pgedit.com
In reply to: Martin Waite (#5)
Re: encoding confusion with \copy command

On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:

You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.

If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.

John DeSoi, Ph.D.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Adrian Klaver
adrian.klaver@aklaver.com
In reply to: John DeSoi (#6)
Re: encoding confusion with \copy command

On 09/17/2014 06:08 PM, John DeSoi wrote:

On Sep 17, 2014, at 11:52 AM, Martin Waite <waite.134@gmail.com> wrote:

You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.

If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.

Have you tried Save As Unicode:

http://www.ablebits.com/office-addins-blog/2014/04/24/convert-excel-csv/#export-csv-utf8

I vaguely remember using Access to pull in an Excel file and saving CSV
from there. I also seem to remember an option when saving a text file to
create an export template that allowed you to specify the csv options
and formatting. This was with the Pro version of Office and the data
tools installed.

What I do now is use the Python xlrd module to read in the Excel file
and then the csv module to output a CSV file.

John DeSoi, Ph.D.

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8John R Pierce
pierce@hogranch.com
In reply to: John DeSoi (#6)
Re: encoding confusion with \copy command

On 9/17/2014 6:08 PM, John DeSoi wrote:

On Sep 17, 2014, at 11:52 AM, Martin Waite<waite.134@gmail.com> wrote:

You are right that I need an intermediate step. I will probably use a CSV parser that is liberal in what it accepts, but writes out strict CSV data suitable for postgres.

If you find such a utility, please share. My clients love Excel, but it takes perfectly valid CSV files and makes them unreadable by Postgres. In particular, Excel saves rows with fewer columns than the header header row if the cells are empty. It also mangles valid UTF-8. I often take Excel CSV files and re-save them from Open Office to fix them for Postgres import.

I've used the npgsql .net driver http://npgsql.projects.pgfoundry.org/
with excel... you can directly query tables into spreadsheets via the
'data' feature in excel.

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general