Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

Started by Meetesh Kariaover 18 years ago8 messageshackers

meetesh.karia@gmail.com

over 18 years ago

Hi all,

I'm trying to migrate from 8.0.14 on Windows (Vista Home Premium) to
8.3.0 and I've been trying to solve what appears to be an encoding
problem. My old db was in the UNICODE encoding. I know that this isn't
supported on 8.0.x, but it was a restore of a db from a Linux
environment and postgres didn't appear to have any problems with it.

My 8.3 server and client encodings are UTF8 and I used pg_dumpall (I
tried the 8.0 and 8.3 versions) to dump the db. However, when I tried
to restore the db, I got an error during index creation which wouldn't
let me create a unique index on a column that had all unique values (it
had the index in 8.0 and a group by having query with no indexes on the
table confirms uniqueness). The thing that this column does have
however is values like:

'Bruehl'
'Brühl'

I created a blank table with the unique index on it and inserted rows
one at a time until I confirmed that it was the above values that were
causing a problem. Running the following query shows the difference in
the hex encoded values (I changed my client encoding to WIN1250 to get
the below to show up correctly):

select name, encode(decode(name, 'escape'), 'hex') from ...

name | encode
---------------+----------------------------
Daniel Brühl | 44616e69656c204272c3bc686c
Daniel Bruehl | 44616e69656c2042727565686c
(2 rows)

I've also tried exporting using an encoding of WIN1250 but I get errors
like this:

pg_dump: Error message from server: ERROR: character 0xc383 of encoding
"UNICODE" has no equivalent in "WIN1250"

Anyone have any thoughts or suggestions? Why would the index creation
fail? Is there a workaround?

Thanks,
Meetesh

Meetesh Karia

meetesh.karia@gmail.com

over 18 years ago

In reply to: Meetesh Karia (#1)

Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

One quick addition to this:

The column I'm creating this unique index on is a varchar(255) and the
command I was running was:

create unique index foo_name on foo (name);

If I use the following, it now works:

create unique index foo_name on foo (cast(name as bytea));

Thoughts?

Meetesh

Meetesh Karia wrote:

Show quoted text

Hi all,

I'm trying to migrate from 8.0.14 on Windows (Vista Home Premium) to
8.3.0 and I've been trying to solve what appears to be an encoding
problem. My old db was in the UNICODE encoding. I know that this
isn't supported on 8.0.x, but it was a restore of a db from a Linux
environment and postgres didn't appear to have any problems with it.

My 8.3 server and client encodings are UTF8 and I used pg_dumpall (I
tried the 8.0 and 8.3 versions) to dump the db. However, when I tried
to restore the db, I got an error during index creation which wouldn't
let me create a unique index on a column that had all unique values
(it had the index in 8.0 and a group by having query with no indexes
on the table confirms uniqueness). The thing that this column does
have however is values like:

'Bruehl'
'Brühl'

I created a blank table with the unique index on it and inserted rows
one at a time until I confirmed that it was the above values that were
causing a problem. Running the following query shows the difference
in the hex encoded values (I changed my client encoding to WIN1250 to
get the below to show up correctly):

select name, encode(decode(name, 'escape'), 'hex') from ...

name | encode
---------------+----------------------------
Daniel Brühl | 44616e69656c204272c3bc686c
Daniel Bruehl | 44616e69656c2042727565686c
(2 rows)

I've also tried exporting using an encoding of WIN1250 but I get
errors like this:

pg_dump: Error message from server: ERROR: character 0xc383 of
encoding "UNICODE" has no equivalent in "WIN1250"

Anyone have any thoughts or suggestions? Why would the index creation
fail? Is there a workaround?

Thanks,
Meetesh

Robert Treat

xzilla@users.sourceforge.net

over 18 years ago

In reply to: Meetesh Karia (#2)

Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

On Wednesday 12 March 2008 09:37, Meetesh Karia wrote:

One quick addition to this:

The column I'm creating this unique index on is a varchar(255) and the
command I was running was:

create unique index foo_name on foo (name);

If I use the following, it now works:

create unique index foo_name on foo (cast(name as bytea));

Thoughts?

Meetesh

Meetesh Karia wrote:

Hi all,

I'm trying to migrate from 8.0.14 on Windows (Vista Home Premium) to
8.3.0 and I've been trying to solve what appears to be an encoding
problem. My old db was in the UNICODE encoding. I know that this
isn't supported on 8.0.x, but it was a restore of a db from a Linux
environment and postgres didn't appear to have any problems with it.

My 8.3 server and client encodings are UTF8 and I used pg_dumpall (I
tried the 8.0 and 8.3 versions) to dump the db. However, when I tried
to restore the db, I got an error during index creation which wouldn't
let me create a unique index on a column that had all unique values
(it had the index in 8.0 and a group by having query with no indexes
on the table confirms uniqueness). The thing that this column does
have however is values like:

'Bruehl'
'Brühl'

I created a blank table with the unique index on it and inserted rows
one at a time until I confirmed that it was the above values that were
causing a problem. Running the following query shows the difference
in the hex encoded values (I changed my client encoding to WIN1250 to
get the below to show up correctly):

select name, encode(decode(name, 'escape'), 'hex') from ...

name | encode
---------------+----------------------------
Daniel Brühl | 44616e69656c204272c3bc686c
Daniel Bruehl | 44616e69656c2042727565686c
(2 rows)

I've also tried exporting using an encoding of WIN1250 but I get
errors like this:

pg_dump: Error message from server: ERROR: character 0xc383 of
encoding "UNICODE" has no equivalent in "WIN1250"

Anyone have any thoughts or suggestions? Why would the index creation
fail? Is there a workaround?

I'm not convinced your problem isn't solved by proper setting of
client_encoding for both input and output:

pagila=# create table x (r varchar(255) unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key" for
table "x "
CREATE TABLE
pagila=# set client_encoding=WIN1250;
SET
pagila=# insert into x (r) values ('Daniel Brühl');
INSERT 0 1
pagila=# insert into x (r) values ('Daniel Bruehl');
INSERT 0 1
pagila=# select * from x;
r
---------------
Daniel Brühl
Daniel Bruehl
(2 rows)

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Meetesh Karia

meetesh.karia@gmail.com

over 18 years ago

In reply to: Robert Treat (#3)

Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

Thanks for your response Robert.

Unfortunately I don't think that will work (unless I'm misunderstanding
something).

I can't export using a client encoding of WIN1250 because I have true
UNICODE chars which can't be represented with it (I've tried and I get
the error below). Therefore, I have a sql dump with UNICODE chars and
setting the client encoding to WIN1250 when I run the sql dump script
would toast the characters.

Additionally, here's what I get when I run your test below (my server
encoding is UTF-8):

ltefull=# create table x (r varchar(255) unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key" for
table "x"
CREATE TABLE
ltefull=#
ltefull=# set client_encoding=WIN1250;
SET
ltefull=# insert into x (r) values ('Daniel Brï¿œhl');
INSERT 0 1
ltefull=#
ltefull=# insert into x (r) values ('Daniel Bruehl');
ERROR: duplicate key value violates unique constraint "x_r_key"

Meetesh

Robert Treat wrote:

Show quoted text

On Wednesday 12 March 2008 09:37, Meetesh Karia wrote:

One quick addition to this:

The column I'm creating this unique index on is a varchar(255) and the
command I was running was:

create unique index foo_name on foo (name);

If I use the following, it now works:

create unique index foo_name on foo (cast(name as bytea));

Thoughts?

Meetesh

Meetesh Karia wrote:

Hi all,

I'm trying to migrate from 8.0.14 on Windows (Vista Home Premium) to
8.3.0 and I've been trying to solve what appears to be an encoding
problem. My old db was in the UNICODE encoding. I know that this
isn't supported on 8.0.x, but it was a restore of a db from a Linux
environment and postgres didn't appear to have any problems with it.

My 8.3 server and client encodings are UTF8 and I used pg_dumpall (I
tried the 8.0 and 8.3 versions) to dump the db. However, when I tried
to restore the db, I got an error during index creation which wouldn't
let me create a unique index on a column that had all unique values
(it had the index in 8.0 and a group by having query with no indexes
on the table confirms uniqueness). The thing that this column does
have however is values like:

'Bruehl'
'Brï¿œhl'

I created a blank table with the unique index on it and inserted rows
one at a time until I confirmed that it was the above values that were
causing a problem. Running the following query shows the difference
in the hex encoded values (I changed my client encoding to WIN1250 to
get the below to show up correctly):

select name, encode(decode(name, 'escape'), 'hex') from ...

name | encode
---------------+----------------------------
Daniel Brï¿œhl | 44616e69656c204272c3bc686c
Daniel Bruehl | 44616e69656c2042727565686c
(2 rows)

I've also tried exporting using an encoding of WIN1250 but I get
errors like this:

pg_dump: Error message from server: ERROR: character 0xc383 of
encoding "UNICODE" has no equivalent in "WIN1250"

Anyone have any thoughts or suggestions? Why would the index creation
fail? Is there a workaround?

I'm not convinced your problem isn't solved by proper setting of
client_encoding for both input and output:

pagila=# create table x (r varchar(255) unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key" for
table "x "
CREATE TABLE
pagila=# set client_encoding=WIN1250;
SET
pagila=# insert into x (r) values ('Daniel Brï¿œhl');
INSERT 0 1
pagila=# insert into x (r) values ('Daniel Bruehl');
INSERT 0 1
pagila=# select * from x;
r
---------------
Daniel Brï¿œhl
Daniel Bruehl
(2 rows)

Meetesh Karia

meetesh.karia@gmail.com

over 18 years ago

In reply to: Meetesh Karia (#4)

Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

Thanks to help from Sergey Benner, I was able to narrow the problem
down. When I first initialized the PG cluster, I used the following
command:

initdb -D d:/postgres/8.3/data -E UTF8 -U postgres

And that set my LC_COLLATE and LC_CTYPE values to 'English_United
States.1252'. 'en_US-UTF-8' would be ideal for me but that doesn't seem
to exist on Windows.

Anyhow, when I recreated the PG cluster using this command:

initdb -D ./data -E UTF8 --locale=C -U postgres

It now correctly handles the values and allows me to create the unique
index. I can use the C locale, but I'm a bit surprised that the default
windows locale wouldn't work.

Is that to be expected?

Thanks,
Meetesh

Meetesh Karia wrote:

Show quoted text

Thanks for your response Robert.

Unfortunately I don't think that will work (unless I'm
misunderstanding something).

I can't export using a client encoding of WIN1250 because I have true
UNICODE chars which can't be represented with it (I've tried and I get
the error below). Therefore, I have a sql dump with UNICODE chars and
setting the client encoding to WIN1250 when I run the sql dump script
would toast the characters.

Additionally, here's what I get when I run your test below (my server
encoding is UTF-8):

ltefull=# create table x (r varchar(255) unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key"
for table "x"
CREATE TABLE
ltefull=#
ltefull=# set client_encoding=WIN1250;
SET
ltefull=# insert into x (r) values ('Daniel Brï¿œhl');
INSERT 0 1
ltefull=#
ltefull=# insert into x (r) values ('Daniel Bruehl');
ERROR: duplicate key value violates unique constraint "x_r_key"

Meetesh

Robert Treat wrote:

On Wednesday 12 March 2008 09:37, Meetesh Karia wrote:

One quick addition to this:

The column I'm creating this unique index on is a varchar(255) and the
command I was running was:

create unique index foo_name on foo (name);

If I use the following, it now works:

create unique index foo_name on foo (cast(name as bytea));

Thoughts?

Meetesh

Meetesh Karia wrote:

Hi all,

I'm trying to migrate from 8.0.14 on Windows (Vista Home Premium) to
8.3.0 and I've been trying to solve what appears to be an encoding
problem. My old db was in the UNICODE encoding. I know that this
isn't supported on 8.0.x, but it was a restore of a db from a Linux
environment and postgres didn't appear to have any problems with it.

My 8.3 server and client encodings are UTF8 and I used pg_dumpall (I
tried the 8.0 and 8.3 versions) to dump the db. However, when I tried
to restore the db, I got an error during index creation which wouldn't
let me create a unique index on a column that had all unique values
(it had the index in 8.0 and a group by having query with no indexes
on the table confirms uniqueness). The thing that this column does
have however is values like:

'Bruehl'
'Brï¿œhl'

I created a blank table with the unique index on it and inserted rows
one at a time until I confirmed that it was the above values that were
causing a problem. Running the following query shows the difference
in the hex encoded values (I changed my client encoding to WIN1250 to
get the below to show up correctly):

select name, encode(decode(name, 'escape'), 'hex') from ...

name | encode
---------------+----------------------------
Daniel Brï¿œhl | 44616e69656c204272c3bc686c
Daniel Bruehl | 44616e69656c2042727565686c
(2 rows)

I've also tried exporting using an encoding of WIN1250 but I get
errors like this:

pg_dump: Error message from server: ERROR: character 0xc383 of
encoding "UNICODE" has no equivalent in "WIN1250"

Anyone have any thoughts or suggestions? Why would the index creation
fail? Is there a workaround?

I'm not convinced your problem isn't solved by proper setting of
client_encoding for both input and output:

pagila=# create table x (r varchar(255) unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key" for
table "x "
CREATE TABLE
pagila=# set client_encoding=WIN1250;
SET
pagila=# insert into x (r) values ('Daniel Brï¿œhl');
INSERT 0 1
pagila=# insert into x (r) values ('Daniel Bruehl');
INSERT 0 1
pagila=# select * from x;
r
---------------
Daniel Brï¿œhl
Daniel Bruehl
(2 rows)

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Meetesh Karia (#4)

Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

Meetesh Karia <meetesh.karia@gmail.com> writes:

Additionally, here's what I get when I run your test below (my server
encoding is UTF-8):

ltefull=# create table x (r varchar(255) unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "x_r_key" for
table "x"
CREATE TABLE
ltefull=#
ltefull=# set client_encoding=WIN1250;
SET
ltefull=# insert into x (r) values ('Daniel Br�hl');
INSERT 0 1
ltefull=#
ltefull=# insert into x (r) values ('Daniel Bruehl');
ERROR: duplicate key value violates unique constraint "x_r_key"

You said this was on Windows, right?

I was about to say "that should be impossible", until I looked at
varstr_cmp() and realized that whoever put in the WIN32/UTF8
special case omitted this part:

/*
* In some locales strcoll() can claim that nonidentical strings are
* equal. Believing that would be bad news for a number of reasons,
* so we follow Perl's lead and sort "equal" strings according to
* strcmp().
*/
if (result == 0)
result = strcmp(a1p, a2p);

So we behave differently on Windows (with UTF8) than anywhere else.
This is pretty nasty, not least because it means that texteq is
inconsistent with other text comparison operators.

I think this is a "must fix" bug for 8.3.1, anyone disagree?

regards, tom lane

Bruce Momjian

bruce@momjian.us

over 18 years ago

In reply to: Tom Lane (#6)

Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I think this is a "must fix" bug for 8.3.1, anyone disagree?

Agreed.

It seems we should collect cases like this for the regression tests. The only
one I was aware of previously was the Turkish one.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

Meetesh Karia

meetesh.karia@gmail.com

over 18 years ago

In reply to: Bruce Momjian (#7)

Re: Encoding problems with migration from 8.0.14 to 8.3.0 on Windows

Thanks for the confirmation of this. For now I'll continue to use the C
locale and I'll switch with 8.3.1.

Meetesh

Gregory Stark wrote:

Show quoted text

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

I think this is a "must fix" bug for 8.3.1, anyone disagree?

Agreed.

It seems we should collect cases like this for the regression tests. The only
one I was aware of previously was the Turkish one.