Strange query results with invalid multibyte character

Started by Joost Kraaijeveldover 19 years ago9 messagesgeneral
Jump to latest
#1Joost Kraaijeveld
J.Kraaijeveld@Askesis.nl

Hi,

I have a strange problem.

I have 2 database, both created with:
CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;

Running the queries below on the first database is OK and do what I expect.

If I create a backup of the first datase and restore that backup in the second database the following happens:
The first query (see below) returns 17 records, all containing 'Boonk%'.
The second query (see below)returns 'ERROR: invalid multibyte character for locale'

Query 1:
select lastname from salesordercustomer where lower(lastname) like 'boonk%'

Query 2:
select lastname from salesordercustomer where upper(lastname) like 'BOONK%'

Both database are running *exactly* the same version of Debian Etch AMD64 and *exactly* the same version of PostgreSQL (8.1.4 for AMD64) ....

Can anyone explain this to me?

TIA

Joost

#2Gevik Babakhani
pgdev@xs4all.nl
In reply to: Joost Kraaijeveld (#1)
Re: Strange query results with invalid multibyte

Hi Joost. (hoe gaat het ermee?)

I would like to test this too. Could you please provide the data you are
inserting into the database.

Regards,
Gevik

On Wed, 2006-09-27 at 13:13 +0200, Joost Kraaijeveld wrote:

Hi,

I have a strange problem.

I have 2 database, both created with:
CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;

Running the queries below on the first database is OK and do what I expect.

If I create a backup of the first datase and restore that backup in the second database the following happens:
The first query (see below) returns 17 records, all containing 'Boonk%'.
The second query (see below)returns 'ERROR: invalid multibyte character for locale'

Query 1:
select lastname from salesordercustomer where lower(lastname) like 'boonk%'

Query 2:
select lastname from salesordercustomer where upper(lastname) like 'BOONK%'

Both database are running *exactly* the same version of Debian Etch AMD64 and *exactly* the same version of PostgreSQL (8.1.4 for AMD64) ....

Can anyone explain this to me?

TIA

Joost

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

--
Regards,
Gevik Babakhani

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joost Kraaijeveld (#1)
Re: Strange query results with invalid multibyte character

"Joost Kraaijeveld" <J.Kraaijeveld@Askesis.nl> writes:

I have 2 database, both created with:
CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;

But have they got the same locale settings (lc_collate, lc_ctype)?

regards, tom lane

#4Joost Kraaijeveld
J.Kraaijeveld@Askesis.nl
In reply to: Tom Lane (#3)
Re: Strange query results with invalid multibyte

Hi Tom,

On Wed, 2006-09-27 at 11:00 -0400, Tom Lane wrote:

"Joost Kraaijeveld" <J.Kraaijeveld@Askesis.nl> writes:

I have 2 database, both created with:
CREATE DATABASE muntdev WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default;

But have they got the same locale settings (lc_collate, lc_ctype)?

According to the postgresql.conf of the *working* database the locales
are:
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

According to the other obe:
lc_messages = 'en_US'
lc_monetary = 'en_US'
lc_numeric = 'en_US'
lc_time = 'en_US'

Could this be the difference? Is there any documentation available
somewhere, on how these locale settings work and interact (in
combination with the server and/or client platform???), besides the
explanation in the PostgreSQL manual
(http://www.postgresql.org/docs/8.1/interactive/charset.html#LOCALE)
(which is to terse for me to understand)?

TIA

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joost Kraaijeveld (#4)
Re: Strange query results with invalid multibyte character

Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:

On Wed, 2006-09-27 at 11:00 -0400, Tom Lane wrote:

But have they got the same locale settings (lc_collate, lc_ctype)?

According to the postgresql.conf of the *working* database the locales
are:
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

According to the other obe:
lc_messages = 'en_US'
lc_monetary = 'en_US'
lc_numeric = 'en_US'
lc_time = 'en_US'

You did not show us the settings that actually count, but based on this
I'm guessing they are en_US also. What you need to find out next is
what character set encoding that locale implies on your machine. I'm
betting it's not utf8 though :-(

regards, tom lane

#6Joost Kraaijeveld
J.Kraaijeveld@Askesis.nl
In reply to: Tom Lane (#5)
Re: Strange query results with invalid multibyte

On Wed, 2006-09-27 at 12:10 -0400, Tom Lane wrote:

Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:

On Wed, 2006-09-27 at 11:00 -0400, Tom Lane wrote:

But have they got the same locale settings (lc_collate, lc_ctype)?

According to the postgresql.conf of the *working* database the locales
are:
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

According to the other obe:
lc_messages = 'en_US'
lc_monetary = 'en_US'
lc_numeric = 'en_US'
lc_time = 'en_US'

You did not show us the settings that actually count, but based on this
I'm guessing they are en_US also

Ah, sorry: "show lc_collate" and "show lc_ctype" confirm your guess.

What you need to find out next is
what character set encoding that locale implies on your machine. I'm
betting it's not utf8 though :-(

I am not sure what you mean by that but maybe this helps:

the output of "locale" on the working machine is:

LANG=
LANGUAGE=en_US:en_GB
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

The output of the same command on the non-working machine:

LANG=en_US
LANGUAGE=en_NL:en_US:en_GB:en
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=

If this is not what you mean, could you help me in the right direction?

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joost Kraaijeveld (#6)
Re: Strange query results with invalid multibyte

Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:

On Wed, 2006-09-27 at 12:10 -0400, Tom Lane wrote:

What you need to find out next is
what character set encoding that locale implies on your machine. I'm
betting it's not utf8 though :-(

If this is not what you mean, could you help me in the right direction?

Try this command:

LANG=en_US locale charmap

regards, tom lane

#8Joost Kraaijeveld
J.Kraaijeveld@Askesis.nl
In reply to: Tom Lane (#7)
Re: Strange query results with invalid multibyte

Hi Tom,

Thanks for putting up with the questions.

On Wed, 2006-09-27 at 14:58 -0400, Tom Lane wrote:

LANG=en_US locale charmap

I have done this on both machines:

The working machine says: ISO-8859-1
The not working machine says: ISO-8859-1

I still do not understand what is happening and maybe more important,
how I can solve this (without re-installing the OS / database).

A better understanding of how the server OS, database, codepages, client
OS, charmaps etc work would be nice....

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
web: www.askesis.nl

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joost Kraaijeveld (#8)
Re: Strange query results with invalid multibyte

Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:

On Wed, 2006-09-27 at 14:58 -0400, Tom Lane wrote:

LANG=en_US locale charmap

The working machine says: ISO-8859-1
The not working machine says: ISO-8859-1

OK, so the problem is that you have a locale that expects ISO-8859-1
encoding, but the database is set to UTF8 encoding, and that means
it may contain characters that the locale can't deal with.
Locale-dependent operations ... like upper() ... will fail when that
happens.

If we had a more reliable way of telling what encoding the locale
expects, we'd probably forbid creation of databases with incompatible
encodings.

I still do not understand what is happening and maybe more important,
how I can solve this (without re-installing the OS / database).

If you want to use UTF8 then you're stuck with a re-initdb I'm afraid.

regards, tom lane