BUG #3737: lower/upper fails to match extended chars in LATIN1

Started by Jonas Forsmanover 18 years ago7 messagesbugs
Jump to latest
#1Jonas Forsman
jonas@axier.org

The following bug has been logged online:

Bug reference: 3737
Logged by: Jonas Forsman
Email address: jonas@axier.org
PostgreSQL version: 8.1.10
Operating system: Ubuntu 6.06 LTS
Description: lower/upper fails to match extended chars in LATIN1
Details:

Try:
select * from table where lower(address) like '%å%'

This select fails to find addresses including capital Å and similars in
LATIN1 (like Å, Ä, Ö).

The same bug is valid for upper.

#2Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jonas Forsman (#1)
Re: BUG #3737: lower/upper fails to match extended chars in LATIN1

Jonas Forsman wrote:

The following bug has been logged online:

Bug reference: 3737
Logged by: Jonas Forsman
Email address: jonas@axier.org
PostgreSQL version: 8.1.10
Operating system: Ubuntu 6.06 LTS
Description: lower/upper fails to match extended chars in LATIN1
Details:

Try:
select * from table where lower(address) like '%Ã¥%'

This select fails to find addresses including capital Ã… and similars in
LATIN1 (like Å, Ä, Ö).

The behavior of lower() depends on your locale. What locale are you
running in? Make sure your locale matches the encoding.

Exact locale names and behavior are defined by the operating system, but
for example on Debian, which is probably the same as Ubuntu in this
case, use

initdb -D data --locale=sv_SE.iso88591 --encoding=LATIN1

for Swedish.

Also make sure that your client_encoding is set correctly. For example,
if you run the query in psql in a terminal that uses UTF-8 encoding, you
need to "SET client_encoding='UTF-8'"

See the manual for more details:
http://www.postgresql.org/docs/8.1/static/charset.html#LOCALE

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jonas Forsman (#1)
Re: BUG #3737: lower/upper fails to match extended chars in LATIN1

Jonas Forsman wrote:

it is possibly a locale-error. may I ask:
1. How do I check the locale?

Within psql:

show lc_ctype; (and other lc_* variables as well)
show server_encoding;
show client_encoding;

2. Can I change this on already running db:s ?

Unfortunately you can't. You'll have to re-initdb. :-(

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#2)
Re: BUG #3737: lower/upper fails to match extended chars in LATIN1

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Jonas Forsman wrote:

Try:
select * from table where lower(address) like '%Ã¥%'

This select fails to find addresses including capital Ã… and similars in
LATIN1 (like Å, Ä, Ö).

Isn't à an upper-case letter? In which case lower(address) will never match it
since by definition lower(address) will only contain characters which are
lower case. That is if address contains a "Ã" then lower(address) will contain
an "ã" instead which won't match the "Ã" in the pattern.

I think you either need to put a lower() on both sides of the LIKE or use
ILIKE.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jonas Forsman (#1)
Re: BUG #3737: lower/upper fails to match extended chars in LATIN1

(please keep the list cc'd)

Jonas Forsman wrote:

if I cannot change charset within a postgres installation
and then benefit from all the features in postgres isn't that to be seen
as a bug
or at least it should be a workaround easier than dumping everything
which I think is a too large risk to take if not really neccessary.

Well, it's not a bug in the sense that it's documented and working as
designed.

But I agree that it sucks. Unfortunately changing the locale on-the-fly
isn't as easy as it might sound. The locale affects the sort-ordering,
for example, which means that you'd have to at least recreate all
indexes with text data in them anyway.

Changing the character encoding would be even worse. You'd have to
re-encode all data, which means rewriting all data in the database.

The best solution would be to support per-database and even per-table
and per-column locales and character encodings, so that you could do the
conversions piece by piece. Patches are welcome :-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6Jonas Forsman
jonas.forsman@openit.se
In reply to: Heikki Linnakangas (#5)
Re: BUG #3737: lower/upper fails to match extended chars in LATIN1

From my point of view (administrators view, not developers) it is such
a drawback
in postgres implementation that it seriously challenges the position as
a competitor
to the big dragons. (IF they suffer from the same missing feature)

Recreating indicies is maybe not great but can be done quit easily compared
to reinitializing a db or recoding an entire database.
Sometimes a bit of a solution is all what it is needed. Would it be possible
to change the locale on the fly and only recreate indicies afterwards?

/ Jonas

ps. I'd love to send you a patch, but there is no way it would be a good
thing
time/performance wise for me doing this. I am trying to be a great
ambassador though,
spreading the words everywhere. ds.

Heikki Linnakangas skrev:

(please keep the list cc'd)

Jonas Forsman wrote:

if I cannot change charset within a postgres installation
and then benefit from all the features in postgres isn't that to be
seen as a bug
or at least it should be a workaround easier than dumping everything
which I think is a too large risk to take if not really neccessary.

Well, it's not a bug in the sense that it's documented and working as
designed.

But I agree that it sucks. Unfortunately changing the locale
on-the-fly isn't as easy as it might sound. The locale affects the
sort-ordering, for example, which means that you'd have to at least
recreate all indexes with text data in them anyway.

Changing the character encoding would be even worse. You'd have to
re-encode all data, which means rewriting all data in the database.

The best solution would be to support per-database and even per-table
and per-column locales and character encodings, so that you could do
the conversions piece by piece. Patches are welcome :-).

--
------------------------------------------------
Jonas Forsman
VD, Axier Technologies AB
Malmgatan 2
SE-703 54 �rebro,
SWEDEN
Tel.: +46 (0)19 12 00 90
Mobile: +46 (0)73 506 33 00
jonas.forsman@openit.se

--------------------------------------------------------------------
This message (including any attachments) contains confidential
information intended for a specific individual and purpose,
and is protected by law. If you are not the intended recipient,
you should delete this message and are hereby notified that any disclosure,
copying, or distribution of this message, or the taking of any action
based on it, is strictly prohibited.
--------------------------------------------------------------------

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jonas Forsman (#6)
Re: BUG #3737: lower/upper fails to match extended chars in LATIN1

Jonas Forsman wrote:

Recreating indicies is maybe not great but can be done quit easily compared
to reinitializing a db or recoding an entire database.
Sometimes a bit of a solution is all what it is needed. Would it be
possible
to change the locale on the fly and only recreate indicies afterwards?

Hmm. I think you could change LC_CTYPE by just hacking the control file.
LC_COLLATE would require the reindexing. Perhaps you could hack that as
well, and then manually run REINDEX on all databases.

Did you confirm what the problem in your case really was? As Greg
pointed out, the example query you posted was in fact flawed..

BTW, 8.3 is more strict on this, and won't allow you to initdb a cluster
with mismatching locale and encoding.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com