Character encodings...

Started by Michael Sobolev about 26 years ago | 13 messages | general
#1Michael Sobolev
mss@transas.com

I am trying to fill up a database using psql program. A file I have prepared
contains Russian in KOI8-R encoding. When I try to process this file using
`psql -f file db', it fails: no diagnostics, nothing; it just shows that EOF is
reached. When I replace Russian letters with something in ASCII, it works just
fine. The puzzling part is that my second file gets processed just fine.

Where should I look? What additional information is needed? :)

Thanks,

--
Mike

#2Oleg Broytmann
phd@phd.russ.ru
In reply to: Michael Sobolev (#1)
Re: Character encodings...

On Thu, 13 Apr 2000, Michael Sobolev wrote:

I am trying to fill up a database using psql program. A file I have prepared
contains Russian in KOI8-R encoding. When I try to process this file using
`psql -f file db', it fails: no diagnostics, nothing; it just shows that EOF is
reached. When I replace Russian letters with something in ASCII, it works just
fine. The puzzling part is that my second file gets processed just fine.

Where should I look? What additional information is needed? :)

OS, locale, Postgres version, whether Postgres was compiled with locale,
multibyte...

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2.1/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#3Michael Sobolev
mss@transas.com
In reply to: Oleg Broytmann (#2)
Re: Character encodings...

On Thu, Apr 13, 2000 at 10:20:39AM +0000, Oleg Broytmann wrote:

OS, locale, Postgres version, whether Postgres was compiled with locale,
multibyte...

Debian GNU/Linux (frozen), 6.5.3-17 (-17 -- debian revision), yes, =UNICODE.

--
Mike

#4Oliver Elphick
olly@lfix.co.uk
In reply to: Michael Sobolev (#3)
Re: Character encodings...

Michael Sobolev wrote:

On Thu, Apr 13, 2000 at 10:20:39AM +0000, Oleg Broytmann wrote:

OS, locale, Postgres version, whether Postgres was compiled with locale,
multibyte...

Debian GNU/Linux (frozen), 6.5.3-17 (-17 -- debian revision), yes, =UNICODE.

Turn on logging in the backend (edit /etc/postgresql/postmaster.init) and
restart the postmaster (/etc/init.d/postgresql restart). See what you get
in the log.

--
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
PGP key from public servers; key ID 32B8FAA1
========================================
"I sought the LORD, and he heard me, and delivered me
from all my fears." Psalms 34:41

#5Oleg Broytmann
phd@phd.russ.ru
In reply to: Michael Sobolev (#3)
Re: Character encodings...

On Thu, 13 Apr 2000, Michael Sobolev wrote:

OS, locale, Postgres version, whether Postgres was compiled with locale,
multibyte...

Debian GNU/Linux (frozen), 6.5.3-17 (-17 -- debian revision), yes, =UNICODE.

Not sure how well Postgres works with UNICODE. It works pretty well with
KOI8-R and Windows-1251 encodings...

Oleg.
----
Oleg Broytmann http://members.xoom.com/phd2.1/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#6Michael Sobolev
mss@transas.com
In reply to: Oliver Elphick (#4)
Re: Character encodings...

On Thu, Apr 13, 2000 at 11:54:17AM +0100, Oliver Elphick wrote:

Turn on logging in the backend (edit /etc/postgresql/postmaster.init) and
restart the postmaster (/etc/init.d/postgresql restart). See what you get
in the log.

What level of debug should be sufficient?

I have the impression that it is psql that does not process the input
correctly.

I have a very simple statement:

insert into news values ('2000-04-13', NULL, '');

This works just fine. Now I replace '' with 'A' (A -- 65). It still works
just fine. Now I replace this latin A with Russian A. And psql shows:

$ psql -f test.sql stuff
insert into news values ('2000-04-12', NULL, 'А');
EOF

--
Mike

#7Oliver Elphick
olly@lfix.co.uk
In reply to: Michael Sobolev (#6)
Re: Character encodings...

Michael Sobolev wrote:

On Thu, Apr 13, 2000 at 11:54:17AM +0100, Oliver Elphick wrote:

Turn on logging in the backend (edit /etc/postgresql/postmaster.init) and
restart the postmaster (/etc/init.d/postgresql restart). See what you get
in the log.

What level of debug should be sufficient?

2

Also set PGECHO in postmaster.init, so that queries are echoed in the log.

I have the impression that it is psql that does not process the input
correctly.

I have a very simple statement:

insert into news values ('2000-04-13', NULL, '');

This works just fine. Now I replace '' with 'A' (A -- 65). It still works
just fine. Now I replace this latin A with Russian A. And psql shows:

$ psql -f test.sql stuff
insert into news values ('2000-04-12', NULL, 'А');
EOF

The trouble is, I don't know how to test this. How do I produce Russian
characters on an English keyboard?

--
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight http://www.lfix.co.uk/oliver
PGP key from public servers; key ID 32B8FAA1
========================================
"I sought the LORD, and he heard me, and delivered me
from all my fears." Psalms 34:41

#8Michael Sobolev
mss@transas.com
In reply to: Oliver Elphick (#7)
Re: Character encodings...

On Thu, Apr 13, 2000 at 02:52:13PM +0100, Oliver Elphick wrote:

What level of debug should be sufficient?

2

Also set PGECHO in postmaster.init, so that queries are echoed in the log.

OK. I'll try.

The trouble is, I don't know how to test this. How do I produce Russian
characters on an English keyboard?

I am almost sure that this may fail if it's just a character from the upper
half of 256. In vim: ^V240 :)

--
Mike

#9Michael Sobolev
mss@transas.com
In reply to: Oliver Elphick (#7)
Re: Character encodings...

On Thu, Apr 13, 2000 at 02:52:13PM +0100, Oliver Elphick wrote:

2

Also set PGECHO in postmaster.init, so that queries are echoed in the log.

Here it goes. I would not say it's very useful... Russian a has code 225
(decimal).
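
As an illustration of why the byte value alone is ambiguous, here is a minimal sketch, assuming Python 3 and its standard codecs. The helper `decode_as` and the constant `RAW` are names made up for this example. It decodes byte 225 under three single-byte Cyrillic encodings and shows that only KOI8-R yields the Russian capital A.

```python
# Byte 225 (0xE1) interpreted under several single-byte Cyrillic encodings.
# Codec names are Python's standard ones; RAW and decode_as are illustrative.
RAW = bytes([225])


def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes using the named codec."""
    return raw.decode(encoding)


for name in ("koi8_r", "cp1251", "iso8859_5"):
    char = decode_as(RAW, name)
    # Print the codec, the decoded character, and its Unicode code point.
    print(f"{name:10s} {char!r} U+{ord(char):04X}")
```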

--
Mike

binding ShmemCreate(key=52e2c1, size=2006016)
/usr/lib/postgresql/bin/postmaster: ServerLoop: handling reading 4
/usr/lib/postgresql/bin/postmaster: ServerLoop: handling reading 4
/usr/lib/postgresql/bin/postmaster: ServerLoop: handling writing 4
/usr/lib/postgresql/bin/postmaster: BackendStartup: pid 30613 user mss db stuff socket 4
/usr/lib/postgresql/bin/postmaster child[30613]: starting with (/usr/lib/postgresql/bin/postgres -d2 -B 128 -E -v131072 -p stuff )
FindExec: found "/usr/lib/postgresql/bin/postgres" using argv[0]
debug info:
User = mss
RemoteHost = localhost
RemotePort = 0
DatabaseName = stuff
Verbose = 2
Noversion = f
timings = f
dates = European
bufsize = 128
sortmem = 512
query echo = t
InitPostgres
reset_client_encoding()..
reset_client_encoding() done.
StartTransactionCommand
query: select getdatabaseencoding()
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: SET client_encoding = 'UNICODE'
ProcessUtility: SET client_encoding = 'UNICODE'
CommitTransactionCommand
proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)
/usr/lib/postgresql/bin/postmaster: reaping dead processes...
/usr/lib/postgresql/bin/postmaster: CleanupProc: pid 30613 exited with status 0

#10Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Sobolev (#9)
Re: Character encodings...

On Thu, 13 Apr 2000, Michael Sobolev wrote:

Here it goes. I would not say it's very useful... Russian a has code 225
(decimal).

StartTransactionCommand
query: SET client_encoding = 'UNICODE'
ProcessUtility: SET client_encoding = 'UNICODE'
CommitTransactionCommand
proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)
/usr/lib/postgresql/bin/postmaster: reaping dead processes...
/usr/lib/postgresql/bin/postmaster: CleanupProc: pid 30613 exited with status 0

That looks like the query never got to the backend. This is either a bug
in psql or the multibyte suite. I seem to recall that Unicode isn't fully
supported, so I'd go for the latter. Can Tatsuo comment?

--
Peter Eisentraut Sernanders väg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden

#11Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#10)
Re: Character encodings...

On Thu, 13 Apr 2000, Michael Sobolev wrote:

Here it goes. I would not say it's very useful... Russian a has code 225
(decimal).

StartTransactionCommand
query: SET client_encoding = 'UNICODE'
ProcessUtility: SET client_encoding = 'UNICODE'
CommitTransactionCommand
proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)
/usr/lib/postgresql/bin/postmaster: reaping dead processes...
/usr/lib/postgresql/bin/postmaster: CleanupProc: pid 30613 exited with status 0

That looks like the query never got to the backend. This is either a bug
in psql or the multibyte suite. I seem to recall that Unicode isn't fully
supported, so I'd go for the latter. Can Tatsuo comment?

Oh, he is using the multibyte support and expects an automatic code
conversion between KOI8-R and UNICODE that is not supported yet.

What he needs to do is create a database with encoding KOI8-R or
ISO-8859-5.

# make a KOI8-R database
$ createdb -E KOI8

or

# make an ISO-8859-5 database
$ createdb -E LATIN5

In the latter case, he might want to set the PGCLIENTENCODING environment
variable so that a conversion between KOI8-R and ISO-8859-5 is
performed automatically.

# if you want to use KOI8-R on your client.
$ export PGCLIENTENCODING=KOI8
or
% setenv PGCLIENTENCODING KOI8
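
The kind of byte-level conversion described above can be sketched in Python. This is only an illustration under stated assumptions: the `transcode` helper is hypothetical, it goes through Python's own codec tables, and it is not a description of how the PostgreSQL multibyte code performs the conversion.

```python
def transcode(raw: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another via Unicode."""
    return raw.decode(source).encode(target)


# The client holds text in KOI8-R; the database stores ISO-8859-5.
# Note: Python's "latin5" alias means ISO-8859-9, so the explicit
# codec name "iso8859_5" is used here.
client_bytes = "Привет".encode("koi8_r")
stored_bytes = transcode(client_bytes, "koi8_r", "iso8859_5")

# Converting back recovers the original client bytes for this text.
round_trip = transcode(stored_bytes, "iso8859_5", "koi8_r")
print(client_bytes.hex(), stored_bytes.hex(), round_trip == client_bytes)
```

The same characters end up with different byte values in the two encodings, which is why a conversion step is needed whenever client and database encodings differ.
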
--
Tatsuo Ishii

#12Michael Sobolev
mss@transas.com
In reply to: Tatsuo Ishii (#11)
Re: Character encodings...

On Fri, Apr 14, 2000 at 03:44:09PM +0900, Tatsuo Ishii wrote:

Oh, he is using the multibyte support and expects an automatic code
conversion between KOI8-R and UNICODE that is not supported yet.

Not exactly. If you look at my first message, you will see that the problem
is that the behaviour is not consistent. Sometimes this data gets through,
and sometimes it does not. I'd say that arbitrary text in KOI8-R can hardly
be anything reasonable in UTF-8, so I would expect all (yes, ALL) my requests
to fail (and preferably with correct diagnostics).

# make a KOI8-R database
$ createdb -E KOI8

Thanks. I was looking for something like this in man page, but unfortunately
it does not have this information.

In the latter case, he might want to set the PGCLIENTENCODING environment
variable so that a conversion between KOI8-R and ISO-8859-5 is
performed automatically.

What are the requirements for this to work?

Thanks,

--
Mike

#13Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Michael Sobolev (#12)
Re: Character encodings...

On Fri, Apr 14, 2000 at 03:44:09PM +0900, Tatsuo Ishii wrote:

Oh, he is using the multibyte support and expects an automatic code
conversion between KOI8-R and UNICODE that is not supported yet.

Not exactly. If you look at my first message, you will see that the problem
is that the behaviour is not consistent. Sometimes this data gets through,
and sometimes it does not. I'd say that arbitrary text in KOI8-R can hardly
be anything reasonable in UTF-8, so I would expect all (yes, ALL) my requests
to fail (and preferably with correct diagnostics).

Sorry, I don't understand your point. What I wanted to say was that KOI8-R
and UTF-8 are totally different encodings (except for the ASCII part).
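
A minimal Python sketch of that difference, using the test statement from earlier in the thread. The helper `is_valid_utf8` is a name invented for this example, and the sketch only checks byte sequences; it does not model what psql or the backend do with them.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_stmt = "insert into news values ('2000-04-13', NULL, 'A');"
russian_stmt = "insert into news values ('2000-04-13', NULL, '\u0410');"

# Pure ASCII text has identical bytes in KOI8-R and UTF-8.
ascii_koi8 = ascii_stmt.encode("koi8_r")
print(ascii_koi8 == ascii_stmt.encode("utf-8"), is_valid_utf8(ascii_koi8))

# The Russian letter becomes the single byte 0xE1 in KOI8-R. In UTF-8 that
# byte would start a three-byte sequence, so the statement is not valid UTF-8.
russian_koi8 = russian_stmt.encode("koi8_r")
print(is_valid_utf8(russian_koi8))
```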

# make a KOI8-R database
$ createdb -E KOI8

Thanks. I was looking for something like this in man page, but unfortunately
it does not have this information.

Please look at doc/README.mb.

In the latter case, he might want to set the PGCLIENTENCODING environment
variable so that a conversion between KOI8-R and ISO-8859-5 is
performed automatically.

What are the requirements for this to work?

Please explain your background. If you need KOI8-R only, you can
forget about ISO-8859-5.
--
Tatsuo Ishii