Knowing the length(convert(username using windows_1251_to_utf8))

Started by Alexander Farber, about 19 years ago, 5 messages, general
#1 Alexander Farber
alexander.farber@gmail.com

Hello PostgreSQL users!

I have this data stored in WIN1251 encoding, which
is being fetched by a libpq application I'm developing:

phpbb=> show client_encoding;
-----------------
WIN1251
(1 row)

phpbb=> \d phpbb_users;
........
username | character varying(25) | not null default ''::character
........

phpbb=> select username, length(username), length(convert(username
using windows_1251_to_utf8)) from phpbb_users where user_id=224;
username | length | length
-----------------+--------+--------
Лукашенко И. В. | 15 | 26
(1 row)

My problem is that I need the username in the utf8 encoding.
So I use the convert(username using windows_1251_to_utf8)
which works fine except one thing:

Is there please a way to know the length of the utf8 data?
(I'm using a fixed char array in my C program)

I was using char name[25 + 1] initially, but now I see
that it isn't sufficient. Should I use char name[25 * 2 + 1] ?

How do you usually handle such cases?

Thank you for any advice
Alex

--
http://preferans.de

#2 Alexander Farber
alexander.farber@gmail.com
In reply to: Alexander Farber (#1)
Re: Knowing the length(convert(username using windows_1251_to_utf8))

And additional question please:

Can I still be sure that the data returned in the
convert(username using windows_1251_to_utf8)
column will be 0-terminated or should I fetch
the data length using PQgetlength and maintain
that value in my C-program?

Thank you
Alex

On 1/11/07, Alexander Farber <alexander.farber@gmail.com> wrote:

phpbb=> show client_encoding;
-----------------
WIN1251
(1 row)

phpbb=> \d phpbb_users;
........
username | character varying(25) | not null default ''::character
........

phpbb=> select username, length(username), length(convert(username
using windows_1251_to_utf8)) from phpbb_users where user_id=224;
username | length | length
-----------------+--------+--------
Лукашенко И. В. | 15 | 26
(1 row)

--
http://preferans.de

#3 Martijn van Oosterhout
kleptog@svana.org
In reply to: Alexander Farber (#1)
Re: Knowing the length(convert(username using windows_1251_to_utf8))

On Thu, Jan 11, 2007 at 10:19:38AM +0100, Alexander Farber wrote:

Hello PostgreSQL users!

I have this data stored in WIN1251 encoding, which
is being fetched by a libpq application I'm developing:

<snip>

phpbb=> select username, length(username), length(convert(username
using windows_1251_to_utf8)) from phpbb_users where user_id=224;
username | length | length
-----------------+--------+--------
Лукашенко И. В. | 15 | 26
(1 row)

My problem is that I need the username in the utf8 encoding.
So I use the convert(username using windows_1251_to_utf8)
which works fine except one thing:

If you need the string in UTF-8, why not just set the "client_encoding"
to "utf8"? Then the server will only send you strings in utf8, and no
conversion is necessary.

Is there please a way to know the length of the utf8 data?
(I'm using a fixed char array in my C program)

UTF-8 is always variable length, I think up to 4 bytes per character.
Maybe you shouldn't be using a fixed-length array?

How do you usually handle such cases?

Variable length arrays.

In your next email you ask:

Can I still be sure that the data returned in the
convert(username using windows_1251_to_utf8)
column will be 0-terminated or should I fetch
the data length using PQgetlength and maintain
that value in my C-program?

On the client end (as long as you're not doing binary transfers) the
strings are always null terminated.
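
A minimal sketch of the variable-length approach, combined with the point about termination. The helper name dup_field is hypothetical and is not part of libpq; in a real program the value and len arguments would come from PQgetvalue() and PQgetlength() for the same row and column of a text-format result. It allocates exactly as many bytes as the field needs, so no fixed array size has to be guessed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a libpq function): copy a text-format field
 * into a heap buffer sized to the data. In a libpq program, value and
 * len would be taken from PQgetvalue() and PQgetlength(). */
char *dup_field(const char *value, size_t len)
{
    assert(value != NULL);
    char *copy = malloc(len + 1);   /* +1 for the terminating 0 byte */
    if (copy == NULL)
        return NULL;                /* caller must handle allocation failure */
    memcpy(copy, value, len);
    copy[len] = '\0';               /* terminate explicitly, whatever the source */
    return copy;                    /* caller releases it with free() */
}
```

The length is a byte count, not a character count, which is what malloc and memcpy need.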

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#4 Alexander Farber
alexander.farber@gmail.com
In reply to: Martijn van Oosterhout (#3)
Re: Knowing the length(convert(username using windows_1251_to_utf8))

Hi Martijn,

On 1/11/07, Martijn van Oosterhout <kleptog@svana.org> wrote:

If you need the string in UTF-8, why not just set the "client_encoding"
to "utf8"? Then the server will only send you strings in utf8, and no
conversion is necessary.

actually you are right, because I need all my data in UTF8 anyway
(for a web flash client). So I've followed your advice and added:

PQsetClientEncoding(conn, "UTF8")

and now my program works the same, but without that convert().

Is there please a way to know the length of the utf8 data?
(I'm using a fixed char array in my C program)

UTF-8 is always variable length, I think up to 4 bytes per character.
Maybe you shouldn't be using a fixed-length array?

Ok I'll go for the 4 times bigger fixed array for now,
because I'd like to keep my webchat-like app quick.
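
A rough sketch of what the fixed-array route could look like, assuming the worst case of 4 bytes for each of the 25 characters allowed by varchar(25). The names NAME_MAX_CHARS, NAME_BUF_SIZE and utf8_copy_trunc are made up for illustration. If a value ever exceeds the buffer, the helper cuts at a character boundary instead of in the middle of a multi-byte sequence; it assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed worst case: 25 characters at up to 4 bytes each, plus the 0 byte. */
#define NAME_MAX_CHARS 25
#define NAME_BUF_SIZE (NAME_MAX_CHARS * 4 + 1)

/* Hypothetical helper: copy src into dst (capacity dstsize bytes). If src
 * does not fit, truncate at a character boundary. Returns the number of
 * bytes copied, not counting the terminator. Assumes valid UTF-8 input. */
size_t utf8_copy_trunc(char *dst, size_t dstsize, const char *src)
{
    assert(dst != NULL && src != NULL && dstsize > 0);
    size_t n = strlen(src);
    if (n >= dstsize) {
        n = dstsize - 1;
        /* Back up while the first dropped byte is a continuation byte
         * (bit pattern 10xxxxxx), so no sequence is split. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With char name[NAME_BUF_SIZE], a call such as utf8_copy_trunc(name, sizeof name, value) would replace a plain strlcpy.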

In your next email you ask:

Can I still be sure that the data returned in the
convert(username using windows_1251_to_utf8)
column will be 0-terminated or should I fetch
the data length using PQgetlength and maintain
that value in my C-program?

On the client end (as long as you're not doing binary transfers) the
strings are always null terminated.

May I ask you an off-topic question? I've read several
docs on Unicode, but they are difficult to understand.

Do you think that a UTF8 string will ever have a 0 byte
inside of it? Or is it safe to continue using strlen/strlcpy/strcmp
on the UTF8 values I'll be fetching from my database?

Regards
Alex

PS: Using postgresql-server-8.1.4 on OpenBSD 4.0-stable

--
http://preferans.de

#5 Martijn van Oosterhout
kleptog@svana.org
In reply to: Alexander Farber (#4)
Re: Knowing the length(convert(username using windows_1251_to_utf8))

On Thu, Jan 11, 2007 at 12:37:32PM +0100, Alexander Farber wrote:

May I ask you an off-topic question? I've read several
docs on Unicode, but they are difficult to understand.

Have you read the Unicode FAQ?

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Do you think that a UTF8 string will ever have a 0 byte
inside of it? Or is it safe to continue using strlen/strlcpy/strcmp
on the UTF8 values I'll be fetching from my database?

The answers to your questions are no and yes, respectively. See the FAQ.
That is also one of the reasons why Linux/Unix went for utf-8, because
it required minimal changes to programs (and in particular, the C
library).
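
To illustrate: strlen keeps working on UTF-8 because every byte of a multi-byte character has its high bit set, so a 0 byte can only be the terminator. What strlen returns is bytes, though, not characters. A small sketch of the difference follows; utf8_char_count and is_ascii_only are hypothetical helpers, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count characters (code points) in a 0-terminated
 * UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Byte length and character count agree only when every character
 * occupies a single byte, i.e. the string is plain ASCII. */
int is_ascii_only(const char *s)
{
    return utf8_char_count(s) == strlen(s);
}
```

For the username from the first message, strlen gives 26 and utf8_char_count gives 15, the same two numbers that length() reported in psql.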

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.