Knowing the length(convert(username using windows_1251_to_utf8))

Started by Alexander Farber, about 19 years ago, 5 messages, general


Alexander Farber

alexander.farber@gmail.com

about 19 years ago

Hello PostgreSQL users!

I have this data stored in WIN1251 encoding, which
is being fetched by a libpq application I'm developing:

phpbb=> show client_encoding;
-----------------
WIN1251
(1 row)

phpbb=> \d phpbb_users;
........
username | character varying(25) | not null default ''::character
........

phpbb=> select username, length(username), length(convert(username
using windows_1251_to_utf8)) from phpbb_users where user_id=224;
username | length | length
-----------------+--------+--------
Лукашенко И. В. | 15 | 26
(1 row)

My problem is that I need the username in the utf8 encoding.
So I use the convert(username using windows_1251_to_utf8)
which works fine except one thing:

Is there please a way to know the length of the utf8 data?
(I'm using a fixed char array in my C program)

I was using char name[25 + 1] initially, but now I see
that it isn't sufficient. Should I use char name[25 * 2 + 1] ?

How do you usually handle such cases?

Thank you for any advice
Alex

--
http://preferans.de

Alexander Farber

alexander.farber@gmail.com

about 19 years ago

In reply to: Alexander Farber (#1)

Re: Knowing the length(convert(username using windows_1251_to_utf8))

And additional question please:

Can I still be sure that the data returned in the
convert(username using windows_1251_to_utf8)
column will be 0-terminated or should I fetch
the data length using PQgetlength and maintain
that value in my C-program?

Thank you
Alex

On 1/11/07, Alexander Farber <alexander.farber@gmail.com> wrote:

phpbb=> show client_encoding;
-----------------
WIN1251
(1 row)

phpbb=> \d phpbb_users;
........
username | character varying(25) | not null default ''::character
........

phpbb=> select username, length(username), length(convert(username
using windows_1251_to_utf8)) from phpbb_users where user_id=224;
username | length | length
-----------------+--------+--------
Лукашенко И. В. | 15 | 26
(1 row)

--
http://preferans.de

Martijn van Oosterhout

kleptog@svana.org

about 19 years ago

In reply to: Alexander Farber (#1)

Re: Knowing the length(convert(username using windows_1251_to_utf8))

On Thu, Jan 11, 2007 at 10:19:38AM +0100, Alexander Farber wrote:

Hello PostgreSQL users!

I have this data stored in WIN1251 encoding, which
is being fetched by a libpq application I'm developing:

<snip>

phpbb=> select username, length(username), length(convert(username
using windows_1251_to_utf8)) from phpbb_users where user_id=224;
username | length | length
-----------------+--------+--------
Лукашенко И. В. | 15 | 26
(1 row)

My problem is that I need the username in the utf8 encoding.
So I use the convert(username using windows_1251_to_utf8)
which works fine except one thing:

If you need the string in UTF-8, why not just set the "client_encoding"
to "utf8"? Then the server will only send you strings in utf8, with no
conversion necessary.

Is there please a way to know the length of the utf8 data?
(I'm using a fixed char array in my C program)

UTF-8 is always variable length, I think up to 4 bytes per character.
Maybe you shouldn't be using a fixed-length array?

How do you usually handle such cases?

Variable length arrays.
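
As a rough illustration of the byte-versus-character difference behind the 15 and 26 in the query output, here is a minimal sketch in plain C. The names MAX_CHARS, NAME_BUF_SIZE and utf8_char_count are made up for this example, and the code assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Worst case for a varchar(25) column delivered as UTF-8: up to
 * 4 bytes per character plus the terminating NUL. MAX_CHARS mirrors
 * the column definition shown in the thread (hypothetical names). */
#define MAX_CHARS 25
#define NAME_BUF_SIZE (MAX_CHARS * 4 + 1)

/* Count characters (code points) in a NUL-terminated UTF-8 string by
 * skipping continuation bytes, which always have the form 10xxxxxx.
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the username in the query above, strlen() reports 26 bytes while utf8_char_count() reports 15 characters, which matches the two length columns: each Cyrillic letter takes 2 bytes, and the spaces and dots take 1 byte each.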

In your next email you ask:

Can I still be sure that the data returned in the
convert(username using windows_1251_to_utf8)
column will be 0-terminated or should I fetch
the data length using PQgetlength and maintain
that value in my C-program?

On the client end (as long as you're not doing binary transfers) the
strings are always null terminated.
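
One possible way to avoid a fixed array altogether is to allocate exactly as many bytes as the value needs. The sketch below is plain C with no libpq dependency: dup_value is a hypothetical helper, and its val and len parameters stand in for what PQgetvalue() and PQgetlength() would return for a text-format column.

```c
#include <stdlib.h>
#include <string.h>

/* Copy a result value into a freshly allocated buffer of exactly the
 * right size. `val` and `len` stand in for the results of PQgetvalue()
 * and PQgetlength(); `len` is the byte length without the terminating
 * NUL. Returns NULL on bad input or allocation failure; the caller
 * must free() the result. (Hypothetical helper, not part of libpq.) */
char *dup_value(const char *val, int len)
{
    char *copy;

    if (val == NULL || len < 0)
        return NULL;
    copy = malloc((size_t)len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, val, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

Because the copy is sized from the byte length, it works the same for any client encoding.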

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

Alexander Farber

alexander.farber@gmail.com

about 19 years ago

In reply to: Martijn van Oosterhout (#3)

Re: Knowing the length(convert(username using windows_1251_to_utf8))

Hi Martijn,

On 1/11/07, Martijn van Oosterhout <kleptog@svana.org> wrote:

If you need the string in UTF-8, why not just set the "client_encoding"
to "utf8"? Then the server will only send you strings in utf8, with no
conversion necessary.

actually you are right, because I need all my data in UTF8 anyway
(for a web flash client). So I've followed your advice and added:

PQsetClientEncoding(conn, "UTF8")

and now my program works the same, but without that convert().

Is there please a way to know the length of the utf8 data?
(I'm using a fixed char array in my C program)

UTF-8 is always variable length, I think up to 4 bytes per character.
Maybe you shouldn't be using a fixed-length array?

Ok I'll go for the 4 times bigger fixed array for now,
because I'd like to keep my webchat-like app quick.
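
If a fixed array is kept, one thing to watch is truncation in the middle of a multi-byte character. A minimal sketch of a bounded copy that only cuts at character boundaries might look like this; utf8_copy is a made-up name, and the code assumes the source string is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Copy src into dst (capacity dst_size bytes, including the NUL),
 * truncating if necessary but never in the middle of a multi-byte
 * UTF-8 sequence. Returns the number of bytes copied, excluding the
 * NUL. Assumes src is valid UTF-8. (Hypothetical helper.) */
size_t utf8_copy(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    if (dst_size == 0)
        return 0;
    len = strlen(src);
    if (len >= dst_size) {
        len = dst_size - 1;
        /* If the first dropped byte is a continuation byte
         * (10xxxxxx), the cut is inside a character: back up. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With a destination declared as char name[25 * 4 + 1], a varchar(25) value should always fit, so the truncation branch only matters if the column is widened later.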

In your next email you ask:

Can I still be sure that the data returned in the
convert(username using windows_1251_to_utf8)
column will be 0-terminated or should I fetch
the data length using PQgetlength and maintain
that value in my C-program?

On the client end (as long as you're not doing binary transfers) the
strings are always null terminated.

May I ask you an off-topic question? I've read several
docs on Unicode, but they are difficult to understand.

Do you think that a UTF8 string will ever have a 0 byte
inside of it? Or is it safe to continue using strlen/strlcpy/strcmp
on the UTF8 values I'll be fetching from my database?

Regards
Alex

PS: Using postgresql-server-8.1.4 on OpenBSD 4.0-stable

--
http://preferans.de

Martijn van Oosterhout

kleptog@svana.org

about 19 years ago

In reply to: Alexander Farber (#4)

Re: Knowing the length(convert(username using windows_1251_to_utf8))

On Thu, Jan 11, 2007 at 12:37:32PM +0100, Alexander Farber wrote:

May I ask you an off-topic question? I've read several
docs on Unicode, but they are difficult to understand.

Have you read the Unicode FAQ?

http://www.cl.cam.ac.uk/~mgk25/unicode.html

Do you think that a UTF8 string will ever have a 0 byte
inside of it? Or is it safe to continue using strlen/strlcpy/strcmp
on the UTF8 values I'll be fetching from my database?

The answers to your questions are no and yes, respectively. See the FAQ.
That is also one of the reasons why Linux/Unix went for utf-8, because
it required minimal changes to programs (and in particular, the C
library).
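
To see why no 0 byte can appear inside a character, it may help to look at how a code point is turned into bytes. The following is a sketch of a standard UTF-8 encoder in plain C; utf8_encode is a name chosen for this example. Every byte of a multi-byte sequence has its high bit set, so only the code point U+0000 itself produces a zero byte.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for at least
 * 4 bytes). Returns the number of bytes written, or 0 if cp is a
 * surrogate or beyond U+10FFFF. (Hypothetical helper.) */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Since lead bytes are at least 0xC0 and continuation bytes are at least 0x80, byte-oriented functions such as strlen and strcmp never mistake part of a character for the terminator.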

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.