UNICODE

Started by Per Aronsson, over 24 years ago, 5 messages, general
#1 Per Aronsson
per@mobilehits.com

Hi,

To enable localization of our new platform, we thought that saving all
character strings as UNICODE would be a good idea. Even if the front-end
(PHP) doesn't fully support UNICODE yet, we figured it's still good to have
the database in that format, for the future. We have not installed
mb_string.

We have created a UNICODE database and started experimenting with it
(PostgreSQL)
./configure --enable-multibyte
createdb -E UNICODE me-e

My question is: do you need to convert strings to UTF-8 before adding them
to the database, or is that done "automatically"?

Best regards,
Per Aronsson

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Per Aronsson (#1)
Re: UNICODE

PostgreSQL 7.1 can do the conversion on the backend side. However, you
need to build it with the additional configure option
"--enable-unicode-conversion". You also need to tell the backend which
encoding your application uses. In PHP4 you can use the
pg_set_client_encoding function for this.
If your PHP installation does not have it, you can issue an SQL statement:

set client_encoding to 'encoding_name_in_your_PHP_application';
--
Tatsuo Ishii
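If the backend conversion is not available, the application has to hand over UTF-8 itself. As a rough illustration of what such a conversion involves, here is a minimal sketch for one simple case. It assumes the application's strings are ISO-8859-1 (Latin-1), which the thread does not state; the function name latin1_to_utf8 is made up for this example.

```c
#include <stddef.h>

/* Convert a NUL-terminated ISO-8859-1 (Latin-1) string to UTF-8.
 * Assumption: the input really is Latin-1; other encodings need tables.
 * Writes at most outsize bytes, including the terminating NUL.
 * Returns the UTF-8 length in bytes, or (size_t)-1 if out is too small. */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    for (; *in; in++) {
        if (*in < 0x80) {
            /* ASCII is encoded as itself */
            if (n + 1 >= outsize)
                return (size_t)-1;
            out[n++] = *in;
        } else {
            /* 0x80..0xFF become a two-byte sequence: 110xxxxx 10xxxxxx */
            if (n + 2 >= outsize)
                return (size_t)-1;
            out[n++] = (unsigned char)(0xC0 | (*in >> 6));
            out[n++] = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    if (n >= outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

With client_encoding set correctly, the backend performs this kind of mapping for you and the application can keep sending its native encoding.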

#3 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#2)
Re: UNICODE

I'm also trying to write some Chinese data to a PostgreSQL database. I'm
getting gibberish after it's written to the database.

I recognize the problem is at the http request.

More details of what you found, please.

How do I retrieve double
byte characters through http request using C/C++? And how do I write it to the
database? And how do I tell it what kind of encoding to use?

--
Tatsuo Ishii

#4 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#2)
Re: UNICODE

I'm also trying to write some Chinese data to a PostgreSQL database. I'm
getting gibberish after it's written to the database.

I recognize the problem is at the http request. How do I retrieve double
byte characters through http request using C/C++? And how do I write it to the
database?

Nothing special. Just read/write one by one.

And how do I tell it what kind of encoding to use?

set client_encoding.
--
Tatsuo Ishii
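One way to read "one by one" is byte by byte: the request body is copied as-is, with no character-set logic at this stage. A minimal sketch under that reading follows. The helper name read_body is hypothetical, and it assumes the caller already knows the body length (in a CGI program, from CONTENT_LENGTH).

```c
#include <stdio.h>
#include <stdlib.h>

/* Read exactly len bytes from in into a newly allocated, NUL-terminated
 * buffer. Returns NULL on allocation failure or if the input ends early.
 * The caller frees the result. */
char *read_body(FILE *in, size_t len)
{
    char *buf = malloc(len + 1);
    size_t i;
    int c;

    if (buf == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        c = getc(in);           /* one byte at a time, copied unchanged */
        if (c == EOF) {
            free(buf);
            return NULL;
        }
        buf[i] = (char)c;
    }
    buf[len] = '\0';
    return buf;
}
```

In a CGI program this would be called as read_body(stdin, length) after checking that getenv("CONTENT_LENGTH") is not NULL. The bytes only acquire meaning as characters once the backend is told the encoding via client_encoding.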

#5 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#4)
Re: UNICODE

Could you please not send me personal mail?
Let's share information with everyone on the mailing list.
Anyway...

I've tried that. Still not writing the Chinese characters correctly.

I don't know which Chinese character set you are using, but at
least your code will not work if it is Big5, since the second byte
of a Big5 character can be a value in the ASCII range.
To learn more about character sets, see
ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf
for example.
--
Tatsuo Ishii
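To make the Big5 point concrete: the two-byte Big5 character 0xA5 0x5C ends in the same byte as an ASCII backslash, so a plain byte search finds a "backslash" in the middle of a character. Below is a minimal sketch of a search that steps over two-byte characters instead. The names is_big5_lead and big5_find are made up for this example, and the lead-byte range 0x81..0xFE is an assumption that covers common Big5 variants.

```c
#include <stddef.h>

/* Assumption: any byte in 0x81..0xFE starts a two-byte character. */
static int is_big5_lead(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Return the index of the first single-byte occurrence of stop in the
 * NUL-terminated Big5 string s, or -1 if there is none.
 * Second bytes of two-byte characters are never matched. */
long big5_find(const unsigned char *s, unsigned char stop)
{
    size_t i = 0;

    while (s[i] != '\0') {
        if (is_big5_lead(s[i]) && s[i + 1] != '\0') {
            i += 2;             /* skip lead byte and trail byte together */
            continue;
        }
        if (s[i] == stop)
            return (long)i;
        i++;
    }
    return -1;
}
```

Any routine that scans for a delimiter or a space one byte at a time, as splitword does, would need the same kind of stepping to be safe on Big5 input.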


Here is the code:

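contentTypeFromPost = getenv("CONTENT_TYPE");
contentTypeLength = getenv("CONTENT_LENGTH");
/* getenv returns NULL when the variable is not set */
icontentLength = contentTypeLength ? atoi(contentTypeLength) : 0;

if((queryString = malloc(icontentLength + 1)) == NULL)
{
    postMessage("Cannot allocate memory", 0);
    return(0);
}
/* read the POST body from stdin; without this the buffer is uninitialized */
icontentLength = (int)fread(queryString, 1, icontentLength, stdin);
queryString[icontentLength] = '\0';

for(i=0; *queryString; i++)
{
    splitword(items.Item, queryString, '&');
    unescape_url(items.Item);
    splitword(items.name, items.Item, '=');

    // items.Item contains double byte characters
    // However, when write to database I get unrecognizable data
}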

void splitword(uchar *out, uchar *in, uchar stop)
{
    int i, j;

    while(*in == ' ') in++; /* skip past any spaces */

    for(i = 0; in[i] && (in[i] != stop); i++)
        out[i] = in[i];

    out[i] = '\0'; /* terminate it */
    if(in[i]) ++i; /* position past the stop */

    while(in[i] == ' ') i++; /* skip past any spaces */

    /* shift the rest of in down, stopping once the terminator is copied
       so that in[i] is never read past the end of the string */
    for(j = 0; (in[j] = in[i]) != '\0'; i++, j++)
        ;
}

uchar x2c(uchar *x)
{
    register uchar c;

    /* note: (x & 0xdf) makes x upper case */
    c = (x[0] >= 'A' ? ((x[0] & 0xdf) - 'A') + 10 : (x[0] - '0'));
    c *= 16;
    c += (x[1] >= 'A' ? ((x[1] & 0xdf) - 'A') + 10 : (x[1] - '0'));
    return(c);
}

void unescape_url(uchar *url)
{
    register int i, j;

    for(i = 0, j = 0; url[j]; ++i, ++j)
    {
        if((url[i] = url[j]) == '%')
        {
            url[i] = x2c(&url[j + 1]);
            j += 2;
        }
        else if (url[i] == '+')
            url[i] = ' ';
    }
    url[i] = '\0'; /* terminate it at the new length */
}
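One detail in the code above: the item is unescaped before it is split on '=', so an encoded "%3D" inside a name or value would be treated as a separator, and a "%" at the very end of the input makes x2c read past the string. The following is a sketch of one possible alternative, not a drop-in replacement: split on the raw '=' first, then decode each half, rejecting malformed escapes. The names hexval, decode_component and parse_pair are made up for this example.

```c
#include <ctype.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hexval(unsigned char c)
{
    if (isdigit(c))
        return c - '0';
    c = (unsigned char)toupper(c);
    return (c >= 'A' && c <= 'F') ? c - 'A' + 10 : -1;
}

/* Decode %XX escapes and '+' in place.
 * Returns 0 on success, -1 on a truncated or non-hex escape. */
int decode_component(char *s)
{
    char *out = s;

    for (; *s; s++) {
        if (*s == '%') {
            int hi = hexval((unsigned char)s[1]);
            int lo = hi < 0 ? -1 : hexval((unsigned char)s[2]);
            if (lo < 0)
                return -1;
            *out++ = (char)(hi * 16 + lo);
            s += 2;
        } else {
            *out++ = (*s == '+') ? ' ' : *s;
        }
    }
    *out = '\0';
    return 0;
}

/* Split "name=value" at the first raw '=', then decode both halves.
 * pair is modified in place and ends up holding the decoded name;
 * *value is set to point at the decoded value inside the same buffer. */
int parse_pair(char *pair, char **value)
{
    char *eq = strchr(pair, '=');

    if (eq == NULL)
        return -1;
    *eq = '\0';
    *value = eq + 1;
    if (decode_component(pair) < 0)
        return -1;
    return decode_component(*value);
}
```

The decoded bytes are still in whatever encoding the browser submitted, so this does not replace setting client_encoding; it only keeps the splitting step from corrupting the data.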

-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: Sunday, October 28, 2001 4:57 PM
To: jklcom@mindspring.com
Cc: pgsql-general@postgresql.org
Subject: RE: [GENERAL] UNICODE

I'm also trying to write some Chinese data to a PostgreSQL database. I'm
getting gibberish after it's written to the database.

I recognize the problem is at the http request. How do I retrieve double
byte characters through http request using C/C++? And how do I write it to the
database?

Nothing special. Just read/write one by one.

And how do I tell it what kind of encoding to use?

set client_encoding.
--
Tatsuo Ishii