ORDER BY different locales

Started by Karel Zakalmost 22 years ago5 messages
#1Karel Zak
zakkr@zf.jcu.cz

Hi,

a lot of people sometimes need order same data in same DB by more
different locales. For example multi-language web application with DB
in UTF-8. It's problem in PostgreSQL, because PostgreSQL require set
LC_COLLATE by initdb.

I think possible solution is special function used ORDER BY clause
which knows to switch by safe way to wanted locales, convert string by
strxfrm() and switch back to backend locales.

Is this function interesting for PostgreSQL contrib or main tree? I
think it's very useful for a lot of users. I can prepare a patch.

Note, the original idea and patch is from
Honza Pazdziora <adelton@informatics.muni.cz>.

For example, the Czech alphabet has between 'h' and 'i' letter 'ch':

# SHOW LC_COLLATE;
lc_collate
------------
C

# SELECT data FROM str ORDER BY nls_string(data,'en_US');
data
-------
aaaa
cccc
chccc
dddd
hhhh
iiii
zzzz

# SELECT data FROM str ORDER BY nls_string(data,'cs_CZ');
data
-------
aaaa
cccc
dddd
hhhh
chccc
iiii
zzzz

The function returns result encoded in unsigned octal:

# SELECT nls_string('pg','en_US');
nls_string
--------------------------
033022001010010001002002

Source:

static char *lc_collate_cache = NULL;

PG_FUNCTION_INFO_V1(nls_string);

Datum
nls_string(PG_FUNCTION_ARGS)
{
text *locale = PG_GETARG_TEXT_P(1);
char *locale_str;
int locale_len;

text *txt = PG_GETARG_TEXT_P(0);
char *txt_str;
int txt_len;
text *txt_out;
char *txt_tmp;
size_t size = 0;
size_t rest = 0;
int i;

if ((VARSIZE(locale) - VARHDRSZ) <= 0 || (VARSIZE(txt) - VARHDRSZ) <= 0)
PG_RETURN_NULL();

/*
* Save original locale setting
*/
if (!lc_collate_cache)
{
if ((lc_collate_cache = setlocale(LC_COLLATE, NULL)))
/* cached independent on PostgreSQL mmgr */
lc_collate_cache = strdup(lc_collate_cache);
}
if (!lc_collate_cache)
elog(ERROR, "invalid system LC_COLLATE setting");

/*
* Conversion to standard strings
*/
locale_len = VARSIZE(locale) - VARHDRSZ;
locale_str = palloc(locale_len + 1);
memcpy(locale_str, VARDATA(locale), locale_len);
*(locale_str + locale_len) = '\0';

txt_len = VARSIZE(txt) - VARHDRSZ;
txt_str = palloc(txt_len + 1);
memcpy(txt_str, VARDATA(txt), txt_len);
*(txt_str + txt_len) = '\0';

/*
* Set wanted locale
*/
if (!setlocale(LC_COLLATE, locale_str))
{
setlocale(LC_COLLATE, lc_collate_cache); /* paranoid? */
elog(ERROR, "invalid LC_COLLATE setting: %s", locale_str);
}

pfree(locale_str);

/*
* Text transformation
*/
size = txt_len * 2;
txt_tmp = palloc(size);
memset(txt_tmp, 0, size);

rest = strxfrm(txt_tmp, txt_str, size) + 1;
if (rest >= size)
{
pfree(txt_tmp);
txt_tmp = palloc(rest);
memset(txt_tmp, 0, rest);
rest = strxfrm(txt_tmp, txt_str, rest);
}

/*
* Transformation to unsigned octal
*/
txt_out = (text *) palloc(3 * rest + VARHDRSZ);
memset(txt_out, 0, 3 * rest + VARHDRSZ);

for (i = 0; i < rest; i++)
{
sprintf(VARDATA(txt_out) + 3 * i, "%03o",
(int)(unsigned char)*(txt_tmp + i));
}
pfree(txt_tmp);

VARATT_SIZEP(txt_out) = 3 * rest + VARHDRSZ;

/*
* Set original locale
*/
if (!setlocale(LC_COLLATE, lc_collate_cache))
elog(ERROR, "invalid LC_COLLATE setting: %s", lc_collate_cache);

PG_RETURN_TEXT_P(txt_out);
}

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Karel Zak (#1)
Re: ORDER BY different locales

Karel Zak <zakkr@zf.jcu.cz> writes:

I think possible solution is special function used ORDER BY clause
which knows to switch by safe way to wanted locales, convert string by
strxfrm() and switch back to backend locales.

This function breaks the whole backend if an elog() failure occurs while
it's got the wrong locale set. I believe it would also be remarkably
slow --- doesn't setlocale() involve reading a new locale definition
file from whereever those are stored?

I think the ultimate solution to our multi-locale problems will have to
involve abandoning the C library's support functions and writing locale
support that allows multiple locale-defining structures referenced by
pointers. It's a big task though :-(. Peter was looking at it awhile
back but I don't know how far he's gotten.

regards, tom lane

#3Karel Zak
zakkr@zf.jcu.cz
In reply to: Tom Lane (#2)
Re: ORDER BY different locales

On Thu, Feb 26, 2004 at 09:16:03AM -0500, Tom Lane wrote:

Karel Zak <zakkr@zf.jcu.cz> writes:

I think possible solution is special function used ORDER BY clause
which knows to switch by safe way to wanted locales, convert string by
strxfrm() and switch back to backend locales.

This function breaks the whole backend if an elog() failure occurs while

I don't think so. There is setlocale() to original locales before
elog(). But important is idea of this function. We can rewrite it to
fix some minor problems...

it's got the wrong locale set. I believe it would also be remarkably
slow --- doesn't setlocale() involve reading a new locale definition
file from whereever those are stored?

Yes, speed can be problem. I will test it. But I hope libc read locales
one time only. The common usage is with SELECT where you apply same
locales to all lines of result.

I think the ultimate solution to our multi-locale problems will have to
involve abandoning the C library's support functions and writing locale

Yes, but I think nls_string() is nice solution for now. Butter than say
"no way"... :-)

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

#4Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#2)
Re: ORDER BY different locales

Tom Lane <tgl@sss.pgh.pa.us> writes:

This function breaks the whole backend if an elog() failure occurs while
it's got the wrong locale set. I believe it would also be remarkably
slow --- doesn't setlocale() involve reading a new locale definition
file from whereever those are stored?

I posted a similar function a while back using strxfrm and someone else
refined to eliminate a similar problem with elog(). I tested the speed under
glibc and it blazingly fast. I think the time difference was hardly even
noticeable over any other string function.

Certainly I expect there would be some platforms that would perform poorly,
but then there are lots of things various platforms do poorly. I don't think
that means postgres should reimplement everything itself for consistency. That
way lies Oracle.

--
greg

#5Karel Zak
zakkr@zf.jcu.cz
In reply to: Tom Lane (#2)
Re: [HACKERS] ORDER BY different locales

On Thu, Feb 26, 2004 at 09:16:03AM -0500, Tom Lane wrote:

Karel Zak <zakkr@zf.jcu.cz> writes:

I think possible solution is special function used ORDER BY clause
which knows to switch by safe way to wanted locales, convert string by
strxfrm() and switch back to backend locales.

This function breaks the whole backend if an elog() failure occurs while

Fixed by sigsetjmp(Warn_restart..). I hope it's more safe now.

it's got the wrong locale set. I believe it would also be remarkably
slow --- doesn't setlocale() involve reading a new locale definition

You're right, it's slow. But sometimes is more important that it works
and not all queries work with thousands records like my test below.

I think the ultimate solution to our multi-locale problems will have to
involve abandoning the C library's support functions and writing locale
support that allows multiple locale-defining structures referenced by

Agree. But as you said it's huge task and I think if it won't implement
in 7.5 we can add nls_string() to the contrib tree. BTW, nls_string()
is "product" of Czech database list where Oracle users have still
problems with PostgreSQL ;-)

Latest version:
ftp://ftp2.zf.jcu.cz/users/zakkr/pg/postgresql-nls-string-0.52.tar.gz

Note, I add "CC:" to pgsql-general, maybe it's interesting for some
normal users too.

Tests:

# SELECT count(*) FROM nlstest;
count
--------
100000

# SELECT data FROM nlstest ORDER BY upper(data) DESC LIMIT 1;
Time: 1213.87 ms

# SELECT data FROM nlstest ORDER BY nls_string(data, 'en_US') LIMIT 1;
Time: 4269.00 ms

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/