Re: Collation rules and multi-lingual databases

Started by Greg Starkover 22 years ago14 messages
#1Greg Stark
gsstark@mit.edu
1 attachment(s)

So, I needed a way to sort using collation rules other than the one the
database was built with. So I wrote up the following function exposing strxfrm
with an extra parameter to specify the LC_COLLATE value to use.

This is my first C function so I'm really unsure that I've done the right
thing. For the most part I pattern-matched off the string_io code in the
contrib directory.

In particular I'm unsure about the code postgres-interfacing code in
c_varcharxfrm which makes an extra copy of both parameters that are passed in
and an extra copy of the result value. Are varchars guaranteed to be
nul-terminated? If so I can dispose of two of the copies. And I can probably
eliminate the copying of the result by alloting extra space when I allocate it
initially.

But more generally. Would it make more sense to use text or bytea or something
else to store these opaque binary strings? At least with glibc they tend to be
unreadable anyways.

Other caveats: It's condemned to be permanently non-threadsafe because the
whole locale system is a non thread-safe API. Also I fear some systems will
leak memory like a sieve when you call setlocale a few thousand times instead
of the 1 time at initialization that they foresaw. At least glibc doesn't seem
to leak in my brief testing.

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?

Attachments:

strxfrm.ctext/x-csrcDownload
#2Peter Eisentraut
peter_e@gmx.net
In reply to: Greg Stark (#1)

Greg Stark writes:

This is my first C function so I'm really unsure that I've done the right
thing. For the most part I pattern-matched off the string_io code in the
contrib directory.

That was just about the worst example you could have picked. Please
forget everything you have seen and start by reading the documentation.
In particular, learn about the version 1 call convention and about the
PostgreSQL license. And read some code under src/backend/utils/adt.

In particular I'm unsure about the code postgres-interfacing code in
c_varcharxfrm which makes an extra copy of both parameters that are passed in
and an extra copy of the result value. Are varchars guaranteed to be
nul-terminated?

They are guaranteed not to be null-terminated.

But more generally. Would it make more sense to use text or bytea or something
else to store these opaque binary strings? At least with glibc they tend to be
unreadable anyways.

bytea

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?

I'd expect it to be too slow to be useful. Have you run performance tests?

--
Peter Eisentraut peter_e@gmx.net

#3Stephan Szabo
sszabo@megazone.bigpanda.com
In reply to: Greg Stark (#1)

On 22 Aug 2003, Greg Stark wrote:

So, I needed a way to sort using collation rules other than the one the
database was built with. So I wrote up the following function exposing strxfrm
with an extra parameter to specify the LC_COLLATE value to use.

This is my first C function so I'm really unsure that I've done the right
thing. For the most part I pattern-matched off the string_io code in the
contrib directory.

In particular I'm unsure about the code postgres-interfacing code in
c_varcharxfrm which makes an extra copy of both parameters that are passed in
and an extra copy of the result value. Are varchars guaranteed to be
nul-terminated? If so I can dispose of two of the copies. And I can probably
eliminate the copying of the result by alloting extra space when I allocate it
initially.

But more generally. Would it make more sense to use text or bytea or something
else to store these opaque binary strings? At least with glibc they tend to be
unreadable anyways.

Other caveats: It's condemned to be permanently non-threadsafe because the
whole locale system is a non thread-safe API. Also I fear some systems will
leak memory like a sieve when you call setlocale a few thousand times instead
of the 1 time at initialization that they foresaw. At least glibc doesn't seem
to leak in my brief testing.

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?

I'm not sure that ERROR if the locale cannot be put back is sufficient
(although that case should be rare or non-existant). Unless something else
in the system resets the locale, after your transaction rolls back, you're
in a dangerous state. I'd think FATAL would be better.

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephan Szabo (#3)

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

On 22 Aug 2003, Greg Stark wrote:

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?

I'm not sure that ERROR if the locale cannot be put back is sufficient
(although that case should be rare or non-existant).

A bigger risk is that something might elog(ERROR) while you have the
"wrong" locale set, denying you the chance to put back the right one.
I think this code is not nearly paranoid enough about how much it does
while the wrong locale is set.

Unless something else
in the system resets the locale, after your transaction rolls back, you're
in a dangerous state. I'd think FATAL would be better.

I'd go so far as to make it a critical section --- that ensures that any
ERROR will be turned to FATAL, even if it's in a subroutine you call.

regards, tom lane

#5Stephan Szabo
sszabo@megazone.bigpanda.com
In reply to: Tom Lane (#4)

On Fri, 22 Aug 2003, Tom Lane wrote:

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

On 22 Aug 2003, Greg Stark wrote:

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?

I'm not sure that ERROR if the locale cannot be put back is sufficient
(although that case should be rare or non-existant).

A bigger risk is that something might elog(ERROR) while you have the
"wrong" locale set, denying you the chance to put back the right one.
I think this code is not nearly paranoid enough about how much it does
while the wrong locale is set.

True, there are calls to palloc, elog, etc inside there, although the elog
could be removed.

Unless something else
in the system resets the locale, after your transaction rolls back, you're
in a dangerous state. I'd think FATAL would be better.

I'd go so far as to make it a critical section --- that ensures that any
ERROR will be turned to FATAL, even if it's in a subroutine you call.

I didn't know we could do that, could be handy, although the comments
imply that it turns into PANIC which would force a complete restart. Then
again, it's better than a corrupted database.

#6Stephan Szabo
sszabo@megazone.bigpanda.com
In reply to: Stephan Szabo (#5)

On Fri, 22 Aug 2003, Stephan Szabo wrote:

On Fri, 22 Aug 2003, Tom Lane wrote:

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

On 22 Aug 2003, Greg Stark wrote:

If it's deemed a reasonable approach and nobody has any fatal flaws then I
expect it would be useful to put in the contrib directory?

I'm not sure that ERROR if the locale cannot be put back is sufficient
(although that case should be rare or non-existant).

A bigger risk is that something might elog(ERROR) while you have the
"wrong" locale set, denying you the chance to put back the right one.
I think this code is not nearly paranoid enough about how much it does
while the wrong locale is set.

True, there are calls to palloc, elog, etc inside there, although the elog
could be removed.

Since most of that work is for an exceptional case, maybe it'd be safer
(although slower) to structure the function as

setlocale
call strxfrm (and that's it)
setlocale back
if there wasn't enough space
make a new buffer
setlocale
call strxfrm (and that's it)
setlocale back

Probably putting the sl/strxfrm/sl into its own function.

#7Greg Stark
gsstark@mit.edu
In reply to: Stephan Szabo (#6)

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

Since most of that work is for an exceptional case, maybe it'd be safer
(although slower) to structure the function as

Yeah I thought of that. But if making it a critical section is cheap then it's
probably a better approach. The problem with restoring the locale for the
palloc is that if the user is unlucky he might sort a table of thousands of
strings that all trigger the exception case.

The glibc docs sample code suggests using 2x the original string length for
the initial buffer. My testing showed that *always* triggered the exceptional
case. A bit of experimentation lead to the 3x+4 which eliminates it except for
0 and 1 byte strings. I'm still tweaking it. But on another OS, or in a more
complex collation locale maybe you would still trigger it a lot. Even as it is
if you happy to try to sort a large list of single character strings you would
trigger it a lot.

I have some documentation reading to do apparently before I can fix this up.

setlocale
call strxfrm (and that's it)
setlocale back
if there wasn't enough space
make a new buffer
setlocale
call strxfrm (and that's it)
setlocale back

Probably putting the sl/strxfrm/sl into its own function.

--
greg

#8Stephan Szabo
sszabo@megazone.bigpanda.com
In reply to: Greg Stark (#7)

On 23 Aug 2003, Greg Stark wrote:

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

Since most of that work is for an exceptional case, maybe it'd be safer
(although slower) to structure the function as

Yeah I thought of that. But if making it a critical section is cheap then it's
probably a better approach. The problem with restoring the locale for the
palloc is that if the user is unlucky he might sort a table of thousands of
strings that all trigger the exception case.

True. I still worry about the critical section since an error will cause
the entire database system to restart, but as I'd also forgotten about
signals which might cause problems (uncertain without looking) I'm not
sure you have a good alternative even not counting the speed issues.

#9Joe Conway
mail@joeconway.com
In reply to: Greg Stark (#7)

Greg Stark wrote:

Yeah I thought of that. But if making it a critical section is cheap then it's
probably a better approach. The problem with restoring the locale for the
palloc is that if the user is unlucky he might sort a table of thousands of
strings that all trigger the exception case.

What about something like this?
8<--------------------------------

#include <setjmp.h>
#include <string.h>

#include "postgres.h"
#include "fmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"

#define GET_STR(textp) \
DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
#define GET_BYTEA(str_) \
DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
#define MAX_BYTEA_LEN 0x3fffffff

/*
* pg_strxfrm - Function to convert string similar to the strxfrm C
* function using a specified locale.
*/
extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(pg_strxfrm);

Datum
pg_strxfrm(PG_FUNCTION_ARGS)
{
char *str = GET_STR(PG_GETARG_TEXT_P(0));
size_t str_len = strlen(str);
char *localestr = GET_STR(PG_GETARG_TEXT_P(1));
size_t approx_trans_len = 4 + (str_len * 3);
char *trans = (char *) palloc(approx_trans_len + 1);
size_t actual_trans_len;
char *oldlocale;
char *newlocale;
sigjmp_buf save_restart;

if (approx_trans_len > MAX_BYTEA_LEN)
elog(ERROR, "source string too long to transform");

oldlocale = setlocale(LC_COLLATE, NULL);
if (!oldlocale)
elog(ERROR, "setlocale failed to return a locale");

/* catch elog while locale is set other than the default */
memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
if (sigsetjmp(Warn_restart, 1) != 0)
{
memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
newlocale = setlocale(LC_COLLATE, oldlocale);
if (!newlocale)
elog(PANIC, "setlocale failed to reset locale: %s", localestr);
siglongjmp(Warn_restart, 1);
}

newlocale = setlocale(LC_COLLATE, localestr);
if (!newlocale)
elog(ERROR, "setlocale failed to set a locale: %s", localestr);

actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

/* if the buffer was not large enough, resize it and try again */
if (actual_trans_len >= approx_trans_len)
{
approx_trans_len = actual_trans_len + 1;
if (approx_trans_len > MAX_BYTEA_LEN)
elog(ERROR, "source string too long to transform");

trans = (char *) repalloc(trans, approx_trans_len + 1);
actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

/* if the buffer still not large enough, punt */
if (actual_trans_len >= approx_trans_len)
elog(ERROR, "strxfrm failed, buffer insufficient");
}

newlocale = setlocale(LC_COLLATE, oldlocale);
if (!newlocale)
elog(PANIC, "setlocale failed to reset locale: %s", localestr);

PG_RETURN_BYTEA_P(GET_BYTEA(trans));
}

8<--------------------------------

Joe

#10Joe Conway
mail@joeconway.com
In reply to: Joe Conway (#9)

Joe Conway wrote:

What about something like this?

Oops! Forgot to restrore error handling. See below:

Joe

8<--------------------------------

#include <setjmp.h>
#include <string.h>

#include "postgres.h"
#include "fmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"

#define GET_STR(textp) \
DatumGetCString(DirectFunctionCall1(textout, PointerGetDatum(textp)))
#define GET_BYTEA(str_) \
DatumGetTextP(DirectFunctionCall1(byteain, CStringGetDatum(str_)))
#define MAX_BYTEA_LEN 0x3fffffff

/*
* pg_strxfrm - Function to convert string similar to the strxfrm C
* function using a specified locale.
*/
extern Datum pg_strxfrm(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(pg_strxfrm);

Datum
pg_strxfrm(PG_FUNCTION_ARGS)
{
char *str = GET_STR(PG_GETARG_TEXT_P(0));
size_t str_len = strlen(str);
char *localestr = GET_STR(PG_GETARG_TEXT_P(1));
size_t approx_trans_len = 4 + (str_len * 3);
char *trans = (char *) palloc(approx_trans_len + 1);
size_t actual_trans_len;
char *oldlocale;
char *newlocale;
sigjmp_buf save_restart;

if (approx_trans_len > MAX_BYTEA_LEN)
elog(ERROR, "source string too long to transform");

oldlocale = setlocale(LC_COLLATE, NULL);
if (!oldlocale)
elog(ERROR, "setlocale failed to return a locale");

/* catch elog while locale is set other than the default */
memcpy(&save_restart, &Warn_restart, sizeof(save_restart));
if (sigsetjmp(Warn_restart, 1) != 0)
{
memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
newlocale = setlocale(LC_COLLATE, oldlocale);
if (!newlocale)
elog(PANIC, "setlocale failed to reset locale: %s", localestr);
siglongjmp(Warn_restart, 1);
}

newlocale = setlocale(LC_COLLATE, localestr);
if (!newlocale)
elog(ERROR, "setlocale failed to set a locale: %s", localestr);

actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

/* if the buffer was not large enough, resize it and try again */
if (actual_trans_len >= approx_trans_len)
{
approx_trans_len = actual_trans_len + 1;
if (approx_trans_len > MAX_BYTEA_LEN)
elog(ERROR, "source string too long to transform");

trans = (char *) repalloc(trans, approx_trans_len + 1);
actual_trans_len = strxfrm(trans, str, approx_trans_len + 1);

/* if the buffer still not large enough, punt */
if (actual_trans_len >= approx_trans_len)
elog(ERROR, "strxfrm failed, buffer insufficient");
}

newlocale = setlocale(LC_COLLATE, oldlocale);
if (!newlocale)
elog(PANIC, "setlocale failed to reset locale: %s", localestr);

/* restore normal error handling */
memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));

Show quoted text

PG_RETURN_BYTEA_P(GET_BYTEA(trans));
}

8<--------------------------------

#11Greg Stark
gsstark@mit.edu
In reply to: Joe Conway (#10)

Joe Conway <mail@joeconway.com> writes:

if (sigsetjmp(Warn_restart, 1) != 0)
{
memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
newlocale = setlocale(LC_COLLATE, oldlocale);
if (!newlocale)
elog(PANIC, "setlocale failed to reset locale: %s", localestr);
siglongjmp(Warn_restart, 1);
}

Well presumably we want FATAL not PANIC.

And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ?

I was afraid that was getting into bed too much with the error handling. I
have an implementation that restores the locale around the palloc and
increases the initial guess for future calls to avoid degenerate behaviour.
I'm not sure which approach is preferable.

--
greg

#12Joe Conway
mail@joeconway.com
In reply to: Greg Stark (#11)

Greg Stark wrote:

Joe Conway <mail@joeconway.com> writes:

if (sigsetjmp(Warn_restart, 1) != 0)
{
memcpy(&Warn_restart, &save_restart, sizeof(Warn_restart));
newlocale = setlocale(LC_COLLATE, oldlocale);
if (!newlocale)
elog(PANIC, "setlocale failed to reset locale: %s", localestr);
siglongjmp(Warn_restart, 1);
}

Well presumably we want FATAL not PANIC.

Yeah, that was a bit overzealous. I really intended FATAL.

And do we still need HOLD_INTERRUPTS() .. RESUME_INTERRUPTS() ?

I'm not sure, but I think not.

I was afraid that was getting into bed too much with the error handling. I
have an implementation that restores the locale around the palloc and
increases the initial guess for future calls to avoid degenerate behaviour.

Well the intention of the sigsetjmp was to avoid the need to flip the
locale to-and-fro. Increasing the initial guess might be good, but it
will further restrict the length of the input string you can work with.
But I'd guess you'll not want to use this with extremely long strings
anyway.

Joe

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephan Szabo (#5)

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

On Fri, 22 Aug 2003, Tom Lane wrote:

I'd go so far as to make it a critical section --- that ensures that any
ERROR will be turned to FATAL, even if it's in a subroutine you call.

I didn't know we could do that, could be handy, although the comments
imply that it turns into PANIC which would force a complete restart.

Right, my imprecision, it actually goes to PANIC.

Then again, it's better than a corrupted database.

Exactly.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#7)

Greg Stark <gsstark@mit.edu> writes:

The glibc docs sample code suggests using 2x the original string
length for the initial buffer. My testing showed that *always*
triggered the exceptional case. A bit of experimentation lead to the
3x+4 which eliminates it except for 0 and 1 byte strings. I'm still
tweaking it. But on another OS, or in a more complex collation locale
maybe you would still trigger it a lot.

On HPUX it seems you always need 4x. Also, *there are bugs* in some
platforms' implementations of strxfrm, such that an undersized buffer
may get overrun anyway. I had originally tried to optimize the buffer
size like this in src/backend/utils/adt/selfuncs.c's use of strxfrm,
and eventually was forced to give it up as hopeless. I strongly suggest
using the same code now seen there:

char *xfrmstr;
size_t xfrmlen;
size_t xfrmlen2;

/*
* Note: originally we guessed at a suitable output buffer size,
* and only needed to call strxfrm twice if our guess was too
* small. However, it seems that some versions of Solaris have
* buggy strxfrm that can write past the specified buffer length
* in that scenario. So, do it the dumb way for portability.
*
* Yet other systems (e.g., glibc) sometimes return a smaller value
* from the second call than the first; thus the Assert must be <=
* not == as you'd expect. Can't any of these people program
* their way out of a paper bag?
*/
xfrmlen = strxfrm(NULL, val, 0);
xfrmstr = (char *) palloc(xfrmlen + 1);
xfrmlen2 = strxfrm(xfrmstr, val, xfrmlen + 1);
Assert(xfrmlen2 <= xfrmlen);

regards, tom lane