suitable text search configuration

Started by Alvaro Herrera, over 18 years ago, 9 messages, hackers

Alvaro Herrera

alvherre@2ndquadrant.com

over 18 years ago

Hi,

Is initdb supposed to pick up reasonable TS configurations in general?

If so, it's failing for me:

initdb: could not find suitable text search configuration for locale fr_CA.UTF-8
The default text search configuration will be set to "simple".

It fails for es_CL as well.

... oh, I see there's a table in initdb.c

Are we supposed to add entries to it, one for each country? I'm
wondering if we should try to match the part before the _ using just the
language, if the complete match fails. (i.e. match "es_CL" using just
"es", "fr_CA" using just "fr", etc).

--
Alvaro Herrera http://www.PlanetPostgreSQL.org/
"When the proper man does nothing (wu-wei),
his thought is felt ten thousand miles." (Lao Tse)

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Alvaro Herrera (#1)

Re: suitable text search configuration

Alvaro Herrera <alvherre@commandprompt.com> writes:

... oh, I see there's a table in initdb.c

Are we supposed to add entries to it, one for each country? I'm
wondering if we should try to match the part before the _ using just the
language, if the complete match fails. (i.e. match "es_CL" using just
"es", "fr_CA" using just "fr", etc).

Actually, looking at the examples so far, I'm thinking we should just
consider the string up to the first _, period.

An alternative is to try to match the full locale (es_ES) and then try
the language (es) if that wasn't found. That would leave room to put
country-by-country exceptions in, but for the moment we'd not have any.

regards, tom lane
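
The alternative described above (exact locale match first, then the bare language) can be sketched roughly as follows. This is only an illustration in Python, not the C code in initdb.c: the table names, the function name, and the handful of entries are hypothetical, and stripping the encoding suffix and "@" modifier is an added assumption. The "simple" fallback is the one initdb reports in the message quoted at the top of the thread.

```python
# Hypothetical lookup tables for illustration; not the table in initdb.c.
# Country-specific exceptions would go here; none exist for the moment.
EXACT_LOCALE_CONFIGS = {}

# Language prefix -> text search configuration name (small sample only).
LANGUAGE_CONFIGS = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "pt": "portuguese",
}


def find_ts_config(locale_name: str) -> str:
    """Try the full locale (e.g. es_ES) first, then the language (es)."""
    # Assumption: drop an encoding suffix (".UTF-8") and a modifier ("@euro").
    base = locale_name.split(".", 1)[0].split("@", 1)[0]

    # Step 1: exact match, leaving room for country-by-country exceptions.
    if base in EXACT_LOCALE_CONFIGS:
        return EXACT_LOCALE_CONFIGS[base]

    # Step 2: fall back to the part before the first underscore.
    language = base.split("_", 1)[0]
    return LANGUAGE_CONFIGS.get(language, "simple")
```

With the exceptions table empty, this behaves the same as looking at the language prefix alone; the first step only matters once a country-specific entry is added.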

Andrew Dunstan

andrew@dunslane.net

over 18 years ago

In reply to: Tom Lane (#2)

Re: suitable text search configuration

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

... oh, I see there's a table in initdb.c

Are we supposed to add entries to it, one for each country? I'm
wondering if we should try to match the part before the _ using just the
language, if the complete match fails. (i.e. match "es_CL" using just
"es", "fr_CA" using just "fr", etc).

Actually, looking at the examples so far, I'm thinking we should just
consider the string up to the first _, period.

An alternative is to try to match the full locale (es_ES) and then try
the language (es) if that wasn't found. That would leave room to put
country-by-country exceptions in, but for the moment we'd not have any.

Can anyone point to a real world example where country by country would
make sense? If we need to distinguish flavors of some languages, I would
not be at all surprised if this was not by country anyway.

cheers

andrew

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Andrew Dunstan (#3)

Re: suitable text search configuration

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Actually, looking at the examples so far, I'm thinking we should just
consider the string up to the first _, period.

Can anyone point to a real world example where country by country would
make sense?

For the current set of built-in dictionaries it seems pretty clear that
country distinctions are useless. If we ever did need that distinction
it would only be after adding dictionaries that aren't going to be in
8.3 ... so I'm leaning to keeping the code simple for the moment.

regards, tom lane

Alvaro Herrera

alvherre@2ndquadrant.com

over 18 years ago

In reply to: Andrew Dunstan (#3)

Re: suitable text search configuration

Andrew Dunstan wrote:

Tom Lane wrote:

Actually, looking at the examples so far, I'm thinking we should just
consider the string up to the first _, period.

I studied the standards a bit to see if they mandated that the locale
names must be in the form "language_COUNTRY", and couldn't find
anything. Which makes me think it's mostly by (very well established)
convention. I think trying to parse the _ should not be done on a first
attempt.

An alternative is to try to match the full locale (es_ES) and then try
the language (es) if that wasn't found. That would leave room to put
country-by-country exceptions in, but for the moment we'd not have any.

Can anyone point to a real world example where country by country would
make sense? If we need to distinguish flavors of some languages, I would
not be at all surprised if this was not by country anyway.

pt_BR versus pt_PT. I'm not sure if it makes a difference to a stemmer,
but maybe to a thesaurus it does ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Alvaro Herrera (#5)

Re: suitable text search configuration

Alvaro Herrera <alvherre@commandprompt.com> writes:

Andrew Dunstan wrote:

Can anyone point to a real world example where country by country would
make sense? If we need to distinguish flavors of some languages, I would
not be at all surprised if this was not by country anyway.

pt_BR versus pt_PT. I'm not sure if it makes a difference to a stemmer,
but maybe to a thesaurus it does ...

Right, but only when we have built-in dictionaries that separately
address the two countries will there be any need to teach initdb about
it. I think we should KISS for now.

regards, tom lane

Alvaro Herrera

alvherre@2ndquadrant.com

over 18 years ago

In reply to: Tom Lane (#2)

Re: suitable text search configuration

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

... oh, I see there's a table in initdb.c

Are we supposed to add entries to it, one for each country? I'm
wondering if we should try to match the part before the _ using just the
language, if the complete match fails. (i.e. match "es_CL" using just
"es", "fr_CA" using just "fr", etc).

Actually, looking at the examples so far, I'm thinking we should just
consider the string up to the first _, period.

I found that there is an ISO spec for "cultural elements", ISO/IEC
15897, a working draft for which can be found at
http://www.open-std.org/jtc1/sc22/open/n3586.pdf

Chapter 13 talks about naming of locales.

I think glibc is supposed to follow this standard.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Tom Lane

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Tom Lane (#4)

Re: suitable text search configuration

Have we got consensus that initdb should just look at the first
component of the locale name to choose a text search configuration
(at least for 8.3)? If so, who's going to make the change?
I can do it but don't want to duplicate effort if someone else
was already on it.

regards, tom lane
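
For reference, the simpler rule proposed here (look only at the first component of the locale name) might look like the sketch below. It is a Python illustration under stated assumptions, not the actual initdb change: the function name and the sample table are hypothetical, and the warning text is modelled on the initdb output quoted in the first message.

```python
import sys

# Hypothetical sample of language prefix -> configuration name.
LANGUAGE_CONFIGS = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "pt": "portuguese",
}


def default_ts_config(lc_ctype: str) -> str:
    """Choose a configuration from the text before the first underscore."""
    language = lc_ctype.split("_", 1)[0]
    config = LANGUAGE_CONFIGS.get(language)
    if config is None:
        # Mirror the warning initdb prints when nothing matches.
        print(
            "could not find suitable text search configuration "
            f"for locale {lc_ctype}",
            file=sys.stderr,
        )
        print(
            'The default text search configuration will be set to "simple".',
            file=sys.stderr,
        )
        return "simple"
    return config
```

Under this rule "fr_CA.UTF-8" and "es_CL" resolve through "fr" and "es", while a name with no matching prefix, such as "C", falls back to "simple". A name without an underscore but with an encoding suffix (for example "fr.UTF-8") is not handled by this sketch.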

Alvaro Herrera

alvherre@2ndquadrant.com

over 18 years ago

In reply to: Tom Lane (#8)

Re: suitable text search configuration

Tom Lane wrote:

Have we got consensus that initdb should just look at the first
component of the locale name to choose a text search configuration
(at least for 8.3)? If so, who's going to make the change?
I can do it but don't want to duplicate effort if someone else
was already on it.

Thanks, it works wonderfully for me now.

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Ni aun el genio muy grande llegaría muy lejos
si tuviera que sacarlo todo de su propio interior" (Goethe)