tsearch in core patch
http://www.sigaev.ru/misc/tsearch_core-0.52.gz
Plan was:
1) rename FULLTEXT to TEXT SEARCH in SQL command
done
2) rework Snowball stemmers as Tom suggested
done
3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
done
4) remove support for a default configuration per schema. There will be only
one default configuration per locale.
done
5) files in a single encoding. That will touch the snowball, ispell, synonym, thesaurus
and simple dictionaries
done
6) use encoding names instead of locale names in the configuration
Ugh. I missed that knowing the encoding does not let us determine the exact
language: how many languages use the ISO8859-1 locale? So it's not done. Tom
pointed out that locale names aren't portable, but there aren't many names for the
same locale (ru_RU.UTF-8, ru_RU.UTF8 for example). So it's possible to use an array
of locales instead of one name.
I didn't see any comments about the security hole Tom pointed out, so I repeat:
About security holes in PARSER/DICTIONARY. I see the following ways to resolve it now:
1) Allow only superusers to run CREATE/ALTER/DROP PARSER/DICTIONARY
Disadvantage: hosting users will not be able to change dictionaries
2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict into pg_ts_dict_template
and pg_ts_dict, and change CREATE/ALTER/DROP DICTIONARY accordingly
Disadvantage: parsers and dictionary templates will not be dumped/restored;
they must be restored manually (just an INSERT into
pg_ts_parser/pg_ts_dict_template)
3) Similar to the previous point, but:
* CREATE/ALTER/DROP PARSER - superuser only
* CREATE/ALTER/DROP DICTIONARY TEMPLATE - superuser only
* CREATE/ALTER/DROP DICTIONARY - allowed for non-superusers
Disadvantage: new command CREATE/ALTER/DROP DICTIONARY TEMPLATE
Which way do we choose? Or am I missing some variant?
I would like to go with option 3)... Comments?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
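For readers following the three options, option 3 can be pictured as a small privilege check. This is only an illustrative sketch in Python, not code from the patch: the function name may_modify and the set SUPERUSER_ONLY are hypothetical, and the real check would live in the backend's command processing.

```python
# Hypothetical sketch of option 3: which roles may run CREATE/ALTER/DROP
# on each kind of text search object. Names here are illustrative only.
SUPERUSER_ONLY = {"PARSER", "DICTIONARY TEMPLATE"}
OPEN_TO_ALL = {"DICTIONARY"}


def may_modify(object_kind: str, is_superuser: bool) -> bool:
    """Return True if the role may CREATE/ALTER/DROP the given object kind."""
    if object_kind in SUPERUSER_ONLY:
        return is_superuser
    if object_kind in OPEN_TO_ALL:
        return True
    raise ValueError(f"unknown object kind: {object_kind}")
```

Under this reading, a hosting user can still build a DICTIONARY from an existing template, while anything that names C-level functions stays superuser-only.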
One fine day, Thu, 2007-06-21 at 21:44, Teodor Sigaev wrote:
http://www.sigaev.ru/misc/tsearch_core-0.52.gz
Plan was:
1) rename FULLTEXT to TEXT SEARCH in SQL command
done
2) rework Snowball stemmers as Tom suggested
done
3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
done
Why not rename ALTER FULLTEXT CONFIGURATION --> ALTER TEXT SEARCH
CONFIGURATION here too ?
4) remove support for a default configuration per schema. There will be only
one default configuration per locale.
done
5) files in a single encoding. That will touch the snowball, ispell, synonym, thesaurus
and simple dictionaries
done
6) use encoding names instead of locale's names in configuration
Ugh. I missed that knowledge of encoding doesn't allow to determine exact
language
most languages can be written using UNICODE charset and UTF-8 encoding,
so neither charset nor encoding can be used to determine language.
--- how do many languages use ISO8859-1 locale?.
ISO8859-1 is an encoding, not a locale.
Hannu Krosing <hannu@skype.net> writes:
One fine day, Thu, 2007-06-21 at 21:44, Teodor Sigaev wrote:
6) use encoding names instead of locale's names in configuration
Ugh. I missed that knowledge of encoding doesn't allow to determine exact
language
most languages can be written using UNICODE charset and UTF-8 encoding,
so neither charset nor encoding can be used to determine language.
The recommendation I was making was to use the language name, not the
encoding name, in the user-visible configuration.
regards, tom lane
3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
done
Why not rename ALTER FULLTEXT CONFIGURATION --> ALTER TEXT SEARCH
CONFIGURATION here too ?
It's renamed too.
most languages can be written using UNICODE charset and UTF-8 encoding,
so neither charset nor encoding can be used to determine language.
yes
--- how do many languages use ISO8859-1 locale?.
ISO8859-1 is encoding, not locale.
I meant that if we use an encoding name (for example PG_LATIN1) we can't
distinguish the languages that use that encoding (for example Italian and Finnish,
among others), but with locale names it is possible: it_IT.ISO8859-1,
fi_FI.ISO8859-1
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
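The point about it_IT.ISO8859-1 and fi_FI.ISO8859-1 can be made concrete by splitting a POSIX-style locale name into its parts. The following is a minimal sketch, assuming the common language[_territory][.codeset] layout; LocaleParts and split_locale are hypothetical names, and modifiers such as "@euro" are not handled.

```python
from typing import NamedTuple, Optional


class LocaleParts(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]


def split_locale(name: str) -> LocaleParts:
    """Split 'language[_territory][.codeset]' into its three parts."""
    base, _, codeset = name.partition(".")
    language, _, territory = base.partition("_")
    return LocaleParts(language, territory or None, codeset or None)


italian = split_locale("it_IT.ISO8859-1")
finnish = split_locale("fi_FI.ISO8859-1")

# Same codeset, different language: the codeset cannot identify the language.
same_codeset = italian.codeset == finnish.codeset
different_language = italian.language != finnish.language

# Two spellings of one locale differ only in the codeset part.
same_locale_two_spellings = (
    split_locale("ru_RU.UTF-8")[:2] == split_locale("ru_RU.UTF8")[:2]
)
```

Only the codeset part varies between ru_RU.UTF-8 and ru_RU.UTF8, which is the portability problem raised in the thread.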
The recommendation I was making was to use the language name, not the
encoding name, in the user-visible configuration.
How would it determine the language of the db automatically?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev wrote:
The recommendation I was making was to use the language name, not the
encoding name, in the user-visible configuration.
How does it determine language of db automatically?
I don't think we are going to do language selection automatically ---
the user is going to have to set tsearch_conf_name.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
I don't think we are going to do language selection automatically ---
the user is going to have to set tsearch_conf_name.
Are you suggesting that we remove a long-standing feature of tsearch? In that case we
don't need the cfglocale (or cfglanguage, as Tom suggested) and cfgdefault columns in
pg_ts_cfg at all. Just set tsearch_conf_name.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev wrote:
--- how do many languages use ISO8859-1 locale?.
ISO8859-1 is encoding, not locale.
I meant, if we'll use encoding name (for example PG_LATIN1) we couldn't
distinguish languages which use that encoding (for example italian and
finnish and some more), but using locale names it's possible:
it_IT.ISO8859-1, fi_FI.ISO8859-1
I don't understand. Why use "it_IT.ISO8859-1"? You just need to know
the language, so "it" is enough. The _IT part specifies that it's the
italian spoken in Italy. This may be irrelevant in most cases, but
consider that pt_PT and pt_BR are AFAIK somewhat different languages.
I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure. Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).
And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.
--
Alvaro Herrera Developer, http://www.PostgreSQL.org/
"Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)
Teodor Sigaev <teodor@sigaev.ru> writes:
I don't think we are going to do language selection automatically ---
the user is going to have to set tsearch_conf_name.
Are you suggesting that we remove a long-standing feature of tsearch? In that case we
don't need the cfglocale (or cfglanguage, as Tom suggested) and cfgdefault columns in
pg_ts_cfg at all. Just set tsearch_conf_name.
Is the point here for initdb to be able to establish a sane default
initially? Seems to me it can guess the language from the first
component of the locale (ru_RU -> russian).
regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure. Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).
And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.
Also, the problem we're dealing with here is mainly lack of
standardization of the encoding part of locale names. AFAIK, just about
everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
after that (if any) that is not too consistent across platforms.
So I see no problem in distinguishing between pt_PT and pt_BR if it
turns out we have to. The trick is to not look at any more of the
locale name than that; and if we standardize on "stopword files are
UTF8" then I don't think we need to.
regards, tom lane
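The idea of guessing from the first component while ignoring the codeset might look like the sketch below. It is an assumption-laden illustration, not initdb code: LANGUAGE_BY_PREFIX and guess_language are hypothetical, the table is deliberately tiny, and the "portuguese_brazil" entry only shows how a territory-specific override could coexist with a language-only default.

```python
from typing import Optional

# Hypothetical lookup table; a real one would be much longer.
LANGUAGE_BY_PREFIX = {
    "ru": "russian",
    "es": "spanish",
    "pt": "portuguese",
    "pt_BR": "portuguese_brazil",  # territory-specific entry wins over "pt"
}


def guess_language(locale_name: str) -> Optional[str]:
    """Guess a language from a locale name, never looking past the territory."""
    # Drop the codeset and any modifier: "ru_RU.UTF-8" -> "ru_RU".
    prefix = locale_name.split(".", 1)[0].split("@", 1)[0]
    if prefix in LANGUAGE_BY_PREFIX:
        return LANGUAGE_BY_PREFIX[prefix]
    # Fall back to the language code alone: "es_CL" -> "es".
    return LANGUAGE_BY_PREFIX.get(prefix.split("_", 1)[0])
```

With this shape, es_ES, es_PE, es_AR and es_CL all resolve through the single "es" entry, and pt_PT and pt_BR can still be told apart if that turns out to be necessary.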
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure. Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).
And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.
Also, the problem we're dealing with here is mainly lack of
standardization of the encoding part of locale names. AFAIK, just about
everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
after that (if any) that is not too consistent across platforms.
So I see no problem in distinguishing between pt_PT and pt_BR if it
turns out we have to. The trick is to not look at any more of the
locale name than that; and if we standardize on "stopword files are
UTF8" then I don't think we need to.
OK, and the open question is when do we do this default setting. If we
do it in initdb then we can isolate all the detection there.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Fri, 22 Jun 2007, Bruce Momjian wrote:
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure. Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).
And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.
Also, the problem we're dealing with here is mainly lack of
standardization of the encoding part of locale names. AFAIK, just about
everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
after that (if any) that is not too consistent across platforms.
So I see no problem in distinguishing between pt_PT and pt_BR if it
turns out we have to. The trick is to not look at any more of the
locale name than that; and if we standardize on "stopword files are
UTF8" then I don't think we need to.
OK, and the open question is when do we do this default setting. If we
do it in initdb then we can isolate all the detection there.
We can do that at initdb time, but we still have to decide how to map the
human-readable language name to the language part of the locale name. Are we going
to hardcode it?
That's not friendly for hosting setups, where people often have no access
to postgresql.conf, so they would need to remember to set tsearch_conf_name.
It could be solved using the 'alter user ... set tsearch_conf_name' command, though.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure. Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).
And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.
Also, the problem we're dealing with here is mainly lack of
standardization of the encoding part of locale names. AFAIK, just about
everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
after that (if any) that is not too consistent across platforms.
That may have been true until we started supporting Windows...
Swedish_Sweden.1252 is what I get on my machine, for example. Principle
is the same, but values certainly aren't.
//Magnus
Magnus Hagander wrote:
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
I very much doubt that the different spanishes are any different in the
stemming rules, so there's no need for es_ES, es_PE, es_AR, es_CL etc;
but in the case of portuguese I'm not so sure. Maybe there are other
examples (like chinese, but I'm not sure how useful is tsearch for
chinese).
And the .ISO8859-1 part you don't need at all if you accept that the
files are UTF8 by design, as Tom proposed.
Also, the problem we're dealing with here is mainly lack of
standardization of the encoding part of locale names. AFAIK, just about
everybody agrees on "es_ES", "ru_RU", etc; it's the part that comes
after that (if any) that is not too consistent across platforms.
That may have been true until we started supporting Windows...
Swedish_Sweden.1252 is what I get on my machine, for example. Principle
is the same, but values certainly aren't.
Well, at least the name is not itself translated, so a mapping table is
not right out of the question. If they had put a name like
"Español_Chile" instead of "Spanish_Chile" we would be in serious
trouble.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
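Since Windows reports names such as Swedish_Sweden.1252 with the language spelled out in English, a plain lookup table keyed on that first word is feasible. A minimal sketch follows, assuming the Language_Country.codepage shape seen in the thread; WINDOWS_LANGUAGE_NAMES and language_from_windows_locale are hypothetical names and the table is far from complete.

```python
from typing import Optional

# Hypothetical table from the English language word used by Windows
# to a tsearch language name.
WINDOWS_LANGUAGE_NAMES = {
    "swedish": "swedish",
    "spanish": "spanish",
    "russian": "russian",
}


def language_from_windows_locale(name: str) -> Optional[str]:
    """Map 'Language_Country.codepage' to a language, or None if unknown."""
    language_word = name.split(".", 1)[0].split("_", 1)[0]
    return WINDOWS_LANGUAGE_NAMES.get(language_word.lower())
```

A translated name such as "Español_Chile" would simply miss the table, which is why untranslated names keep the mapping approach workable.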
On Jun 22, 2007, at 9:28 , Tom Lane wrote:
Is the point here for initdb to be able to establish a sane default
initially? Seems to me it can guess the language from the first
component of the locale (ru_RU -> russian).
How would this work for initdb with locale C?
Michael Glaesemann
grzm seespotcode net
That may have been true until we started supporting Windows...
Swedish_Sweden.1252 is what I get on my machine, for example. Principle
is the same, but values certainly aren't.
Well, at least the name is not itself translated, so a mapping table is
not right out of the question. If they had put a name like
"Español_Chile" instead of "Spanish_Chile" we would be in serious
trouble.
I don't think so; otherwise you couldn't type or display the name in order to change
the locale :).
So, final proposal:
rename cfglocale to cfglanguages and store in it an array of language names
taken from the first part of the locale names:
russian '{ru_RU, Russian_Russia}'
spanish '{es_ES, es_CL, Spanish_Spain, Spanish_Chile}'
Comments?
Are there any obstacles to using GIN indexes in pg_catalog?
Michael Glaesemann wrote:
On Jun 22, 2007, at 9:28 , Tom Lane wrote:
Is the point here for initdb to be able to establish a sane default
initially? Seems to me it can guess the language from the first
component of the locale (ru_RU -> russian).
How would this work for initdb with locale C?
Yea, that's a problem. I am thinking we should just avoid the entire
issue and require it to be set by the user, and throw an error if the
configuration is not set.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Jun 22, 2007, at 9:28 , Tom Lane wrote:
Is the point here for initdb to be able to establish a sane default
initially? Seems to me it can guess the language from the first
component of the locale (ru_RU -> russian).
How would this work for initdb with locale C?
I'm worrying about that too.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
teodor@sigaev.ru wrote:
So, final proposal:
rename cfglocale to cfglanguages and store in it an array of language names
taken from the first part of the locale names:
russian '{ru_RU, Russian_Russia}'
spanish '{es_ES, es_CL, Spanish_Spain, Spanish_Chile}'
Comments?
Why not do it the other way around?
es_ES spanish
Spanish_Spain spanish
ru_RU russian
pt_BR portuguese_brazil
That way you don't need any funny index. Or do you need the list of
locales for each language? (but even if you do, you can easily obtain it
by indexing both columns separately using btrees anyway)
--
Alvaro Herrera http://www.PlanetPostgreSQL.org/
"I can see support will not be a problem. 10 out of 10." (Simon Wittber)
(http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
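The two layouts under discussion (an array of locale prefixes per language versus one row per locale prefix) carry the same information, and one can be derived from the other. The sketch below illustrates that with plain Python dictionaries standing in for catalog rows; by_language and invert are hypothetical names, not part of the patch.

```python
from typing import Dict, List

# Layout 1: one entry per language, holding an array of locale prefixes.
by_language: Dict[str, List[str]] = {
    "russian": ["ru_RU", "Russian_Russia"],
    "spanish": ["es_ES", "es_CL", "Spanish_Spain", "Spanish_Chile"],
}


def invert(languages: Dict[str, List[str]]) -> Dict[str, str]:
    """Layout 2: one entry per locale prefix, pointing at its language."""
    by_locale: Dict[str, str] = {}
    for language, locales in languages.items():
        for locale_prefix in locales:
            if locale_prefix in by_locale:
                raise ValueError(f"{locale_prefix} mapped to two languages")
            by_locale[locale_prefix] = language
    return by_locale


by_locale = invert(by_language)
```

Looking up a locale in the second layout is an ordinary equality match, which is the sense in which it needs no special index; the first layout needs an "array contains" search instead.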
teodor@sigaev.ru wrote:
Why not do it the other way around?
es_ES spanish
Spanish_Spain spanish
ru_RU russian
pt_BR portuguese_brazil
That way you don't need any funny index. Or do you need the list of
locales for each language? (but even if you do, you can easily obtain it
by indexing both columns separately using btrees anyway)
Yes, that's possible, but that increases the number of identical configurations:
russian_win Russian_Russia
russian_unix ru_RU
They don't differ except in the locale name.
But why do you need them to be different at all? Just make it
russian Russian_Russia
russian ru_RU
Does that not work for some reason?
What I was really suggesting was having a table mapping locale names
into "tsearch languages". Then the configuration could be made based on
the language, not on the locale name. So the stopword list is for
"russian", regardless of whether the locale is Russian_Russia or ru_RU.
Is this only for the stopword list, or does it also affect selecting a
stemmer?
Note: it's possible that the stopword list is different for brazilian
portuguese than portuguese portuguese, which is why I was suggesting
using a language "portuguese_brazil" and not just "portuguese". Whereas
you need a single stopword list for all the countries speaking spanish,
which is why you need only one language called spanish.
--
Alvaro Herrera http://www.advogato.org/person/alvherre
"Llegará una época en la que una investigación diligente y prolongada sacará
a la luz cosas que hoy están ocultas" (Séneca, siglo I)
Why not do it the other way around?
es_ES spanish
Spanish_Spain spanish
ru_RU russian
pt_BR portuguese_brazil
That way you don't need any funny index. Or do you need the list of
locales for each language? (but even if you do, you can easily obtain it
by indexing both columns separately using btrees anyway)
Yes, that's possible, but that increases the number of identical configurations:
russian_win Russian_Russia
russian_unix ru_RU
They don't differ except in the locale name.
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
On Jun 22, 2007, at 9:28 , Tom Lane wrote:
Is the point here for initdb to be able to establish a sane default
initially? Seems to me it can guess the language from the first
component of the locale (ru_RU -> russian).
How would this work for initdb with locale C?
I'm worrying about that too.
I would be surprised if C locale defaulted to anything except English.
I suppose it would be sensible to add a switch to allow people to select
a different language. In any case, the only thing initdb would be doing
would be setting up an initial value of a table entry or GUC variable,
so you could always change it yourself later; it may not be worth
sweating too much about this.
regards, tom lane
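The behaviour described here (C locale defaults to English, with a switch to pick something else) could be sketched as follows. This is not initdb code: initial_language, its override parameter, and the guesses table are hypothetical stand-ins for whatever switch and lookup table would actually be used.

```python
from typing import Mapping, Optional


def initial_language(
    locale_name: str,
    guesses: Mapping[str, str],
    override: Optional[str] = None,
) -> Optional[str]:
    """Pick the initial text search language for a new cluster."""
    if override is not None:
        # An explicit switch always wins over any guess.
        return override
    if locale_name in ("C", "POSIX"):
        # Assumption taken from the message above: C locale means English.
        return "english"
    language_code = locale_name.split(".", 1)[0].split("_", 1)[0]
    return guesses.get(language_code)  # None: nothing could be guessed
```

Because the result is only an initial value of a table entry or GUC variable, a wrong guess can be corrected later without re-running initdb.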
Alvaro Herrera wrote:
What I was really suggesting was having a table mapping locale names
into "tsearch languages". Then the configuration could be made based on
the language, not on the locale name. So the stopword list is for
"russian", regardless of whether the locale is Russian_Russia or ru_RU.
Agreed. But I'm afraid we won't be able to map all of the locale names
correctly. Man, it's a large list. ;)
Is this only for the stopword list, or does it also affect selecting a
stemmer?
Both.
Note: it's possible that the stopword list is different for brazilian
portuguese than portuguese portuguese, which is why I was suggesting
using a language "portuguese_brazil" and not just "portuguese". Whereas
you need a single stopword list for all the countries speaking spanish,
which is why you need only one language called spanish.
Indeed it's possible for portuguese, because we have some words that are
written in different ways, e.g.,
pt_BR    pt_PT    english
Mônica   Mónica   Monica
ação     acção    action
Irã      Irão     Iran
.
.
.
Will it be possible to disable stemming or stopword removal? I'm asking
this because sometimes stemming doesn't lead to good results and/or
stopwords are relevant. Maybe these could be GUC variables
('enable_stemming' and 'enable_stopwords').
--
Euler Taveira de Oliveira
http://www.timbira.com/
On Sat, 23 Jun 2007, Euler Taveira de Oliveira wrote:
Will it be possible to disable stemming or stopword removal? I'm asking
this because sometimes stemming doesn't lead to good results and/or
stopwords are relevant. Maybe these could be GUC variables
('enable_stemming' and 'enable_stopwords').
Just use another configuration.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
I would be surprised if C locale defaulted to anything except English.
Don't be surprised. The collation mechanism is too simple for
Japanese Kanji, and the locale is not useful for Japanese anyway. That's
why Japanese installations of PostgreSQL tend to use the C locale.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
But why do you need them to be different at all? Just make it
russian Russian_Russia
russian ru_RU
Does that not work for some reason?
I'd like configuration names to be unique. If a user sets the GUC variable or
calls a function with a configuration name, postgres should have no choice
--- it should use exactly the named configuration.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
But why do you need them to be different at all? Just make it
russian Russian_Russia
russian ru_RU
Does that not work for some reason?
I'd like configuration names to be unique. If a user sets the GUC variable or
calls a function with a configuration name, postgres should have no choice
--- it should use exactly the named configuration.
Sure, but the configuration name in this example is "russian", and it's
unique, no?
regards, tom lane