[Fwd: Re: tsearch in core patch]

Started by Noname over 18 years ago, 11 messages
#1 Noname
teodor@sigaev.ru

How would this work for initdb with locale C?

I'm worrying about that too.

english '{en_GB, en_US, C}'

I suppose that a locale name always has a dot separator, except for the
C locale, which is a well-known exception.
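
The mapping idea above can be sketched in a few lines. This is only an illustration, assuming the part before the dot is what gets matched against a list such as '{en_GB, en_US, C}'. The names CONFIG_LOCALES, base_locale and default_config are hypothetical and not part of the patch, and the "simple" fallback is likewise an assumption.

```python
# Hypothetical table in the spirit of: english '{en_GB, en_US, C}'
CONFIG_LOCALES = {
    "english": ["en_GB", "en_US", "C"],
}


def base_locale(locale_name):
    """Drop everything after the dot separator, e.g. 'en_US.UTF-8' -> 'en_US'.

    A name without a dot, such as 'C', is returned unchanged.
    """
    return locale_name.split(".", 1)[0]


def default_config(locale_name, fallback="simple"):
    """Return the configuration whose locale list contains the base locale."""
    base = base_locale(locale_name)
    for config, locales in CONFIG_LOCALES.items():
        if base in locales:
            return config
    return fallback  # assumed fallback when no list mentions the locale


print(default_config("en_US.UTF-8"))
print(default_config("C"))
print(default_config("ja_JP.eucJP"))
```

Note that the sketch sends C to english only because the table says so. It does nothing to answer how a C-locale database holding Japanese text would be recognized.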

#2 Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Noname (#1)
Re: [Fwd: Re: tsearch in core patch]

How would this work for initdb with locale C?

I'm worrying about that too.

english '{en_GB, en_US, C}'

I suppose that a locale name always has a dot separator, except for the
C locale, which is a well-known exception.

So would we have to write the following?

japanese '{ja_JP, C}'

How would we know C -> japanese?

Also, I'm wondering how we could handle text that mixes Japanese and
English. That is very common in Japan.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#3 Euler Taveira de Oliveira
In reply to: Tatsuo Ishii (#2)
Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii wrote:

japanese '{ja_JP, C}'

How would we know C -> japanese?

You can't do that. You can't have different languages (not locales)
mapping to the same 'tsearch language', because the stemmer doesn't know
whether a specific word is in English or Japanese. So you have two
options: (a) disable stemming, or (b) leave the language set to
'japanese' and see if it plays well.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#4 Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Euler Taveira de Oliveira (#3)
Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii wrote:

japanese '{ja_JP, C}'

How would we know C -> japanese?

You can't do that. You can't have different languages (not locales)
mapping to the same 'tsearch language', because the stemmer doesn't know
whether a specific word is in English or Japanese. So you have two
options: (a) disable stemming, or (b) leave the language set to
'japanese' and see if it plays well.

Ok, probably we need to copy the English stemming rule to the one for
Japanese. I think the same thing (English commonly used alongside the
local language) can be applied to Chinese and Korean.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#4)
Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii <ishii@sraoss.co.jp> writes:

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

I think the same thing (English commonly used alongside the local
language) can be applied to Chinese and Korean.

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers. I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.

regards, tom lane

#6 Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Tom Lane (#5)
Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii <ishii@sraoss.co.jp> writes:

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the non-English part of Japanese text.

What I meant was the "chunks of English text" embedded in Japanese text.

I think the same thing (English commonly used alongside the local
language) can be applied to Chinese and Korean.

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers. I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.

For Japanese, it will be fairly simple: words in the 7-bit ASCII range
must be English. (Note that the most commonly used Japanese encodings,
such as EUC, do not allow mixing with ISO 8859.)
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#7 Mike Rylander
mrylander@gmail.com
In reply to: Tom Lane (#5)
Re: [Fwd: Re: tsearch in core patch]

On 6/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers. I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.

While I imagine that is probably true of many projects, if not most, my
project in particular would greatly benefit from the ability to mix
stemmers. I work with complex bibliographic data, which has language
information embedded within records. This is not limited to the record
level either. Individual fields within each bibliographic record can be
in different languages.

Especially in countries where making software multi-lingual (such as
Canada (en_CA/fr_CA)) is a requirement for use in public institutions,
the ability to choose a stemmer and stop-word list at will for any
particular record will actually provide the exact behavior needed.
The obvious generalization from Canada would be to support any mix of
languages supported by tsearch2.

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index. So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.

Thanks for listening, and for all the great work on getting tsearch
into core! :)

--
Mike Rylander
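
The per-record (or per-field) choice of stop-word list described above can be sketched as follows. Everything here is illustrative: the stop-word lists are tiny toy sets, stemming is left out entirely, and STOP_WORDS and index_terms are hypothetical names. In tsearch terms this corresponds roughly to naming an explicit configuration for each piece of text instead of relying on the database default.

```python
# Toy stop-word lists. Real configurations would carry full lists and a
# stemmer per language.
STOP_WORDS = {
    "en": {"the", "of", "and"},
    "fr": {"le", "la", "de", "et"},
}


def index_terms(text, language):
    """Lowercase, split on whitespace, and drop stop words for `language`.

    An unknown language code means no stop words are removed.
    """
    stop_words = STOP_WORDS.get(language, set())
    return [word for word in text.lower().split() if word not in stop_words]


# One bibliographic record whose fields are in different languages.
record = [
    {"language": "en", "text": "The History of Canada"},
    {"language": "fr", "text": "Histoire de la Nouvelle-France"},
]

for field in record:
    print(field["language"], index_terms(field["text"], field["language"]))
```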

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mike Rylander (#7)
Re: [Fwd: Re: tsearch in core patch]

"Mike Rylander" <mrylander@gmail.com> writes:

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index. So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.

Nobody is proposing that --- the issue here is just how we set up the
"default" configuration.

regards, tom lane

#9 Mike Rylander
mrylander@gmail.com
In reply to: Tom Lane (#8)
Re: [Fwd: Re: tsearch in core patch]

On 6/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Mike Rylander" <mrylander@gmail.com> writes:

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index. So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.

Nobody is proposing that --- the issue here is just how we set up the
"default" configuration.

Then I misunderstood. Sorry for the noise, folks.

--
Mike Rylander

#10 Josh Berkus
josh@agliodbs.com
In reply to: Tatsuo Ishii (#6)
Re: [Fwd: Re: tsearch in core patch]

Ishii-san,

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the non-English part of Japanese text.

That reminds me, don't you guys have your own full text search for
Japanese? Planning on merging it with the core code anytime soon?

--Josh

#11 Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Josh Berkus (#10)
Re: [Fwd: Re: tsearch in core patch]

Ishii-san,

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the non-English part of Japanese text.

That reminds me, don't you guys have your own full text search for
Japanese? Planning on merging it with the core code anytime soon?

No. Actually, the non-English part of Japanese text does not need
stemming at all. However, since Japanese is an agglutinative language
written without spaces between words, we have to break a continuous
Japanese string into space-separated "words". For example, we need to
break:

todayisfine

into:

today is fine

(Of course, the English words are only there for the benefit of
non-Japanese speakers; the real text would be Japanese.)

For this we need a good dictionary and software. Fortunately, there are
several open source packages for this purpose. I once wrote a
PostgreSQL C function that invokes one of these packages to do the
work, and it works great with tsearch2.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
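
The word-splitting step in the message above can be illustrated with a greedy longest-match segmenter. This is a toy sketch under stated assumptions, not the C function mentioned in the message: the dictionary is a three-word stand-in, the names segment and TOY_DICTIONARY are hypothetical, and real tools rely on large dictionaries and more careful disambiguation than longest match.

```python
def segment(text, dictionary):
    """Split `text` into words by greedy longest match against `dictionary`.

    Characters that start no dictionary word become one-character tokens.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[pos]  # no match: emit a single character
        words.append(candidate)
        pos += len(candidate)
    return words


TOY_DICTIONARY = {"today", "is", "fine"}
print(" ".join(segment("todayisfine", TOY_DICTIONARY)))  # prints "today is fine"
```

The space-separated output is the form that a whitespace-based parser, such as the one the message pairs with tsearch2, could then consume.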