[Fwd: Re: tsearch in core patch]

Started by Teodor Sigaev about 19 years ago | 11 messages | hackers

Teodor Sigaev

teodor@sigaev.ru

about 19 years ago

How would this work for initdb with locale C?

I'm worrying about that too.

english '{en_GB, en_US, C}'

I suppose that a locale name always has a dot separator, except for the
C locale, which is a well-known exception.
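
To make the proposed mapping concrete, here is a minimal sketch in Python of resolving a locale name to a text search configuration. It assumes the part after the dot is an encoding suffix that can be dropped before the lookup. The table contents and the names LOCALE_MAP, strip_encoding and config_for_locale are hypothetical and are not taken from the tsearch patch.

```python
# Hypothetical table modeled on the line "english '{en_GB, en_US, C}'" above.
# The japanese entry is an assumption added for illustration.
LOCALE_MAP = {
    "english": ["en_GB", "en_US", "C"],
    "japanese": ["ja_JP"],
}


def strip_encoding(locale_name):
    """Drop everything after the first dot: 'en_US.UTF-8' -> 'en_US'.

    A name without a dot, such as 'C', is returned unchanged.
    """
    return locale_name.split(".", 1)[0]


def config_for_locale(locale_name, default=None):
    """Return the first configuration whose locale list contains the name."""
    base = strip_encoding(locale_name)
    for config, locales in LOCALE_MAP.items():
        if base in locales:
            return config
    return default


print(config_for_locale("en_US.UTF-8"))
print(config_for_locale("C"))
print(config_for_locale("ja_JP.eucJP"))
```

Because the lookup returns the first match, a locale such as C can only ever resolve to one configuration, so listing it under a second language would have no effect in this sketch.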

Tatsuo Ishii

ishii@postgresql.org

about 19 years ago

In reply to: Teodor Sigaev (#1)

Re: [Fwd: Re: tsearch in core patch]

How would this work for initdb with locale C?

I'm worrying about that too.

english '{en_GB, en_US, C}'

I suppose that a locale name always has a dot separator, except for the
C locale, which is a well-known exception.

So would we have to write something like this?

japanese '{ja_JP, C}'

How would we know C -> japanese?

Also I'm wondering how we could handle texts including Japanese and
English. It's very common in Japan.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Euler Taveira de Oliveira

euler@timbira.com

about 19 years ago

In reply to: Tatsuo Ishii (#2)

Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii wrote:

japanese '{ja_JP, C}'

How would we know C -> japanese?

You can't do that. You can't have different languages (not locales)
mapping to the same 'tsearch language' because the stemmer doesn't know
that a specific word is in english or japanese. So you have two options:
(a) disable stemming (b) leave the language set to 'japanese' and see if
it plays well.

--
Euler Taveira de Oliveira
http://www.timbira.com/

Tatsuo Ishii

ishii@postgresql.org

about 19 years ago

In reply to: Euler Taveira de Oliveira (#3)

Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii wrote:

japanese '{ja_JP, C}'

How would we know C -> japanese?

You can't do that. You can't have different languages (not locales)
mapping to the same 'tsearch language' because the stemmer doesn't know
that a specific word is in english or japanese. So you have two options:
(a) disable stemming (b) leave the language set to 'japanese' and see if
it plays well.

Ok, probably we need to copy the English stemming rule to the one for
Japanese. I think the same thing (English commonly mixed with the local
language) applies to Chinese and Korean.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Tom Lane

tgl@sss.pgh.pa.us

about 19 years ago

In reply to: Tatsuo Ishii (#4)

Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii <ishii@sraoss.co.jp> writes:

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

I think the same thing (English commonly mixed with the local
language) applies to Chinese and Korean.

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers. I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.

regards, tom lane

Tatsuo Ishii

ishii@postgresql.org

about 19 years ago

In reply to: Tom Lane (#5)

Re: [Fwd: Re: tsearch in core patch]

Tatsuo Ishii <ishii@sraoss.co.jp> writes:

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the "non-English" part of Japanese text.

What I meant was the "chunks of English text" in Japanese.

I think the same thing (English commonly mixed with the local
language) applies to Chinese and Korean.

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers. I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.

For Japanese, it will be fairly simple: words in the 7-bit ASCII range
must be English (note that the most commonly used Japanese encodings,
such as EUC, do not allow mixing with ISO 8859).
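
The 7-bit ASCII rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is already split into tokens, and toy_english_stem is a hypothetical placeholder that strips a few suffixes. It is not the Snowball English stemmer.

```python
def is_ascii_word(token):
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in token)


def toy_english_stem(word):
    """Placeholder stemmer (an assumption for this sketch, not Snowball)."""
    w = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip the suffix when a reasonably long stem remains.
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w


def normalize(tokens):
    """Stem ASCII tokens as English; pass all other tokens through unchanged."""
    return [toy_english_stem(t) if is_ascii_word(t) else t for t in tokens]


print(normalize(["PostgreSQL", "の", "indexes", "検索"]))
```

The routing decision needs no language detection beyond the character range, which is what makes the Japanese case simpler than mixing, say, English and French.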
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Mike Rylander

mrylander@gmail.com

about 19 years ago

In reply to: Tom Lane (#5)

Re: [Fwd: Re: tsearch in core patch]

On 6/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, it's not hard at all to find chunks of English text that have
embedded bits of French, Spanish, or what-have-you, but that's not an
argument for trying to intermix the stemmers. I doubt that such simple
bits of program could tell the language difference well enough to
determine which stemming rules to apply.

While I imagine that is probably true of many projects, if not most, my
project in particular would greatly benefit from the ability to mix
stemmers. I work with complex bibliographic data, which has language
information embedded within records. This is not limited to the record
level either. Individual fields within each bibliographic record can be
in different languages.

Especially in countries where making software multi-lingual (such as
Canada (en_CA/fr_CA)) is a requirement for use in public institutions,
the ability to choose a stemmer and stop-word list at will for any
particular record will actually provide the exact behavior needed.
The obvious generalization from Canada would be to support any mix of
languages supported by tsearch2.

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index. So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.
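
As a rough illustration of choosing a stop-word list per field, here is a minimal Python sketch. The stop-word sets, the record layout (a list of language/text pairs) and the function names are hypothetical, and stemming is left out to keep the example short.

```python
# Hypothetical, deliberately tiny stop-word lists keyed by language code.
STOP_WORDS = {
    "en": {"the", "of", "and"},
    "fr": {"le", "la", "de", "et"},
}


def index_terms(text, lang):
    """Lowercase, split on whitespace, and drop that language's stop words."""
    stops = STOP_WORDS.get(lang, set())
    return [w for w in text.lower().split() if w not in stops]


def index_record(fields):
    """Build one term list from fields that each carry their own language."""
    terms = []
    for lang, text in fields:
        terms.extend(index_terms(text, lang))
    return terms


record = [
    ("en", "The History of Canada"),
    ("fr", "Histoire de la Nouvelle-France"),
]
print(index_record(record))
```

The point of the sketch is that the language travels with each field rather than with the database locale, so one index can hold terms produced under different rules.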

Thanks for listening (and for all the great work on getting tsearch
into core!) :)

--
Mike Rylander

Tom Lane

tgl@sss.pgh.pa.us

about 19 years ago

In reply to: Mike Rylander (#7)

Re: [Fwd: Re: tsearch in core patch]

"Mike Rylander" <mrylander@gmail.com> writes:

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index. So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.

Nobody is proposing that --- the issue here is just how we set up the
"default" configuration.

regards, tom lane

Mike Rylander

mrylander@gmail.com

about 19 years ago

In reply to: Tom Lane (#8)

Re: [Fwd: Re: tsearch in core patch]

On 6/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Mike Rylander" <mrylander@gmail.com> writes:

I can certainly understand the benefit of making the default
configuration a simple locale to language map, but there are
definitely uses for searching using different stemmers/stop-lists even
within the same corpus/index. So, as a datapoint for the discussion,
I would ask that the option of multiple languages per DB locale not be
removed if it can be at all avoided.

Nobody is proposing that --- the issue here is just how we set up the
"default" configuration.

Then I misunderstood. Sorry for the noise, folks.

--
Mike Rylander

#10

Josh Berkus

josh@agliodbs.com

about 19 years ago

In reply to: Tatsuo Ishii (#6)

Re: [Fwd: Re: tsearch in core patch]

Ishii-san,

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the "non-English" part of Japanese text.

That reminds me, don't you guys have your own full text search for
Japanese? Planning on merging it with the core code anytime soon?

--Josh

#11

Tatsuo Ishii

ishii@postgresql.org

about 19 years ago

In reply to: Josh Berkus (#10)

Re: [Fwd: Re: tsearch in core patch]

Ishii-san,

Ok, probably we need to copy the English stemming rule to the one for
Japanese.

Pardon my ignorance here, but is the concept of stemming even relevant
to Japanese/Chinese/Korean? What little I know about ideographic
languages suggests it wouldn't work well. And surely the specific rules
in the Snowball project's English stemmer wouldn't work.

Your understanding is correct. The English stemmer would not work for
the "non-English" part of Japanese text.

That reminds me, don't you guys have your own full text search for
Japanese? Planning on merging it with the core code anytime soon?

No. Actually, Japanese (the non-English part) does not need stemming at
all. However, since Japanese is an agglutinative language, we have to
break a continuous Japanese string into space-separated "words". For
example, we need to break:

todayisfine

into:

today is fine

(of course the English words are only there so that non-Japanese
speakers can follow; the actual text would be Japanese).

For this we need a good dictionary and software. Fortunately there are
several open source packages for this purpose. I once wrote a
PostgreSQL C function that invokes one of them to do the work, and it
works great with tsearch2.
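
The word-breaking step can be illustrated with a minimal Python sketch that uses greedy longest-match lookup against a dictionary. This is a swapped-in toy technique chosen for brevity, not the method of the software mentioned above, and real segmenters use much richer dictionaries and models. The function name segment and the three-word dictionary are hypothetical.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest-match dictionary lookup.

    At each position the longest dictionary word that matches is taken.
    If nothing matches, a single character is emitted so the scan always
    advances.
    """
    max_len = max((len(w) for w in dictionary), default=0)
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                words.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: fall back to one character.
            words.append(text[i])
            i += 1
    return words


print(" ".join(segment("todayisfine", {"today", "is", "fine"})))
```

The space-separated output is the form a whitespace-based parser can then consume, which is why a segmentation pass in front of the indexer is enough for this case.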
--
Tatsuo Ishii
SRA OSS, Inc. Japan