tsearch2: enable non ascii stop words with C locale

Started by Tatsuo Ishii almost 19 years ago, 6 messages
#1Tatsuo Ishii
ishii@postgresql.org
1 attachment(s)

Hi,

Currently tsearch2 does not accept non-ASCII stop words if the locale is
C. The attached patch should fix the problem. The patch is against
PostgreSQL 8.2.3.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Attachments:

tsearch2.patch (text/plain; charset=us-ascii)
*** wordparser/parser.c~	2007-01-16 00:16:11.000000000 +0900
--- wordparser/parser.c	2007-02-10 18:04:59.000000000 +0900
***************
*** 246,251 ****
--- 246,266 ----
  static int
  p_islatin(TParser * prs)
  {
+ 	if (prs->usewide)
+ 	{
+ 		if (lc_ctype_is_c())
+ 		{
+ 			unsigned int c = *(unsigned int*)(prs->wstr + prs->state->poschar);
+ 
+ 			/*
+ 			 * any non-ascii symbol with multibyte encoding
+ 			 * with C-locale is a latin character
+ 			 */
+ 			if ( c > 0x7f )
+ 				return 1;
+ 		}
+ 	}
+ 
  	return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
  }
  
#2Teodor Sigaev
teodor@sigaev.ru
In reply to: Tatsuo Ishii (#1)
Re: tsearch2: enable non ascii stop words with C locale

> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.

I'm not sure the patch's description is correct.

First, the p_islatin() function is used only in the word/lexeme parser, not in
the stop-word code. Second, p_islatin() is used for catching lexemes such as
URLs or HTML entities, so it is important that it identifies real Latin
characters. And it works correctly: it calls p_isalpha (already patched for
your case), then it calls p_isascii, which should be correct for any encoding
with the C locale.

Third (and last):
contrib_regression=# show server_encoding;
server_encoding
-----------------
UTF8
contrib_regression=# show lc_ctype;
lc_ctype
----------
C
contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
lexize
--------
{}

Russian characters in UTF8 take two bytes each.
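
To make the two-byte remark concrete, here is a minimal sketch of the standard UTF-8 length rule. It is not tsearch2 code, and utf8_encoded_len is a hypothetical helper name chosen for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of tsearch2: number of bytes UTF-8 needs
 * to encode one Unicode code point, or 0 if the value cannot be encoded.
 */
int
utf8_encoded_len(uint32_t cp)
{
	if (cp <= 0x7F)
		return 1;		/* ASCII */
	if (cp <= 0x7FF)
		return 2;		/* includes Cyrillic, U+0400..U+04FF */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;		/* surrogates are not valid code points */
	if (cp <= 0xFFFF)
		return 3;		/* includes hiragana and katakana */
	if (cp <= 0x10FFFF)
		return 4;
	return 0;			/* outside the Unicode range */
}
```

A Cyrillic letter such as U+0436 falls in the two-byte range, while Japanese kana such as U+3042 need three bytes. Either way the character is non-ASCII and multibyte.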

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#3Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Teodor Sigaev (#2)
Re: tsearch2: enable non ascii stop words with C locale

>> Currently tsearch2 does not accept non ascii stop words if locale is
>> C. Included patches should fix the problem. Patches against PostgreSQL
>> 8.2.3.
>
> I'm not sure about correctness of patch's description.
>
> First, p_islatin() function is used only in words/lexemes parser, not stop-word
> code.

I know. My guess is that the parser does not read the stop word file, at
least with the default configuration.

> Second, p_islatin() function is used for catching lexemes like URL or HTML
> entities, so, it's important to define real latin characters. And it works
> right: it calls p_isalpha (already patched for your case), then it calls
> p_isascii which should be correct for any encodings with C-locale.

The original p_islatin is defined as follows:

static int
p_islatin(TParser * prs)
{
	return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
}

So if a character is not ASCII, it returns 0 even if p_isalpha returns
1. Is this what you expect?
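
The difference under discussion can be sketched on plain code points. This is a simplified model, not the tsearch2 source: model_isalpha assumes (following the remark that p_isalpha is already patched) that every non-ASCII character counts as alphabetic under the C locale, and all function names here are hypothetical.

```c
#include <ctype.h>

/*
 * Assumed model of p_isalpha under the C locale with a multibyte
 * encoding: any non-ASCII character is treated as alphabetic.
 */
int
model_isalpha(unsigned int c)
{
	return (c > 0x7f) ? 1 : (isalpha((int) c) != 0);
}

/* ASCII test on a code point. */
int
model_isascii(unsigned int c)
{
	return c <= 0x7f;
}

/* Original predicate: alphabetic AND ASCII. */
int
islatin_original(unsigned int c)
{
	return (model_isalpha(c) && model_isascii(c)) ? 1 : 0;
}

/* Patched predicate: any non-ASCII character is accepted up front. */
int
islatin_patched(unsigned int c)
{
	if (c > 0x7f)
		return 1;
	return islatin_original(c);
}
```

Under these assumptions, a character such as U+3042 is rejected by islatin_original and accepted by islatin_patched, while ASCII letters and digits are classified the same way by both.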

> Third (and last):
> contrib_regression=# show server_encoding;
> server_encoding
> -----------------
> UTF8
> contrib_regression=# show lc_ctype;
> lc_ctype
> ----------
> C
> contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
> lexize
> --------
> {}
>
> Russian characters with UTF8 take two bytes.

In our case, we added JAPANESE_STOP_WORD to english.stop and then ran:

select to_tsvector(JAPANESE_STOP_WORD)

which returns the words even though they are in english.stop.

With the patch applied, the problem was solved.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#4Teodor Sigaev
teodor@sigaev.ru
In reply to: Tatsuo Ishii (#3)
Re: tsearch2: enable non ascii stop words with C locale

> I know. My guess is the parser does not read the stop word file at
> least with default configuration.

The parser should not read the stop word file: that is the job of the
dictionaries.

> So if a character is not ASCII, it returns 0 even if p_isalpha returns
> 1. Is this what you expect?

No, p_islatin should return true only for Latin characters, not for national ones.

> In our case, we added JAPANESE_STOP_WORD into english.stop then:
> select to_tsvector(JAPANESE_STOP_WORD)
> which returns words even they are in JAPANESE_STOP_WORD.
> And with the patches the problem was solved.

Please show your configuration for lexemes/dictionaries. I suspect that you have
the en_stem dictionary enabled for the lword lexeme type. A better way is to use
the 'simple' dictionary (it supports stop words the same way en_stem does) and
set it for the nlword, word, part_hword, nlpart_hword, hword, and nlhword lexeme
types. Note: leave en_stem unchanged for any Latin word.
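
The routing described above can be illustrated with a toy model. It is only a sketch under stated assumptions: the real parser is a state machine with many more lexeme types, the classification here is reduced to ASCII versus non-ASCII code points, and TokType, classify_word and dict_for are hypothetical names. The mapping in dict_for reflects the suggested configuration, not necessarily the default one.

```c
#include <stddef.h>

/* Hypothetical subset of the lexeme types mentioned above. */
typedef enum
{
	TOK_LWORD,		/* word made of ASCII characters only */
	TOK_NLWORD,		/* word made of non-ASCII characters only */
	TOK_WORD		/* mixed (or empty) word */
} TokType;

/* Classify a word given as an array of n code points. */
TokType
classify_word(const unsigned int *cps, size_t n)
{
	int		has_ascii = 0;
	int		has_other = 0;
	size_t	i;

	for (i = 0; i < n; i++)
	{
		if (cps[i] <= 0x7f)
			has_ascii = 1;
		else
			has_other = 1;
	}

	if (has_ascii && !has_other)
		return TOK_LWORD;
	if (has_other && !has_ascii)
		return TOK_NLWORD;
	return TOK_WORD;
}

/* Suggested mapping: en_stem for lword, 'simple' for everything else. */
const char *
dict_for(TokType t)
{
	return (t == TOK_LWORD) ? "en_stem" : "simple";
}
```

In this model a purely Japanese word is classified as nlword and never reaches en_stem, so a stop word list attached only to en_stem would not be consulted for it. Whether that matches the reported setup depends on the actual configuration, which is what is being asked for here.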

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#5Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Teodor Sigaev (#4)
Re: tsearch2: enable non ascii stop words with C locale

>> I know. My guess is the parser does not read the stop word file at
>> least with default configuration.
>
> Parser should not read stopword file: its deal for dictionaries.

I'll come up with more detailed info explaining why the stop word file is
not read.

>> So if a character is not ASCII, it returns 0 even if p_isalpha returns
>> 1. Is this what you expect?
>
> No, p_islatin should return true only for latin characters, not for national ones.

Please give a precise definition of "latin" in the C locale. Are you saying
that a single-byte encoding with the range 0-7f is "latin"? If so, it seems to
be exactly the same as ASCII.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#6Teodor Sigaev
teodor@sigaev.ru
In reply to: Tatsuo Ishii (#5)
Re: tsearch2: enable non ascii stop words with C locale

> Precise definition for "latin" in C locale please. Are you saying that
> single byte encoding with range 0-7f is "latin"? If so, it seems they
> are exactly same as ASCII.

p_islatin returns true for ASCII alpha characters.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/