Extending range of to_tsvector et al

Started by johnkn63, over 13 years ago; 8 messages
#1johnkn63
john.knightley@gmail.com

When using to_tsvector a number of newer unicode characters and pua
characters are not included. How do I add the characters which I desire to
be found?

Regards
John

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Extending-range-of-to-tsvector-et-al-tp5726041.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

#2Dan Scott
denials@gmail.com
In reply to: johnkn63 (#1)
Re: Extending range of to_tsvector et al

On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john.knightley@gmail.com> wrote:

When using to_tsvector a number of newer unicode characters and pua
characters are not included. How do I add the characters which I desire to
be found?

I've just started digging into this code a bit, but from what I've
found src/backend/tsearch/wparser_def.c defines much of the parser
functionality, and in the area of Unicode includes a number of
comments like:

* with multibyte encoding and C-locale isw* function may fail or give
wrong result.
* multibyte encoding and C-locale often are used for Asian languages.
* any non-ascii symbol with multibyte encoding with C-locale is an
alpha character

... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
:)

Also note that src/test/regress/sql/tsearch.sql and
regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your
environment looks like (OS, PostgreSQL version, encoding and locale
settings for the database) and show some sample to_tsquery() @@
to_tsvector() queries that don't behave the way you think they should
behave - and we could start building some test cases as a first step?

--
Dan Scott
Laurentian University

#3john knightley
john.knightley@gmail.com
In reply to: Dan Scott (#2)
Re: Extending range of to_tsvector et al

Dear Dan,

thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
a utf8 locale.

A short five-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

line 1 "raeuz": a Zhuang word written using English letters; shows up
under to_tsvector OK
line 2 "我们": an everyday Chinese word; shows up under to_tsvector OK
line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
found in Unicode 3.1, which came in about the year 2000; shows up
under to_tsvector OK
line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
found in Unicode 5.2, which came in about 2009; does not show
up under to_tsvector
line 5 "󶒘󴮬": a Zhuang word written using rather old Chinese characters
found in the PUA area of the font Sawndip.ttf; does not show up under
to_tsvector (the font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though included in a dictionary, are not
accepted by to_tsvector.
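For reference, the Unicode character database itself classifies all of the character types above consistently; the gap appears below that level. A quick check in Python sketches the general categories involved (the supplementary-PUA code point U+F0000 is a stand-in assumption, since the exact PUA code points of the Sawndip characters are not reproduced here):

```python
import unicodedata

# Stand-in code points for the five test words above.
samples = [
    ("U+0072 'r' (ASCII letter)", "\u0072"),
    ("U+6211 (everyday CJK)", "\u6211"),
    ("U+2662D (CJK Ext. B, Unicode 3.1)", "\U0002662D"),
    ("U+2B582 (CJK Ext. C, Unicode 5.2)", "\U0002B582"),
    ("U+F0000 (Supplementary PUA-A)", "\U000F0000"),
]
for label, ch in samples:
    # Categories starting with 'L' are letters; 'Co' is private use.
    print(f"{label}: {unicodedata.category(ch)}")
```

All four assigned characters report a letter category ('Ll' or 'Lo'), while the PUA code point reports 'Co', so any layer that drops the Extension C character is working from older data than the Unicode database itself.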

Regards
John


On Mon, Oct 1, 2012 at 11:04 AM, Dan Scott <denials@gmail.com> wrote:

On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john.knightley@gmail.com> wrote:

When using to_tsvector a number of newer unicode characters and pua
characters are not included. How do I add the characters which I desire to
be found?

I've just started digging into this code a bit, but from what I've
found src/backend/tsearch/wparser_def.c defines much of the parser
functionality, and in the area of Unicode includes a number of
comments like:

* with multibyte encoding and C-locale isw* function may fail or give
wrong result.
* multibyte encoding and C-locale often are used for Asian languages.
* any non-ascii symbol with multibyte encoding with C-locale is an
alpha character

... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
WCSTOMBS and TOWLOWER are available) to complicate testing scenarios
:)

Also note that src/test/regress/sql/tsearch.sql and
regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your
environment looks like (OS, PostgreSQL version, encoding and locale
settings for the database) and show some sample to_tsquery() @@
to_tsvector() queries that don't behave the way you think they should
behave - and we could start building some test cases as a first step?

--
Dan Scott
Laurentian University

#4Dan Scott
denials@gmail.com
In reply to: john knightley (#3)
Re: Extending range of to_tsvector et al

Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley
<john.knightley@gmail.com> wrote:

Dear Dan,

thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
a utf8 locale.

A short five-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

line 1 "raeuz": a Zhuang word written using English letters; shows up
under to_tsvector OK
line 2 "我们": an everyday Chinese word; shows up under to_tsvector OK
line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
found in Unicode 3.1, which came in about the year 2000; shows up
under to_tsvector OK
line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
found in Unicode 5.2, which came in about 2009; does not show
up under to_tsvector
line 5 "󶒘󴮬": a Zhuang word written using rather old Chinese characters
found in the PUA area of the font Sawndip.ttf; does not show up under
to_tsvector (the font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though included in a dictionary, are not
accepted by to_tsvector.

Hmm. On Fedora 17 x86-64 with PostgreSQL 9.1.5 here, it seems to
work using the default text search configuration (albeit with one
crucial note: I created the database with the lc_ctype=C and
lc_collate=C options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
foobar=# select ts_debug('󶒘󴮬');
ts_debug
----------------------------------------------------------------
(word,"Word, all letters",󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
lc_ctype
-------------
en_US.UTF-8
(1 row)

foobaz=# select ts_debug('󶒘󴮬');
ts_debug
---------------------------------
(blank,"Space symbols",󶒘󴮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: john knightley (#3)
Re: Extending range of to_tsvector et al

john knightley <john.knightley@gmail.com> writes:

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
a utf8 locale.

A short five-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

line 1 "raeuz": a Zhuang word written using English letters; shows up
under to_tsvector OK
line 2 "我们": an everyday Chinese word; shows up under to_tsvector OK
line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
found in Unicode 3.1, which came in about the year 2000; shows up
under to_tsvector OK
line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
found in Unicode 5.2, which came in about 2009; does not show
up under to_tsvector
line 5 "󶒘󴮬": a Zhuang word written using rather old Chinese characters
found in the PUA area of the font Sawndip.ttf; does not show up under
to_tsvector (the font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

AFAIK there is nothing in Postgres itself that would distinguish, say,
𦘭 from 𪽖. I think this must be down to
your platform's locale definition: it probably thinks that the former is
a letter and the latter is not. You'd have to gripe to the locale
maintainers to get that fixed.
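Tom's point can be checked against the C library, which is the layer PostgreSQL's parser consults (via the isw* functions) when built with WIDE_UPPER_LOWER. A minimal sketch using ctypes, assuming a Linux/glibc-style platform (the LC_CTYPE constant and libc name are platform assumptions, not portable values):

```python
import ctypes
import ctypes.util

# Load the platform C library; "libc.so.6" is a glibc fallback assumption.
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
LC_CTYPE = 0  # glibc's value for the LC_CTYPE category; other platforms differ

# In the portable "C" locale, only ASCII letters are guaranteed alphabetic,
# so iswalpha() can answer differently than it would under en_US.UTF-8.
libc.setlocale(LC_CTYPE, b"C")
print("'a' alphabetic:", libc.iswalpha(ord("a")) != 0)
print("U+6211 alphabetic:", libc.iswalpha(0x6211) != 0)
```

Under a UTF-8 locale on the same system, the second call would typically report true for characters the locale data knows about, but still false for code points added after that locale data was built, which matches the behavior reported in this thread.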

regards, tom lane

#6john knightley
john.knightley@gmail.com
In reply to: Tom Lane (#5)
Re: Extending range of to_tsvector et al

On Mon, Oct 1, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

john knightley <john.knightley@gmail.com> writes:

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
a utf8 locale.

A short five-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

line 1 "raeuz": a Zhuang word written using English letters; shows up
under to_tsvector OK
line 2 "我们": an everyday Chinese word; shows up under to_tsvector OK
line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
found in Unicode 3.1, which came in about the year 2000; shows up
under to_tsvector OK
line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
found in Unicode 5.2, which came in about 2009; does not show
up under to_tsvector
line 5 "󶒘󴮬": a Zhuang word written using rather old Chinese characters
found in the PUA area of the font Sawndip.ttf; does not show up under
to_tsvector (the font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

AFAIK there is nothing in Postgres itself that would distinguish, say,
𦘭 from 𪽖. I think this must be down to
your platform's locale definition: it probably thinks that the former is
a letter and the latter is not. You'd have to gripe to the locale
maintainers to get that fixed.

regards, tom lane

PostgreSQL in general does not distinguish them, but full text search does:

select ts_debug('𦘭 from 𪽖');

gives the result:

ts_debug
-------------------------------------------------------------------
(word,"Word, all letters",𦘭,{english_stem},english_stem,{𦘭})
(blank,"Space symbols"," ",{},,)
(asciiword,"Word, all ASCII",from,{english_stem},english_stem,{})
(blank,"Space symbols"," 𪽖",{},,)
(4 rows)

Somewhere there is a dictionary or library based on Unicode 4.0, which
includes "𦘭" (U+2662D) but not "𫖂" (U+2B582), which was only added in
Unicode 5.2.

PUA characters are likewise dropped by the full text search, which is
what Google does but is not what I wish to do.

Regards
John

#7john knightley
john.knightley@gmail.com
In reply to: Dan Scott (#4)
Re: Extending range of to_tsvector et al

On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials@gmail.com> wrote:

Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley
<john.knightley@gmail.com> wrote:

Dear Dan,

thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed with
a utf8 locale.

A short five-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂
󶒘󴮬

line 1 "raeuz": a Zhuang word written using English letters; shows up
under to_tsvector OK
line 2 "我们": an everyday Chinese word; shows up under to_tsvector OK
line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
found in Unicode 3.1, which came in about the year 2000; shows up
under to_tsvector OK
line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
found in Unicode 5.2, which came in about 2009; does not show
up under to_tsvector
line 5 "󶒘󴮬": a Zhuang word written using rather old Chinese characters
found in the PUA area of the font Sawndip.ttf; does not show up under
to_tsvector (the font can be downloaded from
http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though included in a dictionary, are not
accepted by to_tsvector.

Hmm. On Fedora 17 x86-64 with PostgreSQL 9.1.5 here, it seems to
work using the default text search configuration (albeit with one
crucial note: I created the database with the lc_ctype=C and
lc_collate=C options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
foobar=# select ts_debug('󶒘󴮬');
ts_debug
----------------------------------------------------------------
(word,"Word, all letters",󶒘󴮬,{english_stem},english_stem,{󶒘󴮬})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
lc_ctype
-------------
en_US.UTF-8
(1 row)

foobaz=# select ts_debug('󶒘󴮬');
ts_debug
---------------------------------
(blank,"Space symbols",󶒘󴮬,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE=C would not be a workaround - this database needs to be in
utf8, and the full text search is to be used for a MediaWiki. Is this a
bug that is being worked on?

Regards
John

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: john knightley (#7)
Re: Extending range of to_tsvector et al

john knightley <john.knightley@gmail.com> writes:

On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials@gmail.com> wrote:

So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE=C would not be a workaround - this database needs to be in
utf8, and the full text search is to be used for a MediaWiki.

You're confusing locale and encoding. They are different things.

Is this a bug that is being worked on?

No. As I already tried to explain to you, this behavior is not
determined by Postgres, it's determined by the platform's locale
support. You need to complain to your OS vendor.

regards, tom lane