access to lexems or access to parsed elements

Started by Massa, Harald Arminover 14 years ago6 messagesgeneral
Jump to latest
#1Massa, Harald Armin
harald@2ndQuadrant.de

I want to access the single words in a text. Better yet: the relevant words
(i.e. without stop words) in a text.

to_tsvector or casting gets me the lexems as a tsvector:

select to_tsvector('the quick brown fox jumped over the lazy fox')
''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2'

And I would like to access "brown", "fox", "jump", "lazi" and "quick" as
single values that I insert into another table.

But: no luck with any tries to convert to records, arrays or similiar.

Next step, the lesser-known-fts-functions:

select ts_parse('default','the quick brown fox jumped over the lazy fox')

(1,the)
(12," ")
(1,quick)
[...]
(1,fox)

is a set-returning-function, giving me 17 records of type pseudo-record.
Stopwords still in there, so what. But: No chance of accessing the second
field in that record.

Of course, there is allways:

select substr(what::text,position(',' in
what::text)+1,char_length(what::text)-position(',' in what::text)-1) from
(
select ts_parse('default','the quick brown fox jumped over the lazy fox') as
what
)x

but, comeon: having a two-field-record, casting it to one field of text,
searching for the "," that separates the two fields and then split the
one-field into two fields by substring?

So, is there a better way to access

a) the lexems of a tsvector
b) the (unnamed) fields of a set-of-record-returning function

?
Harald

--
Harald Armin Massa www.2ndQuadrant.d <http://www.2ndquadrant.com/&gt;e
PostgreSQL Training, Services and Support

2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399

#2Sushant Sinha
sushant354@gmail.com
In reply to: Massa, Harald Armin (#1)
Re: access to lexems or access to parsed elements

Can this fit?

select plainto_tsquery('english', 'the quick brown fox jumped over the
lazy fox');
plainto_tsquery
-----------------------------------------------------
'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox'

-Sushant.

Show quoted text

On Thu, 2011-08-25 at 18:21 +0200, Massa, Harald Armin wrote:

I want to access the single words in a text. Better yet: the relevant
words (i.e. without stop words) in a text.

to_tsvector or casting gets me the lexems as a tsvector:

select to_tsvector('the quick brown fox jumped over the lazy fox')
''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2'

And I would like to access "brown", "fox", "jump", "lazi" and "quick"
as single values that I insert into another table.

But: no luck with any tries to convert to records, arrays or similiar.

Next step, the lesser-known-fts-functions:

select ts_parse('default','the quick brown fox jumped over the lazy
fox')

(1,the)
(12," ")
(1,quick)
[...]
(1,fox)

is a set-returning-function, giving me 17 records of type
pseudo-record. Stopwords still in there, so what. But: No chance of
accessing the second field in that record.

Of course, there is allways:

select substr(what::text,position(',' in
what::text)+1,char_length(what::text)-position(',' in what::text)-1)
from
(
select ts_parse('default','the quick brown fox jumped over the lazy
fox') as what
)x

but, comeon: having a two-field-record, casting it to one field of
text, searching for the "," that separates the two fields and then
split the one-field into two fields by substring?

So, is there a better way to access

a) the lexems of a tsvector
b) the (unnamed) fields of a set-of-record-returning function

?
Harald

--
Harald Armin Massa www.2ndQuadrant.de
PostgreSQL Training, Services and Support

2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399

#3Ivan Sergio Borgonovo
mail@webthatworks.it
In reply to: Massa, Harald Armin (#1)
Re: access to lexems or access to parsed elements

On Thu, 25 Aug 2011 18:21:21 +0200
"Massa, Harald Armin" <harald@2ndQuadrant.de> wrote:

I want to access the single words in a text. Better yet: the
relevant words (i.e. without stop words) in a text.

to_tsvector or casting gets me the lexems as a tsvector:

I wrote this piece of C code more than a year ago. [1]http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables
It has been working in a production environment for quite a lot.

It just works with versions < 8.4 since it doesn't support *.
I'd be willing to maintain the module or even expand its
features and release it on any license that will please postgresql
community if my effort would actually make it more easily available
to other people but throwing it on the internet won't be enough.

If someone think it is worth the effort to help me understand how
that could happen I'd be glad to learn.

[1]: http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables
http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables

--
Ivan Sergio Borgonovo
http://www.webthatworks.it

#4Massa, Harald Armin
harald@2ndQuadrant.de
In reply to: Sushant Sinha (#2)
Re: access to lexems or access to parsed elements

Sushant,

Can this fit?

select plainto_tsquery('english', 'the quick brown fox jumped over the
lazy fox');
plainto_tsquery
-----------------------------------------------------
'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox'

no, this cannot fit. This just adds a third variation, this time a tsquery
object. There is no way to access the single words in that ts_query
programmatically (besides the ugly "cast to text and split at &')

Harald

--
Harald Armin Massa www.2ndQuadrant.d <http://www.2ndquadrant.com/&gt;e
PostgreSQL Training, Services and Support

2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399

#5marcin mank
marcin.mank@gmail.com
In reply to: Massa, Harald Armin (#1)
Re: access to lexems or access to parsed elements

a) the lexems of a tsvector

ts_debug (it`s a plain sql function, may give You some inspiration for
Your own queries)

b) the (unnamed) fields of a set-of-record-returning function

select * from ts_parse('default','the quick brown fox jumped over the lazy fox')
or
select (ts_parse('default','the quick brown fox jumped over the lazy
fox')).token

Greetings
Marcin Mańk

#6Daniel Verite
daniel@manitou-mail.org
In reply to: Massa, Harald Armin (#1)
Re: access to lexems or access to parsed elements

Massa, Harald Armin wrote:

select ts_parse('default','the quick brown fox jumped over the lazy fox')

(1,the)
(12," ")
(1,quick)
[...]
(1,fox)

is a set-returning-function, giving me 17 records of type pseudo-record.
Stopwords still in there, so what. But: No chance of accessing the second
field in that record.

What about:
select w from ts_parse('default','the quick brown fox jumped over the lazy
fox')
as lexems(i,w);

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org