access to lexems or access to parsed elements
I want to access the single words in a text. Better yet: the relevant words
(i.e. without stop words) in a text.
to_tsvector or casting gets me the lexems as a tsvector:
select to_tsvector('the quick brown fox jumped over the lazy fox')
''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2'
And I would like to access "brown", "fox", "jump", "lazi" and "quick" as
single values that I insert into another table.
But: no luck with any tries to convert to records, arrays or similiar.
Next step, the lesser-known-fts-functions:
select ts_parse('default','the quick brown fox jumped over the lazy fox')
(1,the)
(12," ")
(1,quick)
[...]
(1,fox)
is a set-returning-function, giving me 17 records of type pseudo-record.
Stopwords still in there, so what. But: No chance of accessing the second
field in that record.
Of course, there is allways:
select substr(what::text,position(',' in
what::text)+1,char_length(what::text)-position(',' in what::text)-1) from
(
select ts_parse('default','the quick brown fox jumped over the lazy fox') as
what
)x
but, comeon: having a two-field-record, casting it to one field of text,
searching for the "," that separates the two fields and then split the
one-field into two fields by substring?
So, is there a better way to access
a) the lexems of a tsvector
b) the (unnamed) fields of a set-of-record-returning function
?
Harald
--
Harald Armin Massa www.2ndQuadrant.d <http://www.2ndquadrant.com/>e
PostgreSQL Training, Services and Support
2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399
Can this fit?
select plainto_tsquery('english', 'the quick brown fox jumped over the
lazy fox');
plainto_tsquery
-----------------------------------------------------
'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox'
-Sushant.
Show quoted text
On Thu, 2011-08-25 at 18:21 +0200, Massa, Harald Armin wrote:
I want to access the single words in a text. Better yet: the relevant
words (i.e. without stop words) in a text.to_tsvector or casting gets me the lexems as a tsvector:
select to_tsvector('the quick brown fox jumped over the lazy fox')
''brown':3 'fox':4,9 'jump':5 'lazi':8 'quick':2'And I would like to access "brown", "fox", "jump", "lazi" and "quick"
as single values that I insert into another table.But: no luck with any tries to convert to records, arrays or similiar.
Next step, the lesser-known-fts-functions:
select ts_parse('default','the quick brown fox jumped over the lazy
fox')(1,the)
(12," ")
(1,quick)
[...]
(1,fox)is a set-returning-function, giving me 17 records of type
pseudo-record. Stopwords still in there, so what. But: No chance of
accessing the second field in that record.Of course, there is allways:
select substr(what::text,position(',' in
what::text)+1,char_length(what::text)-position(',' in what::text)-1)
from
(
select ts_parse('default','the quick brown fox jumped over the lazy
fox') as what
)xbut, comeon: having a two-field-record, casting it to one field of
text, searching for the "," that separates the two fields and then
split the one-field into two fields by substring?So, is there a better way to access
a) the lexems of a tsvector
b) the (unnamed) fields of a set-of-record-returning function?
Harald--
Harald Armin Massa www.2ndQuadrant.de
PostgreSQL Training, Services and Support2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399
On Thu, 25 Aug 2011 18:21:21 +0200
"Massa, Harald Armin" <harald@2ndQuadrant.de> wrote:
I want to access the single words in a text. Better yet: the
relevant words (i.e. without stop words) in a text.to_tsvector or casting gets me the lexems as a tsvector:
I wrote this piece of C code more than a year ago. [1]http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables
It has been working in a production environment for quite a lot.
It just works with versions < 8.4 since it doesn't support *.
I'd be willing to maintain the module or even expand its
features and release it on any license that will please postgresql
community if my effort would actually make it more easily available
to other people but throwing it on the internet won't be enough.
If someone think it is worth the effort to help me understand how
that could happen I'd be glad to learn.
[1]: http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables
http://www.webthatworks.it/d1/content/postgresql-c-module-turn-tsvectors-tsquery-and-return-tsvectors-tables
--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Sushant,
Can this fit?
select plainto_tsquery('english', 'the quick brown fox jumped over the
lazy fox');
plainto_tsquery
-----------------------------------------------------
'quick' & 'brown' & 'fox' & 'jump' & 'lazi' & 'fox'
no, this cannot fit. This just adds a third variation, this time a tsquery
object. There is no way to access the single words in that ts_query
programmatically (besides the ugly "cast to text and split at &')
Harald
--
Harald Armin Massa www.2ndQuadrant.d <http://www.2ndquadrant.com/>e
PostgreSQL Training, Services and Support
2ndQuadrant Deutschland GmbH
GF: Harald Armin Massa
Amtsgericht Stuttgart, HRB 736399
a) the lexems of a tsvector
ts_debug (it`s a plain sql function, may give You some inspiration for
Your own queries)
b) the (unnamed) fields of a set-of-record-returning function
select * from ts_parse('default','the quick brown fox jumped over the lazy fox')
or
select (ts_parse('default','the quick brown fox jumped over the lazy
fox')).token
Greetings
Marcin Mańk
Massa, Harald Armin wrote:
select ts_parse('default','the quick brown fox jumped over the lazy fox')
(1,the)
(12," ")
(1,quick)
[...]
(1,fox)is a set-returning-function, giving me 17 records of type pseudo-record.
Stopwords still in there, so what. But: No chance of accessing the second
field in that record.
What about:
select w from ts_parse('default','the quick brown fox jumped over the lazy
fox')
as lexems(i,w);
Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org