Shrinking TSvectors

Started by Howard Cole, about 10 years ago, 7 messages, general
#1 Howard Cole
howardnews@selestial.com

Hi,

Does anyone have any pointers for shrinking tsvectors?

I have looked at the contents of some of these fields and they contain
many details that are not needed. For example...

"'+1':935,942 '-0500':72 '-0578':932 '-0667':938 '-266':937 '-873':944
'-9972':945 '/partners/application.html':222
'/partners/program/program-agreement.pdf':271
'/partners/reseller.html':181,1073 '01756':50,1083 '07767':54,1087
'1':753,771 '12':366 '14':66 (...)"

I am not interested in keeping the numbers or URLs in the indexes.

Thanks,

Howard.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Howard Cole (#1)
Re: Shrinking TSvectors

On Tue, Apr 5, 2016 at 2:37 PM, Howard News <howardnews@selestial.com>
wrote:

I am not interested in keeping the numbers or urls in the indexes.

select strip('asd:23');
 strip
-------
 'asd'
(1 row)
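
For context, strip() removes only position and weight information; the lexemes themselves, including numbers and URL paths, stay in the vector. A minimal sketch of that behaviour follows. The aliases v and s and the sample text in the second query are illustrative assumptions, and the byte counts it reports will vary with the input:

```sql
-- strip() drops positions and weights but keeps every lexeme.
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
-- Result: 'cat' 'fat' 'rat'

-- Compare the stored size before and after stripping.
-- The aliases v and s are hypothetical; sizes depend on the input.
SELECT pg_column_size(v)        AS full_bytes,
       pg_column_size(strip(v)) AS stripped_bytes
FROM (SELECT to_tsvector('english',
                         'home -9972 /partners/application.html') AS v) AS s;
```

A stripped vector still matches tsquery searches, but ts_rank_cd() relies on positions and ignores stripped lexemes, so this trade-off probably suits only cases where proximity ranking is not needed.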


#3 Arthur Zakirov
a.zakirov@postgrespro.ru
In reply to: Howard Cole (#1)
Re: Shrinking TSvectors

On 05.04.2016 14:37, Howard News wrote:

I am not interested in keeping the numbers or urls in the indexes.

Hello,

You need to create a new text search configuration. Here is an example of
the commands:

CREATE TEXT SEARCH CONFIGURATION public.english_cfg (
    PARSER = default
);
ALTER TEXT SEARCH CONFIGURATION public.english_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pg_catalog.english_stem;

Instead of "pg_catalog.english_stem" you can use your own dictionary.
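
The commands above work because token types that have no dictionary mapping are discarded by to_tsvector(). The same effect could probably be reached from the other direction, by copying the built-in configuration and dropping the mappings that are not wanted. This is only a sketch under that assumption; the configuration name public.english_nonum is hypothetical:

```sql
-- Hypothetical configuration name; COPY starts from the built-in English setup.
CREATE TEXT SEARCH CONFIGURATION public.english_nonum (
    COPY = pg_catalog.english
);

-- Remove the mappings for numeric, URL-like and mixed token types.
-- Tokens of these types are then ignored when a tsvector is built.
ALTER TEXT SEARCH CONFIGURATION public.english_nonum
    DROP MAPPING FOR int, uint, float, sfloat, version,
                     url, url_path, host, file, email,
                     numword, numhword, hword_numpart;
```

Starting from a copy may be convenient when only one or two token types need to go, since the list passed to DROP MAPPING can be shortened to just those.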

Let's compare the new configuration with the built-in configuration
"pg_catalog.english":

postgres=# select to_tsvector('english_cfg', 'home -9972 /partners/application.html /partners/program/program-agreement.pdf');
 to_tsvector
-------------
 'home':1
(1 row)

postgres=# select to_tsvector('english', 'home -9972 /partners/application.html /partners/program/program-agreement.pdf');
                                          to_tsvector
-----------------------------------------------------------------------------------------------
 '-9972':2 '/partners/application.html':3 '/partners/program/program-agreement.pdf':4 'home':1
(1 row)
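
To see why a particular token is kept or dropped, ts_debug() reports the token type the parser assigned and the dictionaries consulted for it. A small sketch, reusing part of the sample text above (the shortened input string is an assumption made for brevity):

```sql
-- Show how the parser classifies each token and which dictionaries apply.
-- "alias" is the token type name used in ALTER MAPPING / DROP MAPPING.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english',
              'home -9972 /partners/application.html');
```

Running the same query against 'english_cfg' should show which token types no longer have any dictionaries attached.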

You can get some additional information about configurations using \dF+:

postgres=# \dF+ english
Text search configuration "pg_catalog.english"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem

postgres=# \dF+ english_cfg
Text search configuration "public.english_cfg"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 hword           | english_stem
 hword_asciipart | english_stem
 hword_part      | english_stem
 word            | english_stem
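
A new configuration does not change tsvectors that are already stored; they have to be rebuilt. A rough sketch of that step, assuming a hypothetical table docs with columns id, body (text) and tsv (tsvector), none of which appear in this thread:

```sql
-- Hypothetical table and column names, for illustration only.
-- Rebuild the stored vectors with the reduced configuration.
UPDATE docs
SET tsv = to_tsvector('public.english_cfg', body);

-- Query with the same configuration so the search terms are parsed
-- the same way as the indexed text.
SELECT id
FROM docs
WHERE tsv @@ to_tsquery('public.english_cfg', 'home');
```

Any trigger or application code that maintains the column would also need to name the new configuration, otherwise newly written rows would keep the numbers and URLs.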

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


#4 Howard Cole
howardnews@selestial.com
In reply to: Oleg Bartunov (#2)
Re: Shrinking TSvectors

On 05/04/2016 14:44, Oleg Bartunov wrote:

select strip ('asd:23');
strip
-------
'asd'
(1 row)

Hi Oleg,

Is this function documented anywhere?

Howard.

#5 Alexander Shereshevsky
shereshevsky@gmail.com
In reply to: Howard Cole (#4)
Re: Shrinking TSvectors

On Tue, Apr 5, 2016 at 5:37 PM, Howard News <howardnews@selestial.com>
wrote:

Hi Oleg,

Is this function documented anywhere?

Howard.


http://www.postgresql.org/docs/9.4/static/textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR

#6 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Howard Cole (#4)
Re: Shrinking TSvectors

On 04/05/2016 07:37 AM, Howard News wrote:

Hi Oleg,

Is this function documented anywhere?

http://www.postgresql.org/docs/9.5/static/functions-textsearch.html

--
Adrian Klaver
adrian.klaver@aklaver.com


#7 Howard Cole
howardnews@selestial.com
In reply to: Arthur Zakirov (#3)
Re: Shrinking TSvectors

On 05/04/2016 15:15, Artur Zakirov wrote:

You need to create a new text search configuration.

Thanks Artur,

That's amazing! Postgres never ceases to amaze me. And the same goes for
the contributors to this list.
