NOTICE: word is too long INSERT 0 3014

Started by Joshua D. Drake over 19 years ago - 27 messages - hackers, general
#1 Joshua D. Drake <jd@commandprompt.com>
hackers, general

Hello,

I am running into this limitation a lot with Tsearch2. What are my
options to get around it? Do I have to compile PostgreSQL with a
different block size?

If yes, what are the downsides to doing so (outside of not being able to
do straight upgrades)?

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#2 Joshua D. Drake <jd@commandprompt.com>
In reply to: Joshua D. Drake (#1)
hackers, general
Re: NOTICE: word is too long INSERT 0 3014

Any thoughts on the below?

Joshua D. Drake wrote:

Hello,

I am running into this limitation a lot with Tsearch2. What are my
options to get around it? Do I have to compile PostgreSQL with a
different block size?

If yes, what are the downsides to doing so (outside of not being able to
do straight upgrades)?

Sincerely,

Joshua D. Drake


#3 Teodor Sigaev <teodor@sigaev.ru>
In reply to: Joshua D. Drake (#1)
hackers, general
Re: NOTICE: word is too long INSERT 0 3014

I am running into this limitation a lot with Tsearch2. What are my
options to get around it? Do I have to compile PostgreSQL with a
different block size?

If yes, what are the downsides to doing so (outside of not being able to
do straight upgrades)?

If you really need that, you should play with the WordEntry definition
(tsvector.h). Sorry, I don't have the chance to look closer right now;
tomorrow. The limit for a word is 2KB, which I assumed was long enough to
store any meaningful word.

#4 Teodor Sigaev <teodor@sigaev.ru>
In reply to: Teodor Sigaev (#3)
hackers, general
Re: NOTICE: word is too long INSERT 0 3014

For example, redefine it as follows:

typedef struct
{
    uint32      haspos:1,
                len:31;
    uint32      pos;
} WordEntry;

/* <= 1Gb */
#define MAXSTRLEN  ( 1<<30 )
#define MAXSTRPOS  ( 1<<30 )
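A quick sketch of the arithmetic behind this change (assuming the stock WordEntry uses an 11-bit len field, which is what the ~2KB word limit suggests; illustrative Python, not part of the patch):

```python
def bitfield_max(bits):
    """Largest value an unsigned bitfield of `bits` bits can hold."""
    return (1 << bits) - 1

# Stock layout (len:11): word length capped at 2047 bytes, i.e. just under 2KB.
stock_limit = bitfield_max(11)

# Patched layout (len:31): up to 2GB-1 representable, though the MAXSTRLEN
# definition above clamps actual use to 1GB (1 << 30).
patched_limit = bitfield_max(31)
```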

Teodor Sigaev wrote:


If you really need that, you should play with the WordEntry definition
(tsvector.h). Sorry, I don't have the chance to look closer right now;
tomorrow. The limit for a word is 2KB, which I assumed was long enough to
store any meaningful word.


--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#5 Joshua D. Drake <jd@commandprompt.com>
In reply to: Joshua D. Drake (#2)
hackers, general
Index greater than 8k

Hello,

I recently posted about a word being too long with Tsearch2. That isn't
actually the problem I am trying to solve (thanks for the feedback
though, now I understand it).

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than that.

Is recompiling with a different block size the option there?
What are the downsides, apart from running a custom build?

Joshua D. Drake


#6 Teodor Sigaev <teodor@sigaev.ru>
In reply to: Joshua D. Drake (#5)
hackers, general
Re: Index greater than 8k

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than that.

Hmm, the tsearch2 GiST index is specially designed to support huge index entries:
first, every lexeme in the tsvector is transformed to a hash value (with the help
of crc32); second, all position info is stripped; third, if the size of the array
is greater than TOAST_INDEX_TARGET, then tsearch2 makes a bit signature of the
tsvector. The signature's length is fixed and equals 252 bytes by default (+ 8
bytes for the datum header). All values on internal pages are represented as
signatures as well.

So, tsearch2 guarantees that an index entry will be small enough. If that's not
true, then there is a bug - please put together a test case demonstrating the
problem.
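The signature step can be sketched roughly as follows (illustrative Python; SIG_BITS mirrors the 252-byte default mentioned above, and the one-bit-per-lexeme crc32 folding is an assumption about the scheme, not the exact tsearch2 code):

```python
import zlib

SIG_BITS = 252 * 8  # 252-byte signature, per the default mentioned above

def signature(lexemes):
    """Fold a set of lexemes into a fixed-width bit signature, Bloom-filter
    style: crc32-hash each lexeme and set one bit for it."""
    sig = 0
    for lex in lexemes:
        sig |= 1 << (zlib.crc32(lex.encode()) % SIG_BITS)
    return sig

def might_contain(sig, lexeme):
    """Signatures are lossy: they can prove absence, never presence."""
    return bool((sig >> (zlib.crc32(lexeme.encode()) % SIG_BITS)) & 1)
```

Because distinct lexemes can map to the same bit, a signature match only says a row might contain the lexeme; candidate matches have to be rechecked.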

Is recompiling the block size the option there?
What are the downsides, except for the custom build?

Can you send the exact error message?

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#7 Joshua D. Drake <jd@commandprompt.com>
In reply to: Teodor Sigaev (#6)
hackers, general
Re: [HACKERS] Index greater than 8k

Teodor Sigaev wrote:

Can you send the exact error message?

I am training this week, but Darcy can do it. Can you give them a test
case on what we were working on with that customer?

Joshua D. Drake


#8 Darcy Buskermolen <darcyb@commandprompt.com>
In reply to: Joshua D. Drake (#7)
hackers, general
Re: [HACKERS] Index greater than 8k

On October 31, 2006 06:42 am, Joshua D. Drake wrote:

Teodor Sigaev wrote:

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than
that.

The problem, as I remember it, is pg_trgm, not tsearch2 directly. I've sent a
self-contained test case directly to Teodor which shows the error:

'ERROR: index row requires 8792 bytes, maximum size is 8191'


--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/

#9 Teodor Sigaev <teodor@sigaev.ru>
In reply to: Darcy Buskermolen (#8)
hackers, general
Re: [HACKERS] Index greater than 8k

The problem, as I remember it, is pg_trgm, not tsearch2 directly. I've sent a
self-contained test case directly to Teodor which shows the error:

'ERROR: index row requires 8792 bytes, maximum size is 8191'

Uh, I see. But I'm really surprised that you use pg_trgm on big text. pg_trgm
is designed to find similar words and uses a technique known as trigrams. This
works well on small pieces of text such as words or set expressions, but all big
texts (in the same language) will look similar :(. So I didn't take care to
guarantee the index tuple size limitation. In principle, it's possible to modify
pg_trgm to provide such a guarantee, but the index becomes lossy - all tuples
fetched from the index would have to be rechecked against the table's tuples.
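Teodor's point can be illustrated with a rough sketch of trigram similarity (the padding and the formula are simplifications of what pg_trgm actually does):

```python
def trigrams(word):
    """Pad the word roughly the way pg_trgm does (two spaces in front,
    one behind) and slide a 3-character window across it."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard-style overlap of trigram sets - a sketch of pg_trgm's
    similarity(), not its exact formula."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

Two short words share few trigrams unless they really are similar, but two long texts in the same language share most of their trigrams, so everything looks "similar".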

If you want to search for similar documents, I can recommend having a look at the
fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
close to trigrams and the similarity metric is the same, but it uses different
signature calculations. And there are some tips and tricks: removing HTML
markup, removing punctuation, lowercasing text and so on - it's an interesting
and complex task.

#10 Darcy Buskermolen <darcyb@commandprompt.com>
In reply to: Teodor Sigaev (#9)
hackers, general
Re: [HACKERS] Index greater than 8k

On October 31, 2006 08:53 am, Teodor Sigaev wrote:

Uh, I see. But I'm really surprised that you use pg_trgm on big text.

The problem is that some of the data we are working with is not strictly "text" but
bytea that we've run through encode(bytea, 'escape'), and we've had to resort to
trigrams in an attempt to mimic LIKE for searches. From our findings, tsearch2 does
not match partial words the way LIKE would; i.e. col LIKE 'go%' would match 'good'
and 'gopher'. pg_trgm will return those with the limit set appropriately, but
tsearch2 does not.
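For reference, a rough Python model of the escaped form being indexed - assuming, as the PostgreSQL docs describe for encode(bytea, 'escape'), that zero bytes and high-bit-set bytes become three-digit octal escapes, backslashes are doubled, and the rest passes through:

```python
def encode_escape(data: bytes) -> str:
    """Sketch of encode(bytea, 'escape') output, per the documented rules."""
    out = []
    for b in data:
        if b == 0x5C:               # backslash is doubled
            out.append("\\\\")
        elif b == 0 or b >= 0x80:   # zero bytes and high-bit bytes -> \nnn octal
            out.append("\\%03o" % b)
        else:                       # everything else passes through
            out.append(chr(b))
    return "".join(out)
```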

--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/

#11 Alvaro Herrera <alvherre@2ndquadrant.com>
In reply to: Darcy Buskermolen (#10)
hackers, general
Re: [HACKERS] Index greater than 8k

Darcy Buskermolen wrote:


The problem is some of the data we are working with is not strictly "text" but
bytea that we've run through encode(bytea, 'escape'),

I think one good question is why you are storing bytea and then searching
it as if it were text. Why not store the text as text, and put the
extraneous bytes somewhere else? Certainly you wouldn't expect to be able
to find text among the bytes, would you?

I remember suggesting that you store the Content-Type next to each object,
and then create partial trigram indexes where Content-Type is text/*.
Did that plan not work for some reason?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#12 Joshua D. Drake <jd@commandprompt.com>
In reply to: Teodor Sigaev (#9)
hackers, general
Re: [HACKERS] Index greater than 8k

Teodor Sigaev wrote:

But I'm really surprised that you use pg_trgm on big text.

We are trying to get something faster than ~ '%foo%',
which Tsearch2 does not give us :)

Joshua D. Drake


#13 Joshua D. Drake <jd@commandprompt.com>
In reply to: Alvaro Herrera (#11)
hackers, general
Re: [HACKERS] Index greater than 8k

Alvaro Herrera wrote:


I think one good question is why you are storing bytea and then
searching it as if it were text.

We are not storing bytea; a customer is. We are trying to work around
customer requirements. The data being stored is not always text; sometimes
it is binary (a Flash file or a JPEG). We are using escaped text to be able
to search the string contents of those files.

Why not store the text as text, and put
the extraneous bytes somewhere else? Certainly you wouldn't expect to
be able to find text among the bytes, would you?

Yes, we do (and can) expect to find text among the bytes. We have searches
running; we are just running into the maximum size issue for certain rows.

Sincerely,

Joshua D. Drake


#14 Gregory S. Williamson <gsw@globexplorer.com>
In reply to: Joshua D. Drake (#1)
hackers, general
Re: [HACKERS] Index greater than 8k

I hesitate to mention it, since it's retrograde, uses OIDs, may not handle your locale/encoding correctly, may not scale well for what you need, etc., etc.

But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go with tsearch2 because we were a bit worried about the need to search for fragments of names, and because names don't follow stemming rules and the like very well. Still, it might be a way of handling some of the uglier data. It was a bit of a pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions of rows, but apparently that's OK for the application this is part of. Caveat: only about 32 million rows in this dataset, partitioned into unequal groupings (about 90 total).

HTH (though I doubt it, for reasons that will undoubtedly be made clear ;-)

Greg Williamson
DBA
GlobeXplorer LLC


#15 Alvaro Herrera <alvherre@2ndquadrant.com>
In reply to: Joshua D. Drake (#13)
hackers, general
Re: [HACKERS] Index greater than 8k

Joshua D. Drake wrote:


We are not storing bytea; a customer is. We are trying to work around
customer requirements. The data being stored is not always text; sometimes
it is binary (a Flash file or a JPEG). We are using escaped text to be able
to search the string contents of those files.

Hmm, have you tried to create a functional trigram index on the
equivalent of "strings(bytea_column)" or something like that?

I imagine strings(bytea) would be a function that returns the
concatenation of all pure (7 bit) ASCII strings in the byte sequence.

On the other hand, based on Teodor's comment on pg_trgm, maybe this
won't be possible at all.
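A minimal sketch of such a strings(bytea) helper (hypothetical - nothing by this name ships with PostgreSQL; the min_len default mirrors the Unix strings(1) tool):

```python
import re

def strings(data: bytes, min_len: int = 4) -> str:
    """Concatenate all runs of printable 7-bit ASCII of at least `min_len`
    bytes, separated by spaces - roughly what the Unix strings(1) tool
    prints, and what a functional index could be built over."""
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return " ".join(r.decode("ascii") for r in runs)
```

A functional index would then be built over strings(bytea_column) rather than the raw escaped bytes, so only the recoverable text gets indexed.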

Why not store the text as text, and put
the extraneous bytes somewhere else? Certainly you wouldn't expect to
be able to find text among the bytes, would you?

Yes we do (and can) expect to find text among the bytes. We have
searches running, we are just running into the maximum size issues for
certain rows.

Do you mean you actually find stuff based on text attributes in JPEG
images and the like? I thought those were compressed ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#16 Joshua D. Drake <jd@commandprompt.com>
In reply to: Alvaro Herrera (#15)
hackers, general
Re: [HACKERS] Index greater than 8k

We are not storing bytea; a customer is. We are trying to work around
customer requirements. The data being stored is not always text; sometimes
it is binary (a Flash file or a JPEG). We are using escaped text to be able
to search the string contents of those files.

Hmm, have you tried to create a functional trigram index on the
equivalent of "strings(bytea_column)" or something like that?

I did consider that. I wonder what size we would be dealing with,
though. Part of the problem is that some of the data we are dealing with
is quite large.


Yes we do (and can) expect to find text among the bytes. We have
searches running, we are just running into the maximum size issues for
certain rows.

Do you mean you actually find stuff based on text attributes in JPEG
images and the like? I thought those were compressed ...

Well, a JPEG is probably a bad example, but yes, they do search JPEGs - I am
guessing mostly for header information. Better examples would be
PostScript files, Flash files and of course large amounts of text + HTML.

Sincerely,

Joshua D. Drake


#17 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Alvaro Herrera (#15)
hackers, general
Re: [HACKERS] Index greater than 8k

Alvaro Herrera <alvherre@commandprompt.com> writes:

Do you mean you actually find stuff based on text attributes in JPEG
images and the like? I thought those were compressed ...

Typically not --- the design assumption is that the text size wouldn't
amount to anything anyway compared to the image data, and it's better to
be able to pull it out with minimal processing.

I do suggest though that an image containing auxiliary data like text
comments is a multi-part structure, and that dumping it into a single
uninterpreted database field is spectacularly bad schema design.
You should pull the text out into a separate column once when you store
the data, instead of trying to fix things up when you search.

regards, tom lane

#18 tomas@tuxteam.de
In reply to: Joshua D. Drake (#16)
hackers, general
Re: [HACKERS] Index greater than 8k


On Tue, Oct 31, 2006 at 08:55:04PM -0800, Joshua D. Drake wrote:

We are not storing bytea [...]

[...]

Hmm, have you tried to create a functional trigram index on the
equivalent of "strings(bytea_column)" or something like that?

Hrm. Sorry for my impolite interruption, but... is there such a thing as
a "functional trigram index"? (That would be very cool.)

Thanks
--
tomás

#19 Teodor Sigaev <teodor@sigaev.ru>
In reply to: Joshua D. Drake (#12)
hackers, general
Re: [HACKERS] Index greater than 8k

We are trying to get something faster than ~ '%foo%';
Which Tsearch2 does not give us :)

Hmm, why not with GIN? :) Although GIN doesn't support prefix search right now,
it's possible to use it.

Brainstorm method:

Develop a dictionary which returns all substrings of a lexeme; for example, for
the word foobar it would return 'foobar fooba foob foo fo oobar ooba oob oo obar
oba ob bar ba ar'. And make a functional GIN index over your column (to save disk
space). So, your query would look like
select ... where to_tsvector(text_column) @@ 'foo';
Note:
search time with GIN depends only weakly on the number of words (as opposed to
tsearch2/GiST), but inserting a row may be quite slow....
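The dictionary Teodor sketches would emit every substring of length >= 2 of each lexeme; a minimal illustration (Python, not an actual dictionary template):

```python
def all_substrings(lexeme, min_len=2):
    """All substrings of length >= min_len, longest-first per start
    position - the output listed above for 'foobar'."""
    n = len(lexeme)
    return [lexeme[i:j]
            for i in range(n)
            for j in range(n, i + min_len - 1, -1)]
```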


#20 Joshua D. Drake <jd@commandprompt.com>
In reply to: Teodor Sigaev (#19)
hackers, general
Re: [HACKERS] Index greater than 8k

Teodor Sigaev wrote:

We are trying to get something faster than ~ '%foo%';
Which Tsearch2 does not give us :)

Hmm, why not with GIN? :) Although, right now GIN doesn't support prefix
search, it's possible to use it.

Well they run 8.1 :)

Joshua D. Drake


#21 Gregory Maxwell <gmaxwell@gmail.com>
In reply to: Teodor Sigaev (#19)
hackers, general

#22 Alvaro Herrera <alvherre@2ndquadrant.com>
In reply to: tomas@tuxteam.de (#18)
hackers, general

#23 Oleg Bartunov <oleg@sai.msu.su>
In reply to: Joshua D. Drake (#13)
hackers, general

#24 tomas@tuxteam.de
In reply to: Alvaro Herrera (#22)
hackers, general

#25 Oleg Bartunov <oleg@sai.msu.su>
In reply to: Gregory S. Williamson (#14)
hackers, general

#26 Joshua D. Drake <jd@commandprompt.com>
In reply to: Oleg Bartunov (#25)
hackers, general

#27 Oleg Bartunov <oleg@sai.msu.su>
In reply to: Joshua D. Drake (#26)
hackers, general