Multi-terabyte fulltext searching
Hi,
I have been struggling with getting fulltext searching to work for very large
databases. I can fulltext index 10s of gigs without any problem, but
when I start getting to hundreds of gigs it becomes slow. My current
system is a quad core with 8GB of memory. I have the resources to
throw more hardware at it, but realistically it is not cost effective
to buy a system with 128GB of memory. Are there any solutions that
people have come up with for indexing very large text databases?
Essentially I have several terabytes of text that I need to index.
Each record is about 5 paragraphs of text. I am currently using
TSearch2 (stemming, etc.) and getting sub-optimal results. Queries
take more than a second to execute. Has anybody implemented such a
database using multiple systems or some special add-on to TSearch2 to
make things faster? I want to do something like partitioning the
data across multiple systems and merging the ranked results at some
master node. Is something like this possible for PostgreSQL, or must
it be a software solution?
Benjamin
Benjamin,
as one of the authors of tsearch2, I'd like to know more about your setup.
tsearch2 in 8.2 has GIN index support, which scales much better than the old
GiST index.
Oleg
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
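For reference, with the contrib/tsearch2 module on 8.2 the switch Oleg describes is just a different index method over the same tsvector column. A minimal sketch, assuming a hypothetical articles table whose tsvector column is kept up to date by the usual tsearch2 trigger:

```sql
-- Hypothetical table; the fts column would be maintained by the
-- standard tsearch2 update trigger.
CREATE TABLE articles (
    id    serial PRIMARY KEY,
    body  text,
    fts   tsvector
);

-- Old approach: GiST (cheaper to update, slower to search at scale).
-- CREATE INDEX articles_fts_gist ON articles USING gist (fts);

-- New in 8.2: GIN (more expensive to update, much faster to search).
CREATE INDEX articles_fts_gin ON articles USING gin (fts);

-- The query form does not change; only the index method does.
SELECT id FROM articles
WHERE fts @@ to_tsquery('default', 'search & terms');
```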
Benjamin Arai wrote:
Hi,
I have been struggling with getting fulltext searching to work for very large
databases. I can fulltext index 10s of gigs without any problem, but
when I start getting to hundreds of gigs it becomes slow. My current
system is a quad core with 8GB of memory. I have the resources to throw
more hardware at it, but realistically it is not cost effective to buy a
system with 128GB of memory. Are there any solutions that people have
come up with for indexing very large text databases?

GiST indexes are very large.

Essentially I have several terabytes of text that I need to index. Each
record is about 5 paragraphs of text. I am currently using TSearch2
(stemming, etc.) and getting sub-optimal results. Queries take more
than a second to execute.

You are complaining about more than a second with a terabyte of text?

Has anybody implemented such a database using
multiple systems or some special add-on to TSearch2 to make things
faster? I want to do something like partitioning the data across multiple
systems and merging the ranked results at some master node. Is
something like this possible for PostgreSQL, or must it be a software
solution?

Benjamin
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
Hi Oleg,
I am currently using GiST indexes because I receive about 10GB of new
data a week (then again, I am not deleting any information). I do
not expect to be able to stop receiving text for about 5 years, so
the data is not going to become static any time soon. The reason I
am concerned with performance is that I am providing a search system
for several newspapers going back essentially to the beginning of time. Many
bibliographers etc. would like to use this utility, but if each search
takes too long I am not going to be able to support many concurrent
users.
Benjamin
On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:
I'm afraid that fulltext search on a multi-terabyte set of documents cannot be
implemented on any RDBMS, at least not on a single box. Specialized fulltext search
engines (with exact matching and a time-to-search of about one second) have a practical
limit near 20 million docs; a cluster, near 100 million. Bigger collections
require engines like Google's.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
True, but what happens when my database reaches 100 terabytes? Is 5
seconds OK? How about 10? My problem is that I do not believe the
performance loss I am experiencing as the data becomes large is
logarithmic in the number of records. This worries me because I could be doing
something wrong. Or I might be able to do something better.
Benjamin
On Mar 21, 2007, at 8:49 AM, Joshua D. Drake wrote:
By the way, what is the largest TSearch2 database that you know of,
and how fast does it return results? Maybe my expectations are
unrealistic.
Benjamin
On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:
Benjamin Arai wrote:
True, but what happens when my database reaches 100 terabytes? Is 5
seconds OK? How about 10? My problem is that I do not believe the
performance loss I am experiencing as the data becomes large is
logarithmic in the number of records. This worries me because I could be doing something
wrong. Or I might be able to do something better.
Well a couple of things you could do, especially if you have the ability
to throw hardware at it.
How many spindles do you have?
J
On Wed, 21 Mar 2007, Benjamin Arai wrote:
GiST is OK for your feed, but the archive part should use a GIN index.
inheritance+CE should help your life.
Regards,
Oleg
24.
Benjamin
On Mar 21, 2007, at 9:09 AM, Joshua D. Drake wrote:
Can't you implement something similar to Google's by aggregating
results from TSearch2 over many machines?
Benjamin
On Mar 21, 2007, at 8:59 AM, Teodor Sigaev wrote:
Use GiST and GIN indexes together: index any data older than one month (which doesn't
change) with GIN, and new data with GiST. Then, once per month, move data
from the GiST table to the GIN table.
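A minimal sketch of that monthly rotation, assuming hypothetical current/archive tables and an `added` timestamp column:

```sql
-- Hot table takes this week's feed; GiST tolerates heavy inserts better.
CREATE INDEX current_fts_gist ON articles_current USING gist (fts);

-- Archive never changes once loaded; GIN searches it much faster.
CREATE INDEX archive_fts_gin ON articles_archive USING gin (fts);

-- Once a month, move aged rows across in a single transaction:
BEGIN;
INSERT INTO articles_archive
    SELECT * FROM articles_current
    WHERE added < now() - interval '1 month';
DELETE FROM articles_current
    WHERE added < now() - interval '1 month';
COMMIT;
```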
What is inheritance+CE?
Benjamin
On Mar 21, 2007, at 9:10 AM, Oleg Bartunov wrote:
On Wed, 21 Mar 2007, Benjamin Arai wrote:
What is inheritance+CE?
Hmm,
http://www.postgresql.org/docs/8.2/static/ddl-inherit.html
http://www.postgresql.org/docs/8.2/static/ddl-partitioning.html
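Putting those two links together: partition via table inheritance, with constraint_exclusion turned on so the planner skips partitions whose CHECK constraints rule them out. A hedged sketch with hypothetical table and column names:

```sql
-- Parent table; the children hold the actual rows.
CREATE TABLE articles (id serial, added date, body text, fts tsvector);

CREATE TABLE articles_2006 (
    CHECK (added >= '2006-01-01' AND added < '2007-01-01')
) INHERITS (articles);
CREATE TABLE articles_2007 (
    CHECK (added >= '2007-01-01' AND added < '2008-01-01')
) INHERITS (articles);

-- Static partition gets GIN; the hot partition gets GiST.
CREATE INDEX articles_2006_fts ON articles_2006 USING gin (fts);
CREATE INDEX articles_2007_fts ON articles_2007 USING gist (fts);

-- With constraint exclusion on, a dated query touches only one child.
SET constraint_exclusion = on;
SELECT id FROM articles
WHERE added >= '2006-03-01' AND added < '2006-04-01'
  AND fts @@ to_tsquery('default', 'merger');
```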
Regards,
Oleg
On Wed, 21 Mar 2007, Benjamin Arai wrote:
Can't you implement something similar to google by aggregating results for
TSearch2 over many machines?
tsearch2 doesn't use any global statistics, so, in principle, you
should be able to run full-text search on several machines and combine
the results using dblink (contrib/dblink).
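In outline, that would mean running the same ranked query on each node and re-sorting the union at the master. A sketch, assuming contrib/dblink is installed on the master; the node connection strings and table layout are hypothetical:

```sql
-- Run the ranked query on each node, merge, and re-sort at the master.
SELECT id, rank
FROM (
    SELECT * FROM dblink('host=node1 dbname=fts',
        'SELECT id, rank_cd(fts, to_tsquery(''default'', ''search'')) AS rank
           FROM articles WHERE fts @@ to_tsquery(''default'', ''search'')')
        AS t1(id int, rank real)
    UNION ALL
    SELECT * FROM dblink('host=node2 dbname=fts',
        'SELECT id, rank_cd(fts, to_tsquery(''default'', ''search'')) AS rank
           FROM articles WHERE fts @@ to_tsquery(''default'', ''search'')')
        AS t2(id int, rank real)
) AS merged
ORDER BY rank DESC
LIMIT 20;
```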
Regards,
Oleg
Benjamin Arai wrote:
24.
I can think of a couple of things.
1. Increase your spindle count.
2. Push your GiST indexes off to another array entirely (with separate
controllers).
3. Split your actual tables between other arrays.
Or... buy a SAN (then again, I just replaced a million dollar SAN
with a split array/controller environment that cost 10% as much, and
the multiple controller solution is kicking its butt.)
Joshua D. Drake
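Item 2 maps directly onto tablespaces (available since 8.0). A sketch; the filesystem path and index names are hypothetical:

```sql
-- A tablespace on a separate array/controller just for fulltext indexes:
CREATE TABLESPACE fast_idx LOCATION '/mnt/array2/pg_idx';

-- Move an existing index there (this rewrites the index):
ALTER INDEX articles_fts_gin SET TABLESPACE fast_idx;

-- Or build new indexes there from the start:
CREATE INDEX articles_fts ON articles USING gin (fts) TABLESPACE fast_idx;
```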
Are there any examples of dblink being used in commercial
environments? I am curious to understand how it deals with node
failures and so on.
Benjamin
On Mar 21, 2007, at 9:35 AM, Oleg Bartunov wrote:
On Wed, 21 Mar 2007 08:57:39 -0700, Benjamin Arai wrote:
At a previous job, I built a system to do this. We had 3,000 publications
and approx. 70M newspaper articles. Total content size (postprocessed) was
on the order of >100GB, IIRC. We used a proprietary (closed-source,
not ours) search engine.
In order to reach subsecond response times we needed to scale horizontally
to about 50-70 machines, each a low-end Dell 1650. This was after about 5
years of trying to scale vertically.
-arturo