full text searching

Started by Culley Harrelson almost 25 years ago; 17 messages
#1Culley Harrelson
culleyharrelson@yahoo.com

Hi,

OK full text searching. Will the full text index
catch changes in verb tense? i.e. will a search for
woman catch women?

I'm researching before I dive in to this later in the
week so please excuse this incompletely informed
question: Will I need to rebuild postgresql with the
full-text index module included? Unfortunately I'm
away from my linux machine-- would someone be willing
to email me the README?

Thanks in advance,

Culley

__________________________________________________
Get personalized email addresses from Yahoo! Mail - only $35
a year! http://personal.mail.yahoo.com/

#2Mitch Vincent
mitch@venux.net
In reply to: Culley Harrelson (#1)
Re: full text searching


Regardless of indexing, you're still searching for a specific string (if you
search using the = operator).

SELECT * from people WHERE whatever = 'woman';

-- Isn't going to catch anything but the literal string "woman" (it's case
sensitive too, mind you).

SELECT * from people WHERE whatever LIKE 'wom%n';

-- Should catch either.

A regex search can get more specific, but when using a regex search you
can't use indexes.
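A minimal sketch of the difference between = and LIKE, using Python's built-in SQLite purely for illustration (the table and column names follow Mitch's example; note that PostgreSQL's LIKE is case-sensitive while SQLite's is not for ASCII letters):

```python
import sqlite3

# Toy table mirroring Mitch's example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (whatever TEXT)")
conn.executemany("INSERT INTO people VALUES (?)",
                 [("woman",), ("women",), ("wombat",)])

# Exact match: catches only the literal string.
exact = [r[0] for r in conn.execute(
    "SELECT whatever FROM people WHERE whatever = 'woman'")]

# Pattern match: 'wom%n' catches both 'woman' and 'women', since %
# matches any run of characters (beware of unintended matches too).
like = [r[0] for r in conn.execute(
    "SELECT whatever FROM people WHERE whatever LIKE 'wom%n'")]

print(exact)  # ['woman']
print(like)   # ['woman', 'women']
```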

Anyone, please correct me if I'm wrong.

-Mitch

#3Mitch Vincent
mitch@venux.net
In reply to: Culley Harrelson (#1)
Re: full text searching

Another thing..

Full text indexing, last time I checked, was just a trigger/function, you
don't have to rebuild anything that I'm aware of to include it..

-Mitch


#4Gunnar Rønning
gunnar@candleweb.no
In reply to: Culley Harrelson (#1)
Re: full text searching

"Mitch Vincent" <mitch@venux.net> writes:

-- Isn't going to catch anything but the literal string "woman".. (it's case
sensitive too, mind you)

SELECT * from people WHERE whatever LIKE 'wom%n';

-- Should catch either.

Well, you wouldn't want to start building these kinds of rules into your
application; better to have them in the search engine. The fulltextindex
in the contrib package of course doesn't offer anything like this. It
would be nice to see a third-party add-on provide full-text capabilities
for PostgreSQL.

regards,

Gunnar

#5Mitch Vincent
mitch@venux.net
In reply to: Culley Harrelson (#1)
Re: full text searching

Well, you wouldn't want to start building these kinds of rules into your
application; better to have them in the search engine. The fulltextindex
in the contrib package of course doesn't offer anything like this. It
would be nice to see a third-party add-on provide full-text capabilities
for PostgreSQL.

Well, the search engine isn't the database, IMHO. The search "engine" is
your application... The database will go get anything you tell it to, you
just have to know how to tell it and make sure that your application tells
it in the correct way.

Teaching an application or database the English language is going to be a
hell of a project, good luck!

Personally, I think the FTI trigger and function in contrib is pretty bad.
It's not usable in a lot of situations. I rewrote it to remove duplicates
and index whole words, but it still didn't work out for me, namely when you
have fairly large chunks of text (30k or so) in each row and you have
10,000 rows. Ripping out and indexing 30k * 10k of text is a lot by itself,
but then when you search you have to join the two tables. It becomes a
mess, and was actually slower than when I used LIKE to search the big text
fields in my single table. The seq scan and the index scan on the FTI table
only take a few seconds, but with FTI, updating became a 30-second job (of
course there were something like 4 million rows, and each record did have
30k of text or so). I don't have too many small databases, so maybe this
works for a lot of people :-)

Anyway. Moral of the story.. I'd like to see native PostgreSQL full text
indexing before we go adding on to the contrib'd trigger/function
implementation...

-Mitch

#6Gunnar Rønning
gunnar@candleweb.no
In reply to: Culley Harrelson (#1)
Re: full text searching

"Mitch Vincent" <mitch@venux.net> writes:

Well, the search engine isn't the database, IMHO. The search "engine" is
your application... The database will go get anything you tell it to, you
just have to know how to tell it and make sure that your application tells
it in the correct way.

Teaching an application or database the English language is going to be a
hell of a project, good luck!

Well, I don't want to write another search engine. What I would like to see
is a way to integrate with different third-party products. It would be cool
to have Lucene or some other free search engine as an optional add-on
for PostgreSQL.

Anyway. Moral of the story.. I'd like to see native PostgreSQL full text
indexing before we go adding on to the contrib'd trigger/function
implementation...

Well, I think any attempt at a "complete" full text indexing implementation
in the database itself is futile. Find a way to move this out of the
database and integrate with another product.

I've been using a variant of the FTI system in an application, but this is
far from sufficient when it comes to matching. Speed is OK, but the quality
of the results could have been a lot better.

Regards,

Gunnar

#7Mitch Vincent
mitch@venux.net
In reply to: Culley Harrelson (#1)
Re: full text searching

Well, the search engine isn't the database, IMHO. The search "engine" is
your application... The database will go get anything you tell it to, you
just have to know how to tell it and make sure that your application tells
it in the correct way.

Teaching an application or database the English language is going to be a
hell of a project, good luck!

Well, I don't want to write another search engine. What I would like to see
is a way to integrate with different third party products. It would be cool
with Lucene or some other free search engine as an optional add on
for PostgreSQL.

Anyway. Moral of the story.. I'd like to see native PostgreSQL full text
indexing before we go adding on to the contrib'd trigger/function
implementation...

Well, I think any attempt at a "complete" full text indexing implementation
in the database itself is futile. Find a way to move this out of the
database and integrate with another product.

Futile? Nah, I don't think it's any more futile than indexing any other
field. If you could have both then, well, that would rock.

I'm talking about indexing from the standpoint of fast searching, not really
smart searching (I wouldn't want a search for "woman" to return results with
"women"). I put it upon myself to generate the queries needed to give the
proper results. I work for a custom software shop, and so far every
application I've written needs a search, and the client needs it to do a
very customized, very specific thing. I can't help but write it from scratch
(of course, all I'm really doing is writing a frontend to PostgreSQL). I'm
not sure that a generic search engine would work for me, because all the
clients I've worked with have very specific needs. PostgreSQL is my generic
search engine for all intents and purposes, and I make it give me what I
want. With regard to FTI, I just want it to search large chunks of text
faster...

I've been using a variant of the FTI system in an application, but this is
far from sufficient when it comes to matching. Speed is OK, but the quality
of the results could have been a lot better.

Really? How are you using it? If it's better than the one I wrote (and it
almost has to be!) I'd love to take a look. Speed is OK on the machine I'm
searching through large text chunks with now, because of a combination of a
good database (PostgreSQL) and a hefty machine (dual PII 800, 512M ECC RAM,
Ultra 160 SCSI drives). Still, it's only doing sequential scans and using
LIKE to give me matches. My search is a generic SELECT * from whatever WHERE
textfield LIKE '%<searched word>%'; that's fairly fast, but it would be a
hell of a lot faster if I could do an index scan there. Of course it was,
it's just that updating and inserting suffered too much, something that will
happen any time you're indexing large amounts of data on the fly. I would
just like to see it suffer less, which might happen if FTI were built into
PG. I'm just talking here; I don't know how FTI would be implemented better
if it were built in, other than that I'm sure the person doing it would know
more about the internals of PG and more about C than me (sadly, I'm not all
that good with C anymore)..

Have a good one!

-Mitch

#8Gunnar Rønning
gunnar@candleweb.no
In reply to: Culley Harrelson (#1)
Re: full text searching

"Mitch Vincent" <mitch@venux.net> writes:

I've been using a variant of the FTI system in an application, but this is
far from sufficient when it comes to matching. Speed is OK, but the quality
of the results could have been a lot better.

Really? How are you using it? If it's better than the one I wrote (and it
almost has to be!) I'd love to take a look.. Speed is OK on the machine I'm

It is really not based on the FTI code in PostgreSQL, since we started
out with a Progress database last year before porting to PostgreSQL. The
idea is the same though: a separate lookup table containing the words for
exact matching. Last time I had a look at the client's database it had about
50-60K rows in the content table, which amounted to about ~3500K rows in
the lookup table. Searches return results instantly even though most of
them are joins involving 3-4 tables. The database (7.0.2) is running on a
Sun 220R with one 450 MHz processor, 10,000 RPM disks, 1 GB RAM and Solaris
7. (As a curiosity, my P466 laptop with Linux is actually running PostgreSQL
faster...)
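The separate-lookup-table scheme described above can be sketched like this (SQLite stands in for PostgreSQL here, and all table and column names are invented for illustration): every word of every document goes into an indexed word table, so an exact word search becomes an indexed join instead of a sequential LIKE scan.

```python
import re
import sqlite3

# Content table plus a one-row-per-word lookup table, with an index on
# the word column so exact word searches use the index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE words (word TEXT, content_id INTEGER);
    CREATE INDEX words_word_idx ON words (word);
""")

def index_document(doc_id, body):
    # Rip the words out of the text and store one row per distinct word.
    conn.execute("INSERT INTO content VALUES (?, ?)", (doc_id, body))
    rows = {(w.lower(), doc_id) for w in re.findall(r"[a-zA-Z]+", body)}
    conn.executemany("INSERT INTO words VALUES (?, ?)", sorted(rows))

index_document(1, "The ship sailed at dawn")
index_document(2, "Cars and cartography")

# Exact word search via a join against the lookup table.
hits = [r[0] for r in conn.execute(
    "SELECT c.id FROM content c JOIN words w ON w.content_id = c.id "
    "WHERE w.word = ?", ("ship",))]
print(hits)  # [1]
```

As the thread notes, the cost of this design shows up on INSERT and UPDATE, since every change to a document means rewriting its word rows.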

Since we're only doing exact searches, the index is utilized. But the
quality isn't good enough - I would love to have language sensitive
searches. "car" should match "cars" but not cartography and "ship"
should/could match "boat" etc.

Regards,

Gunnar

#9Thomas T. Thai
tom@minnesota.com
In reply to: Gunnar Rønning (#8)
Re: full text searching

On 7 Feb 2001, Gunnar Rønning wrote:


Since we're only doing exact searches, the index is utilized. But the
quality isn't good enough - I would love to have language sensitive
searches. "car" should match "cars" but not cartography and "ship"
should/could match "boat" etc.

you can use ispell prefix/suffix for searching base words.
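The ispell idea is to reduce words to base forms via affix rules before lookup. The following is only a toy suffix-stripper to show the shape of it; real ispell dictionaries carry per-word affix flags, which is what keeps a blind rule from mangling words like "cartography". The vocabulary here is invented.

```python
# Strip a plural suffix only when the stripped form is a known base word,
# so "cars" maps to "car" but "cartography" is left alone.
def naive_stem(word, vocabulary):
    word = word.lower()
    for suffix in ("es", "s"):
        if word.endswith(suffix) and word[: -len(suffix)] in vocabulary:
            return word[: -len(suffix)]
    return word

vocab = {"car", "ship", "woman", "cartography"}
print(naive_stem("cars", vocab))         # car
print(naive_stem("cartography", vocab))  # cartography
```

Irregular forms like "women" → "woman" would still need an explicit exception table, which dictionary-based systems also provide.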

#10Lincoln Yeoh
lyeoh@pop.jaring.my
In reply to: Thomas T. Thai (#9)
Re: full text searching

At 11:40 AM 07-02-2001 -0600, Thomas T. Thai wrote:

On 7 Feb 2001, Gunnar Rønning wrote:

Since we're only doing exact searches, the index is utilized. But the
quality isn't good enough - I would love to have language sensitive
searches. "car" should match "cars" but not cartography and "ship"
should/could match "boat" etc.

you can use ispell prefix/suffix for searching base words.

Sometimes I want it literal too, e.g. if I'm searching for car, I want car
and nothing else.

What I figured first was to create yet another dictionary of meanings.

Where ship is converted to meaning1 (verb), meaning2 (noun), etc.
And boat is converted to meaning2 (noun).

But then boat is not really the same as ship. So they should actually be
separate meanings maybe linked by a weight. Maybe we should make use of
Postgresql's polygon stuff, and stick a boat polygon somewhere where it
intersects a ship polygon and a canoe polygon.

In the end it seems too complex. I'm starting to think it may be better to
keep things literal but fast, and do the smart stuff at the client (do
multiple searches if necessary).

Cheerio,
Link.

#11Gunnar Rønning
gunnar@candleweb.no
In reply to: Gunnar Rønning (#8)
Re: full text searching

Lincoln Yeoh <lyeoh@pop.jaring.my> writes:

Sometimes I want it literal too. e.g. I'm searching for car I want car and
nothing else.

Of course, you want this as well.

In the end it seems too complex. I'm starting to think it may be better to
keep things literal but fast, and do the smart stuff at the client (do
multiple searches if necessary).

There are commercial products like Verity that are able to do these kinds
of things. I've used Verity in conjunction with Sybase; apart from
stability problems, that was a very nice combination for doing free text
searches. I could define which columns I wanted indexed, and then I could
do joins against a proxy table (the index) to do synonym searches, word
searches, regex searches, soundex searches, etc.

Verity ran in a separate process that Sybase forwarded the free text
search to when you joined against the proxy table. Maybe we could have a
similar architecture in PostgreSQL as well some day.

Does anybody know how Oracle has implemented their "context" search or
whatever it is called nowadays ?

regards,

Gunnar

#12Ned Lilly
ned@greatbridge.com
In reply to: Gunnar R|nning (#8)
Re: [GENERAL] Re: full text searching

(bcc'ed to -hackers)

Gunnar Rønning wrote:

Does anybody know how Oracle has implemented their "context" search or
whatever it is called nowadays ?

They're calling it Intermedia now ... http://www.oracle.com/intermedia/

I have yet to meet an Oracle customer who likes it.

I think there's a lot of agreement that this is an area where Postgres
could use some work. I know Oleg Bartunov has done some interesting
work with Postgres and the search engine at the Russian portal site
"Rambler" ... http://www.rambler.ru/ . Oleg, could you talk a bit about
what you guys did?

If there's interest in spinning up a separate project to sit outside the
database, a la Intermedia or Verity, we'd be happy to sponsor such a
thing on our GreatBridge.org project hosting site (CVS, bug tracking,
mail lists, etc.)

Regards,
Ned

--
----------------------------------------------------
Ned Lilly e: ned@greatbridge.com
Vice President w: www.greatbridge.com
Evangelism / Hacker Relations v: 757.233.5523
Great Bridge, LLC f: 757.233.5555

#13Oleg Bartunov
oleg@sai.msu.su
In reply to: Ned Lilly (#12)
Re: [GENERAL] Re: full text searching

On Thu, 8 Feb 2001, Ned Lilly wrote:

(bcc'ed to -hackers)

Gunnar Rønning wrote:

Does anybody know how Oracle has implemented their "context" search or
whatever it is called nowadays ?

They're calling it Intermedia now ... http://www.oracle.com/intermedia/

I have yet to meet an Oracle customer who likes it.

I think there's a lot of agreement that this is an area where Postgres
could use some work. I know Oleg Bartunov has done some interesting
work with Postgres and the search engine at the Russian portal site
"Rambler" ... http://www.rambler.ru/ . Oleg, could you talk a bit about
what you guys did?

Well, we have an FTS engine fully based on PostgreSQL. It was developed
specifically for indexing dynamic text collections like online news. It
has support for morphology, uses coordinate information, and does
sophisticated ranking of search results. Search and ranking are built
into Postgres. Currently the biggest collection we have is about 300,000
messages. We're not very happy with performance on a collection of that
size, and specifically to improve it we did research in the GiST area.
Using GiST we added index support for integer arrays, which greatly
improves search performance! Right now we are trying to understand
how to improve sort performance, which is the final (we hope) stopper
for our FTS. Let me explain a bit:
Search performance is great, but in a real-life application we have to
display the results of a search on a Web page, page by page. Results could
be sorted by relevancy or another parameter. In the case of online news or
a mailing list archive, results are sorted by publication date. We found
that most of the time is spent sorting the full set of results, while we
need just 10-15 rows to display on a Web page (using ORDER BY .. LIMIT,
OFFSET). Some queries in our case produce about 50,000 rows (search for
"Putin", for example)! Sort time is enormous and eats all the performance
gain we made for search. One solution we are currently investigating is
implementing partial sort in Postgres. We don't need to sort the full set.
Currently LIMIT provides a rather simple optimization: only part of the
results are transferred from the backend to the client. We propose to stop
sorting once the part of the results we need is already sorted. From our
experience and the literature we know that the first 2 pages of search
results get 95% of all hits. In our worst case with 50,000 rows, we could
get the first page to display about 5-6 times faster if we did partial
sorting. I understand it looks like a rather limited area for optimization,
but many people would appreciate it. I remember when I asked Jan to
implement the LIMIT feature; many friends instantly moved from MySQL to
Postgres. That feature isn't standard, but it's Web friendly and most web
applications utilize it.
We have a patch for 7.1; well, just a sketch we did for benchmarking
purposes. Tom isn't happy, and we still need some help from core
developers. But it's time for the 7.1 release and we don't want to bother
developers right now. Anyway, for a medium-size collection our FTS is good
enough even using plain 7.0.3. We were planning to release the FTS as open
source before the new year, but got bogged down with organizational
problems (still have them :-()
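The partial-sort idea sketched above (stop sorting once the first page of results is in order) is essentially a top-k selection. A toy illustration, with Python's heapq standing in for what a backend implementation would do over 50,000 matching rows:

```python
import heapq
import random

# 50,000 (sort_key, row) pairs, as if a search had matched that many rows.
random.seed(1)
results = [(random.random(), f"msg-{i}") for i in range(50_000)]

# Full sort: O(n log n) over all 50,000 rows, even though the page
# only needs the first 10.
full = sorted(results)[:10]

# Partial (top-k) selection: O(n log k) with k = 10.
top10 = heapq.nsmallest(10, results)

print(top10 == full)  # True
```

The two give identical first pages; the saving is that the heap-based version never orders the other 49,990 rows.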

If there's interest in spinning up a separate project to sit outside the
database, a la Intermedia or Verity, we'd be happy to sponsor such a
thing on our GreatBridge.org project hosting site (CVS, bug tracking,
mail lists, etc.)

We plan to develop a sample application, searching the Postgres mail
archives (I have the collection from 1995), and present it for testing. If
people are happy with the performance and quality of the results, we could
install it on www.postgresql.org.


Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#14The Hermit Hacker
scrappy@hub.org
In reply to: Oleg Bartunov (#13)
Re: [GENERAL] Re: full text searching

On Fri, 9 Feb 2001, Oleg Bartunov wrote:

We plan to develop a sample application, searching the Postgres mail
archives (I have the collection from 1995), and present it for testing. If
people are happy with the performance and quality of the results, we could
install it on www.postgresql.org.

You tell me what resources you need, and, hell, what access you need for
testing, and I'll happily provide it ... :)

#15Gordan Bobic
gordan@freeuk.com
In reply to: Thomas T. Thai (#9)
Replication and on-line recovery

Hi!

I will be setting up a 3-way replicated server setup over the next
week or two, so this is probably a good time to make sure I know
exactly what I'm in for. To my knowledge, rserv is the only currently
available replication server for Postgres - is that correct?

1) Is there a way to designate 1 server as master, and do transparent
clustering by just connecting to that one server? Or does the
clustering have to be handled on the client end by making a connection
to each server and doing a "round-robin" for picking the server for
each query (multi-threaded application)?

2) If a server fails, how can it be resynchronised with the rest of
the cluster? I am assuming that inserts all have to be channeled
through the same server, in order to avoid race conditions. Is this
correct? If a server fails, and inserts are constantly being made,
then I am guessing that a dump/restore will not work properly if the
master (insert) server is not taken off-line and inserts are stopped.
Is this the case? How can this be done without taking the master
server off-line? Taking any server off line would take it out of sync,
so just doing a restore from another secondary server would then
result in two servers being out of sync. How is this worked around?

3) I know this has been asked before, and I've managed to get a few
responses about how to implement a quick and useable solution to this,
but I seem to have misplaced the emails, so please forgive me for
asking this again.

Is it possible to run Linux + Mosix + GFS to achieve the functionality
of a truly distributed database system? The bandwidth of communication
between the servers is not a huge problem, because I have the option
of connecting them either in a "star" configuration with 100 Mb
ethernet, connect them to a switch, or use a fiber link between them.
I've been told that "postmaster" won't get migrated properly due to
IPC and shared memory issues. Can anyone suggest a work-around? DIPC,
perhaps? I can't see how to work around the shared memory, though...

I know Oracle has full distributed clustering support, but I have
made a decision to stick with open source software all the way (plus
the cost of running Oracle on a commercial setup is quite
prohibitive).

Still, even if the answer to the fully distributed database question
here is still just big, fat, flat "NO!", I'd really appreciate some
input regarding failure recovery and insert handling on a replicated
database cluster.
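The client-side round-robin from question 1 can be sketched in a few lines: the application holds a server list (host names below are invented) and cycles through it when picking a connection for each read query.

```python
import itertools

# One entry per replicated server; in practice each would map to an
# open connection or a DSN.
servers = ["db1.example.com", "db2.example.com", "db3.example.com"]
rr = itertools.cycle(servers)

def pick_server():
    # Each call hands out the next server in rotation.
    return next(rr)

picked = [pick_server() for _ in range(4)]
print(picked)  # wraps back to db1 on the fourth pick
```

Writes would still be channeled through the designated master, per question 2.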

Regards.

Gordan

#16Justin Clift
justin@postgresql.org
In reply to: Thomas T. Thai (#9)
Re: Replication and on-line recovery

Hi Gordan,

Whilst probably not really useful just yet (as I presently know very
little about replication, but I'm learning), I have just begun writing
up my initial attempts with rserv and Usogres (another PostgreSQL
replication approach).
replication approach).

techdocs.postgresql.org/installguides.html#replication

If you get it all setup and working in good order, can you write up a
guide on doing it, as I haven't found anything "out there" about it,
which is why I'm starting?

:-)

Regards and best wishes,

Justin Clift


--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

#17Gregory Wood
gregw@com-stock.com
In reply to: Thomas T. Thai (#9)
Re: Replication and on-line recovery

I will be setting up a 3-way replicated server setup over the next
week or two, so this is probably a good time to make sure I know
exactly what I'm in for. To my knowledge, rserv is the only currently
available replication server for Postgres - is that correct?

PostgreSQL Replicator (http://pgreplicator.sourceforge.net/) appears to be
another. Justin has also mentioned Usogres (which I think is at
http://usogres.good-day.net/ but appears to be down at the moment).

I'm starting to do some research (kinda sorta) on replication, and I'd love
to hear anything about any of these three servers. I'll try to do the
same if my time allows.

Greg