Full text Indexing -out of contrib and into main..

Started by John Huttleyabout 25 years ago14 messages
#1John Huttley
John@mwk.co.nz

Maybe asking 'Why isn't the contrib full-text-indexer not in the main
tree?' would be more productive on that front.

Well, yes. Why isn't it?

Full text indexing should be just as much a feature as any other key feature in
PG.
With the advent of unlimited file and record lengths in 7.1, this would be a good
time to
include it.

FTI is particularly useful in the context of web content engines.

How did you attempt to build it under the RPM install? I assume you had
the postgresql-devel package installed, and the include paths set
properly.... Of course, most of what is in contrib assumes a full source
tree is lying around (argh)....ie, it's wanting to include
Makefile.global in its Makefile. And, on the RPM dist, Makefile.global
isn't (yet) packaged.

Yes, I have the devel RPM, but FTI couldn't find its include files. Or its
libraries. Or something.

Building from a source tree has always been better for me. The catch is mixing
that with the RPMS,
which put things in unholy locations.
Its very hard (for me at least) to update an RPM version with a version compiled
from the source.

I recently updated my PG system from 6.5.3 to 7.0.3, still RPMs, (another fun
job) and have not tried to compile FTI
subsequently.

However if I tried hard enough I'm sure I could fix it. For the moment, I'm on
another job so I'm not worrying.

Regards

John

#2Don Baccus
dhogaza@pacifier.com
In reply to: John Huttley (#1)
Re: Full text Indexing -out of contrib and into main..

At 02:51 PM 11/28/00 +1300, John Huttley wrote:

Maybe asking 'Why isn't the contrib full-text-indexer not in the main
tree?' would be more productive on that front.

Well, yes. Why isn't it?

Full text indexing should be just as much a feature as any other key feature in
PG.
With the advent of unlimited file and record lengths in 7.1, this would be a good
time to
include it.

FTI is particularly useful in the context of web content engines.

Well ... it's pretty inadequate, actually. That might be one reason it's only
in contrib.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#3Lamar Owen
lamar.owen@wgcr.org
In reply to: John Huttley (#1)
Re: Full text Indexing -out of contrib and into main..

John Huttley wrote:

Maybe asking 'Why isn't the contrib full-text-indexer not in the main
tree?' would be more productive on that front.

Well, yes. Why isn't it?

I'm hoping to see the answer to that one myself, as that is outside my
scope currently. I just RPMize things... Although, I didn't intend for
my statement to appear as harsh as it does -- for that I apologize.

Yes, I have the devel RPM, but FTI couldn't find its include files. Or its
libraries. Or something.

Makefile.global. Tried it here. The RPMset hasn't heretofore needed
Makefile.global. I may package that, amongst other stuff necessary to
build certain things in the devel package -- once I find out how to go
about doing it.

Building from a source tree has always been better for me. The catch is mixing
that with the RPMS,
which put things in unholy locations.
Its very hard (for me at least) to update an RPM version with a version compiled
from the source.

I recommend completely removing the RPM version and installing from
source rather than trying to upgrade from an RPM distribution to the
from-source distribution. Or just install the next RPM version. Let
RPM work the headaches for you. If you want to run from a 'from-source'
build, then nix the RPMset altogether and don't worry about it
afterward.

Although I am going to consider a pre-built set of contribs -- most
notably, geodistance is likely to find its way into an RPM in the
future. I just haven't decided whether to split out to individual
contribs or to just make a single 'postgresql-contrib' subpackage. I'm
open to suggestions.

You can, however, install a source-tree preconfigured and built for the
RPM modifications by installing the _source_ RPM, and then issuing, as
root, 'rpm -bi postgresql.spec' from within the /usr/src/redhat/SPECS
dir. You will then have a source tree primed for building whatever in
/usr/src/redhat/BUILD/postgresql-x.y.z (where x.y.z is the version, of
course). You will need python-devel installed in order to do that,
however.

HTH.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

#4Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Don Baccus (#2)
Re: Full text Indexing -out of contrib and into main..

Well, yes. Why isn't it?

Full text indexing should be just as much a feature as any other key feature in
PG.
With the advent of unlimited file and record lengths in 7.1, this would be a good
time to
include it.

FTI is particularly useful in the context of web content engines.

Well ... it's pretty inadequate, actually. That might be one reason it's only
in contrib.

OK, can someone collect suggestions, add the code, and integrate it for
7.1?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#5The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#4)
Re: Full text Indexing -out of contrib and into main..

On Mon, 27 Nov 2000, Bruce Momjian wrote:

Well, yes. Why isn't it?

Full text indexing should be just as much a feature as any other key feature in
PG.
With the advent of unlimited file and record lengths in 7.1, this would be a good
time to
include it.

FTI is particularly useful in the context of web content engines.

Well ... it's pretty inadequate, actually. That might be one reason it's only
in contrib.

OK, can someone collect suggestions, add the code, and integrate it for
7.1?

too late in cycle ...

#6Don Baccus
dhogaza@pacifier.com
In reply to: The Hermit Hacker (#5)
Re: Full text Indexing -out of contrib and into main..

At 11:06 PM 11/27/00 -0400, The Hermit Hacker wrote:

On Mon, 27 Nov 2000, Bruce Momjian wrote:

OK, can someone collect suggestions, add the code, and integrate it for
7.1?

too late in cycle ...

Yes...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

#7Mitch Vincent
mitch@venux.net
In reply to: The Hermit Hacker (#5)
Re: Full text Indexing -out of contrib and into main..

I modified the FTI trigger for my own use a while ago (indexes whole
words, eliminates duplicate a few other things) -- I'm not sure if it would
do anyone any good but you're welcome to it. To whom should I send it?

-Mitch

----- Original Message -----
From: "The Hermit Hacker" <scrappy@hub.org>
To: "Bruce Momjian" <pgman@candle.pha.pa.us>
Cc: "Don Baccus" <dhogaza@pacifier.com>; "John Huttley" <John@mwk.co.nz>;
<pgsql-hackers@postgresql.org>
Sent: Monday, November 27, 2000 7:06 PM
Subject: Re: [HACKERS] Full text Indexing -out of contrib and into main..

On Mon, 27 Nov 2000, Bruce Momjian wrote:

Well, yes. Why isn't it?

Full text indexing should be just as much a feature as any other key

feature in

PG.
With the advent of unlimited file and record lengths in 7.1, this

would be a good

time to
include it.

FTI is particularly useful in the context of web content engines.

Well ... it's pretty inadequate, actually. That might be one reason

it's only

Show quoted text

in contrib.

OK, can someone collect suggestions, add the code, and integrate it for
7.1?

too late in cycle ...

#8Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: John Huttley (#1)
Re: Full text Indexing -out of contrib and into main..

Maybe asking 'Why isn't the contrib full-text-indexer not in the main
tree?' would be more productive on that front.

Well, yes. Why isn't it?

I believe that it is appropriate for contrib/ because it is a good demo
of FTI-like capabilities. But nothing more, yet. For at least a couple
of reasons:

1) It generates the "index" as a table, not a PostgreSQL index or
index-like thing.

2) It has a hardcoded list of non-indexed words. This should come from a
table, to allow it to be tuned to the application requirements.

Comments?

- Thomas

#9Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mitch Vincent (#7)
Re: Full text Indexing -out of contrib and into main..

[ Charset ISO-8859-1 unsupported, converting... ]

I modified the FTI trigger for my own use a while ago (indexes whole
words, eliminates duplicate a few other things) -- I'm not sure if it would
do anyone any good but you're welcome to it. To whom should I send it?

Is full-word optional or mandatory? It has to be an option.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#10john huttley
john@mwk.co.nz
In reply to: The Hermit Hacker (#5)
Re: Full text Indexing -out of contrib and into main..

OK, can someone collect suggestions, add the code, and integrate it for
7.1?

too late in cycle ...

How about first thing for 7.2 then? While it lies in limbo,
its never going to get the attention it deserves.

Regards

#11john huttley
john@mwk.co.nz
In reply to: John Huttley (#1)
Re: Full text Indexing -out of contrib and into main..

I believe that it is appropriate for contrib/ because it is a good demo
of FTI-like capabilities. But nothing more, yet. For at least a couple
of reasons:

1) It generates the "index" as a table, not a PostgreSQL index or
index-like thing.

2) It has a hardcoded list of non-indexed words. This should come from a
table, to allow it to be tuned to the application requirements.

Comments?

- Thomas

In general..
a) Considering that I was coding up the same thing with triggers and such,
things could only get better.

b) Check out MSSQL 7's capabilities and weep.

c) It would be a start. One its in the tree, it gets used more, gets
improved..

It would be a while yet before 7.2 starts, plenty of time then to develop
it further.

Regards

John

#12Hannu Krosing
hannu@tm.ee
In reply to: John Huttley (#1)
Re: Full text Indexing -out of contrib and into main..

john huttley wrote:

I believe that it is appropriate for contrib/ because it is a good demo
of FTI-like capabilities. But nothing more, yet. For at least a couple
of reasons:

1) It generates the "index" as a table, not a PostgreSQL index or
index-like thing.

2) It has a hardcoded list of non-indexed words. This should come from a
table, to allow it to be tuned to the application requirements.

Comments?

- Thomas

In general..
a) Considering that I was coding up the same thing with triggers and such,
things could only get better.

AFAIK, the one in contrib _is_ the same thing coded up with triggers and
such ;)

b) Check out MSSQL 7's capabilities and weep.

BTW, have you studied MSSQL enough to tell me if it has a
separate/standalone
(as a process) fti engine or just another index type.

I have been contemplating about implementing FTI for postgres for some
time and my
current plan would be to implement a out-of-process fti engine (API +
sample
implementation, in the spirit of PostgreSQLs extensibility) that could
postpone
the actual indexing but still help with queries even for not yet fully
indexed stuff.

Will probably need some choreography but essential for high performance.

You generally don't want to wait for all index entries of an inverted
index to be saved.

Also the thing should be more general than the one in contrib , being
able to index
both fields and full records and support functional indexes.

Is there a way to make PostgresQL optimiser aware of the
selectivity/cost of function,
so that it can do the right thing for a query like

SELECT * FROM ARTICLES
WHERE ADATE BETWEEN YESTERDAY AND TOMORROW
AND ARTICLES.FTI_MATCHES('(CAT & DOG) ! PRESIDENT')

It would be almost automatic if functions could return sets and then be
used like

SELECT * FROM ARTICLE
WHERE ADATE BETWEEN YESTERDAY AND TOMORROW
AND ARTICLE_ID = ARTICLE.FTI_MATCHING_IDS('(CAT & DOG) ! PRESIDENT')

and somehow the optimiser would know that it can join on the returned
ids but this
is probably not the case ;)

c) It would be a start. One its in the tree, it gets used more, gets
improved..

But, it is not a _real_ full text index, just a postgresql sample
application that
implements a full text index using an sql database.

----------
Hannu

#13Magnus Hagander
mha@sollentuna.net
In reply to: Hannu Krosing (#12)
RE: Full text Indexing -out of contrib and into main..

b) Check out MSSQL 7's capabilities and weep.

BTW, have you studied MSSQL enough to tell me if it has a
separate/standalone
(as a process) fti engine or just another index type.

It is standalone - separate process, data is stored in separate files (not
in db).

In SQL Server 7.0, you also have to manually update the index. Just updating
the values in the table does *NOT* update the index. (Can be scheduled, of
course, but not live)
In SQL Server 2000 the index can be auto-updated when rows change, but it's
not default.

//Magnus

#14Don Baccus
dhogaza@pacifier.com
In reply to: Magnus Hagander (#13)
RE: Full text Indexing -out of contrib and into main..

At 10:52 AM 11/28/00 +0100, Magnus Hagander wrote:

b) Check out MSSQL 7's capabilities and weep.

BTW, have you studied MSSQL enough to tell me if it has a
separate/standalone
(as a process) fti engine or just another index type.

It is standalone - separate process, data is stored in separate files (not
in db).

In SQL Server 7.0, you also have to manually update the index. Just updating
the values in the table does *NOT* update the index. (Can be scheduled, of
course, but not live)
In SQL Server 2000 the index can be auto-updated when rows change, but it's
not default.

This is similar to Oracle's InterMedia. In practice, using auto-update on a
busy, live website is impractical, though how much this is due to InterMedia's
being flakey and how much due to the computational expense isn't clear (or rather
IM's so flakey one can't really explore enough to see how expensive
auto-update on a busy site would be).

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.