Dumb question involving to_tsvector and a view

Started by Raymond C. Rodgersover 13 years ago6 messagesgeneral

sinful622@gmail.com

over 13 years ago

Hi folks,
I'm building a PHP script for a web site I'm developing. At the
moment, there is absolutely no real data in the database, so obviously
performance is pretty good right now. I'm in the midst of developing an
administration page for the site, which will do a full text search on
several tables separately, and I realized that one of the tables
currently doesn't have a tsvector column. As I went to add a tsvector
column, it occurred to me that it might be possible to add a dynamic
tsvector column through the use of a view, so I created a temporary view
with a command along the lines of:

CREATE TEMPORARY VIEW ftstest AS SELECT id, field1, field2,
TO_TSVECTOR(COALESCE(field1,'') || ' ' || COALESCE(field2,'')) AS
txtsrch FROM mytable;

To my surprise, it worked. Now, I'm sitting here thinking about the
performance impact that doing this would have. I can't help but think
that a query to this view when the table is filled with thousands or
tens of thousands of entries would be painfully slow, but would there be
any real advantage to doing it in a view rather than just adding the
column to the table? (That's the dumb question.) If the site only had a
few dozen users, and the amount of data on the site was minimal, this
wouldn't be too big an issue. Still a bad design decision, but are there
any good reasons to do it?

Thanks for your patience with this dumb question. :)
Raymond

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 13 years ago

In reply to: Raymond C. Rodgers (#1)

Re: Dumb question involving to_tsvector and a view

Raymond C. Rodgers <sinful622@gmail.com> wrote:

As I went to add a tsvector column, it occurred to me that it
might be possible to add a dynamic tsvector column through the
use of a view, so I created a temporary view with a command along
the lines of:

CREATE TEMPORARY VIEW ftstest AS SELECT id, field1, field2,
TO_TSVECTOR(COALESCE(field1,'') || ' ' ||
COALESCE(field2,'')) AS txtsrch FROM mytable;

To my surprise, it worked. Now, I'm sitting here thinking about
the performance impact that doing this would have.

I had a similar situation and benchmarked it both ways. For my
situation I came out ahead writing the extra column for inserts and
updates than generating the tsvector values on the fly each time it
was queried. YMMV. It probably depends mostly on the ratio of
inserts and updates to selects.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Kevin Grittner (#2)

Re: Dumb question involving to_tsvector and a view

Kevin Grittner <kgrittn@ymail.com> writes:

Raymond C. Rodgers <sinful622@gmail.com> wrote:

As I went to add a tsvector column, it occurred to me that it
might be possible to add a dynamic tsvector column through the
use of a view, so I created a temporary view with a command along
the lines of:

�� CREATE TEMPORARY VIEW ftstest AS SELECT id, field1, field2,
TO_TSVECTOR(COALESCE(field1,'') || ' ' ||
COALESCE(field2,'')) AS txtsrch FROM mytable;

To my surprise, it worked. Now, I'm sitting here thinking about
the performance impact that doing this would have.

I had a similar situation and benchmarked it both ways.� For my
situation I came out ahead writing the extra column for inserts and
updates than generating the tsvector values on the fly each time it
was queried.� YMMV.� It probably depends mostly on the ratio of
inserts and updates to selects.

A "virtual" tsvector like that is probably going to be useless for
searching as soon as you get a meaningful amount of data, because the
only way the DB can implement a search is to compute the tsvector value
for each table row and then examine it for the target word(s).

What you want is a GIST or GIN index on the contents of the tsvector.
You can either realize the tsvector as a table column and put a regular
index on it, or you can build a functional index on the to_tsvector()
expression. The latter is kind of like your idea in that the tsvector
as a whole isn't stored anywhere --- but there's an index containing all
the words, which is what you need for searching.

I think there are examples of both ways in the "text search" chapter of
the manual. (If not, there should be ...)

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Raymond C. Rodgers

sinful622@gmail.com

over 13 years ago

In reply to: Tom Lane (#3)

Re: Dumb question involving to_tsvector and a view

On 02/23/2013 05:26 AM, Tom Lane wrote:

A "virtual" tsvector like that is probably going to be useless for
searching as soon as you get a meaningful amount of data, because the
only way the DB can implement a search is to compute the tsvector
value for each table row and then examine it for the target word(s).
What you want is a GIST or GIN index on the contents of the tsvector.
You can either realize the tsvector as a table column and put a
regular index on it, or you can build a functional index on the
to_tsvector() expression. The latter is kind of like your idea in that
the tsvector as a whole isn't stored anywhere --- but there's an index
containing all the words, which is what you need for searching. I
think there are examples of both ways in the "text search" chapter of
the manual. (If not, there should be ...) regards, tom lane

I think the only real advantage to using something like this would be a
space savings in terms of storing the tsvector data, but I don't see
that being a significant enough reason to go ahead and use this idea in
a production situation. As mentioned [by pretty much all of us], once
the table size is sufficiently large there would be a performance
penalty by to_tsvector being executed on every record in the table. (If
I'm not mistaken, with the way I wrote that "create view", every record
in "mytable" would be subject to the function call, then any narrowing
parameters in the where clause would be applied afterwards.)

Any way, like I said originally, it was a dumb question. It might be ok
to use that in a situation where the table size is known to be small,
but there's little to no reason to do it in a production situation.

Thanks!
Raymond

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jasen Betts

jasen@xnet.co.nz

over 13 years ago

In reply to: Raymond C. Rodgers (#1)

Re: Dumb question involving to_tsvector and a view

On 2013-02-23, Raymond C. Rodgers <sinful622@gmail.com> wrote:

On 02/23/2013 05:26 AM, Tom Lane wrote:

A "virtual" tsvector like that is probably going to be useless for
searching as soon as you get a meaningful amount of data, because the
only way the DB can implement a search is to compute the tsvector
value for each table row and then examine it for the target word(s).
What you want is a GIST or GIN index on the contents of the tsvector.

I think the only real advantage to using something like this would be a
space savings in terms of storing the tsvector data, but I don't see
that being a significant enough reason to go ahead and use this idea in
a production situation. As mentioned [by pretty much all of us], once
the table size is sufficiently large there would be a performance
penalty by to_tsvector being executed on every record in the table.

Unless the plan comes out as a table scan the index will be used
instead ot to_tsvector()

When there is a table scan to_tsvector will be used instead of reading
from disk, I don't know how fast to_tsvector is compared to disk, but
usually computing a result is faster than reading it from disk.

Storing the tsvector in the table is likely to be faster only when a
tablescan is done and the table is fully cached in ram.

--
⚂⚃ 100% natural

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Kevin Grittner

Kevin.Grittner@wicourts.gov

over 13 years ago

In reply to: Jasen Betts (#5)

Re: Dumb question involving to_tsvector and a view

Jasen Betts <jasen@xnet.co.nz> wrote:

On 2013-02-23, Raymond C. Rodgers <sinful622@gmail.com> wrote:

On 02/23/2013 05:26 AM, Tom Lane wrote:

A "virtual" tsvector like that is probably going to be useless for
searching as soon as you get a meaningful amount of data, because the
only way the DB can implement a search is to compute the tsvector
value for each table row and then examine it for the target word(s).
What you want is a GIST or GIN index on the contents of the tsvector.

I think the only real advantage to using something like this would be a
space savings in terms of storing the tsvector data, but I don't see
that being a significant enough reason to go ahead and use this idea in
a production situation. As mentioned [by pretty much all of us], once
the table size is sufficiently large there would be a performance
penalty by to_tsvector being executed on every record in the table.

Unless the plan comes out as a table scan the index will be used
instead ot to_tsvector()

When there is a table scan to_tsvector will be used instead of reading
from disk, I don't know how fast to_tsvector is compared to disk, but
usually computing a result is faster than reading it from disk.

Storing the tsvector in the table is likely to be faster only when a
tablescan is done and the table is fully cached in ram.

I guess I was being dumb in assuming that it was obvious that a GIN
or GiST index would be needed for decent performance at scale.
Without that, a scan of the whole table (or at least all rows
matching other search criteria) is needed, which is going to hurt.
The benchmarks I mentioned were for a GIN index on the results of
the function which generated the tsvector, versus a GIN index on
the stored tsvector. In our case, a typical scan for document text
against years of accumulated court documents was about 300 ms
versus about 1.5 seconds. It may matter that we weren't just
looking for matches, but the top K matches based on the ranking
function.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general