tsearch2 in PostgreSQL 8.3?
A lot of work has been done to try to get /contrib/tsearch2 into the
core backend for 8.3, but we have hit a roadblock in how to handle
multiple text search configurations. (FYI, the documentation is at
http://momjian.us/expire/textsearch/HTML/textsearch.html.)
There are three options for controlling text search configurations:
1) have a GUC variable which specifies the default configuration
2) require the configuration to be always specified
3) use the type system to automatically use the right configuration
The problem with #1 is that is it error-prone (easy to mismatch
configurations). One idea was to have the GUC be super-user-only but
then restoring a dump as non-super-user is a problem.
The problem with #2 is that it makes implicit and explicit casting
impossible (there is no place to specify the configuration).
#3 requires more code and is probably not something we want to do at
this stage in 8.3 development. It requires passing typmod values
between functions and operators (not something we have done easily in
the past).
Given this, should we decide to not include full text search in 8.3?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
A lot of work has been done to try to get /contrib/tsearch2 into the
core backend for 8.3, but we have hit a roadblock in how to handle
multiple text search configurations. (FYI, the documentation is at
http://momjian.us/expire/textsearch/HTML/textsearch.html.)There are three options for controlling text search configurations:
1) have a GUC variable which specifies the default configuration
2) require the configuration to be always specified
3) use the type system to automatically use the right configurationThe problem with #1 is that is it error-prone (easy to mismatch
configurations). One idea was to have the GUC be super-user-only but
then restoring a dump as non-super-user is a problem.
What is the worst consequence of mismatching configuration? Does it
cause a system crash? A backend hang? A corrupted index? Lost data?
Or does it, as I assume, just fail to return the exact result set that
would be returned if the correct configuration was supplied?
If the answer is the latter, I think this is not so huge a problem that
FULL TEXT should be rejected for 8.3 on these grounds. We just tell
people to use the correct query and be done with it.
--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La vida es para el que se aventura"
Bruce Momjian wrote:
#3 requires more code and is probably not something we want to do at
this stage in 8.3 development. It requires passing typmod values
between functions and operators (not something we have done easily in
the past).
It does? I was thinking of implicitly creating a new type, with no
typmod, when you create a new configuration. Similar to enums, I think.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Alvaro Herrera wrote:
What is the worst consequence of mismatching configuration? Does it
cause a system crash? A backend hang? A corrupted index? Lost data?
Or does it, as I assume, just fail to return the exact result set that
would be returned if the correct configuration was supplied?
Your assumption is correct.
You can mismatch configurations not just by querying in a wrong
configuration, but also by accidentally storing tsvectors generated with
different configurations in the same column (with no additional column
like Mike Rylander had to tell them apart), but it's still going to look
OK from PostgreSQL's point of view.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Aug 14, 2007 at 03:15:44PM -0400, Alvaro Herrera wrote:
Bruce Momjian wrote:
A lot of work has been done to try to get /contrib/tsearch2 into the
core backend for 8.3, but we have hit a roadblock in how to handle
multiple text search configurations. (FYI, the documentation is at
http://momjian.us/expire/textsearch/HTML/textsearch.html.)There are three options for controlling text search configurations:
1) have a GUC variable which specifies the default configuration
2) require the configuration to be always specified
3) use the type system to automatically use the right configurationThe problem with #1 is that is it error-prone (easy to mismatch
configurations). One idea was to have the GUC be super-user-only but
then restoring a dump as non-super-user is a problem.What is the worst consequence of mismatching configuration? Does it
cause a system crash? A backend hang? A corrupted index? Lost data?
Or does it, as I assume, just fail to return the exact result set that
would be returned if the correct configuration was supplied?If the answer is the latter, I think this is not so huge a problem that
FULL TEXT should be rejected for 8.3 on these grounds. We just tell
people to use the correct query and be done with it.
I think that users of the full-text option would need to read the
documentation and we could include any needed caveats. This would
certainly do for the 8.3 release and would give us time to simplify
the management and use in the 8.4 release. I, for one, have been
waiting a long time for it to be integrated into the database.
Ken
Heikki Linnakangas wrote:
Bruce Momjian wrote:
#3 requires more code and is probably not something we want to do at
this stage in 8.3 development. It requires passing typmod values
between functions and operators (not something we have done easily in
the past).It does? I was thinking of implicitly creating a new type, with no
typmod, when you create a new configuration. Similar to enums, I think.
So each new configuration is a new data type? How do the tsearch
functions handle these new data types? I also question if this can be
completed soon. I have seen no specification yet, let alone someone
coding it.
Basically, the default GUC doesn't work because of:
error prone
if super-user only, non-super-user doesn't work on restore
if non-super-user, can cause mismatch (perhaps this is the best
option), and restore still a problem (no storage of config in
indexes or tables)
No one seems to like the always-specify the configuration (loses cast
ability). And I don't see the code for new type appearing anytime soon.
Bottom line --- we better figure out something quick or it isn't going
to be in 8.3, and at this point, I am starting to doubt a solution will
magically appear.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
Basically, the default GUC doesn't work because of:
error prone
if super-user only, non-super-user doesn't work on restore
if non-super-user, can cause mismatch (perhaps this is the best
option), and restore still a problem (no storage of config in
indexes or tables)
I haven't really seen anyone else arguing about this. I wonder whether
you are being overly zealous about it.
--
Alvaro Herrera http://www.flickr.com/photos/alvherre/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem." (Tom Lane)
Alvaro Herrera wrote:
Bruce Momjian wrote:
Basically, the default GUC doesn't work because of:
error prone
if super-user only, non-super-user doesn't work on restore
if non-super-user, can cause mismatch (perhaps this is the best
option), and restore still a problem (no storage of config in
indexes or tables)I haven't really seen anyone else arguing about this. I wonder whether
you are being overly zealous about it.
Uh, OK, but no one has told me how a database restore without a
configuration name would work, so I am all ears.
CREATE INDEX ii on x(to_tsvector(col))
There is nothing that says what configuration that index should use
except the default setting, and if that is different in the restore
database, you have problems. Same for a trigger that calls to_tsvector
and is restored into a different database.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On 8/14/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Bruce Momjian wrote:
Basically, the default GUC doesn't work because of:
error prone
if super-user only, non-super-user doesn't work on restore
if non-super-user, can cause mismatch (perhaps this is the best
option), and restore still a problem (no storage of config in
indexes or tables)I haven't really seen anyone else arguing about this. I wonder whether
you are being overly zealous about it.
I hate to just pile on Bruce, but as a production user I tend to
agree... let me shoot myself in the foot. :)
--miker
Bruce Momjian <bruce@momjian.us> writes:
Alvaro Herrera wrote:
I haven't really seen anyone else arguing about this. I wonder whether
you are being overly zealous about it.
Uh, OK, but no one has told me how a database restore without a
configuration name would work, so I am all ears.
It's the dump/restore problem that really is the kicker here. I don't
mind so much a GUC that only controls the interpretation of queries, but
if it determines how an index or a tsvector column gets rebuilt, we've
got problems.
I've just finished re-reading the prior thread, and here are what seem
to me to be the salient points:
* Oleg, Teodor, and all of the old-line users of tsearch2 are
comfortable with setting up a trigger to maintain a materialized
tsvector column for a table. They argue that recomputing the tsvector
(possibly more than once) during a query is more expensive than fetching
it from disk. My suspicion is that the latter argument gets weaker
every year --- CPUs are getting faster lots faster than disks are.
* Bruce (and I ... not sure about anyone else) want to support usage of
text search via a functional index. This is argued to be easier to set
up (no fooling with triggers) and possibly faster depending on CPU vs
I/O speeds. I don't think there is any desire here to eliminate the
trigger approach, just to provide an alternative.
* For *either* the trigger or functional-index approach, I think it is
unacceptable to rely on a GUC variable to determine how the tsvector is
derived from the raw-document fields for storage or indexing. It's just
too error-prone, particularly when you consider dump-and-reload cases.
What I think we should say is that the ts parsing configuration name
can be either hardwired or taken from another field of the table.
In the trigger case this would mean providing a couple of standard
triggers, one taking the config name as a trigger parameter, and the
other accepting a trigger parameter that's the name of the config name
column. In the index case this would mean that the index expression
has to be either to_tsvector('constant', ...) or to_tsvector(field, ...).
Note that all four cases boil down to saying that stored or indexed
tsvectors have to be derived from the two-parameter form of to_tsvector.
* For queries, there is not anything very wrong with having a default
configuration, but the trick is how to get the planner to match that up
with an index that's written with the two-parameter form of to_tsvector.
One hackish possibility is to define the single-parameter form of
to_tsvector like this:
create function to_tsvector(text) returns tsvector as $$
select to_tsvector(get_default_text_search_config(), $1)
$$ language sql strict stable;
where get_default_text_search_config() is essentially just
current_setting('default_text_search_config') except it is
misleadingly marked immutable. Then, a query with
WHERE to_tsvector(document_col) @@ tsquery(...)
will have the SQL function inlined, and the
get_default_text_search_config() call const-folded, and suddenly
it looks like
WHERE to_tsvector('english', document_col) @@ tsquery(...)
and can be matched to a functional index that's declared using the
explicit 'english' configuration name. This is pretty grotty though
... can anyone think of a better way? (The main objection I can see
to it is that someone could shoot himself in the foot by using this
function instead of two-parameter to_tsvector in a custom trigger
function. But hopefully, anyone writing a custom trigger function
will have read the manual's warning not to do that. Note that I
suggest marking the function stable so that it can't be misused
that way in a functional index. Another possible objection is that
get_default_text_search_config() is a foot-gun all by itself, since
it could be used in a functional index. Aside from not documenting
it I'm not sure there's much to be done about that.)
* I'm not enamored of Heikki's idea about a datatype-based solution,
because I don't think that it will allow columns containing tsvectors
derived with different configurations. It's perfectly clear that
advanced users want to be able to do that, and it's also clear that
as long as the config name is coming from a stored column (or can be
reconstructed somehow from the stored data) that it's perfectly
well-defined.
Since I don't think that a datatype solution is the way to go,
I don't feel that we are as far away from an agreement as Bruce
is worried about.
regards, tom lane
Mike Rylander wrote:
On 8/14/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Bruce Momjian wrote:
Basically, the default GUC doesn't work because of:
error prone
if super-user only, non-super-user doesn't work on restore
if non-super-user, can cause mismatch (perhaps this is the best
option), and restore still a problem (no storage of config in
indexes or tables)I haven't really seen anyone else arguing about this. I wonder whether
you are being overly zealous about it.I hate to just pile on Bruce, but as a production user I tend to
agree... let me shoot myself in the foot. :)--miker
Sure, we can document hazards, but the larger problem is related to the
fact that the default controls what gets stored in the database. This is
a similar problem to when we had an autocommit GUC which caused
problems.
Technically, this is like how the server encoding affects what is stored
in the database. If we allowed users to change the server encoding in a
database that already had data in it, there would be no way to identify
which data was using the old encoding and which was using the new one.
Now, the application might be able to identify them just fine, but a
database restore would be unable to recreate the data the same way.
If we want to keep the default GUC we would have to allow non-super-user
changes so we can use it in pg_dump for restore, but even then if the
default is different there is going to be a mix of old/new after the
restore because table changes after the restore is going to use the new
default config.
Perhaps the best we could do is to tell people who change the default GUC
that they are on their own in restoring the database, or they have to be
very carful like with triggers to assign the configuration properly. I
can imagine how complex that part of the documentation will be, but it
is doable.
The other point is that we should have a good idea of the API because
if it gets into 8.3 it will be harder to change.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
The other point is that we should have a good idea of the API because
if it gets into 8.3 it will be harder to change.
Yeah, once it's in core we have a pretty strong backwards-compatibility
restriction to deal with. Someone upthread claimed "we can always
simplify it later" but that's exactly backward --- we can add features
later, but we can't subtract.
Maybe we should be looking to implement just the minimum set of features
for 8.3 and leave some of the more controversial stuff for 8.4. I hate
to admit it, but if we take that point of view then triggers are in
and functional-index support is out. We have to support the trigger
approach because it's what is in tsearch2 now, and the existing users
will expect to continue to have that option.
However, allowing the standard triggers to pay attention to a
configuration GUC variable is simply broken; that bit has to go away.
regards, tom lane
Tom Lane wrote:
* For queries, there is not anything very wrong with having a default
configuration, but the trick is how to get the planner to match that up
with an index that's written with the two-parameter form of to_tsvector.
One hackish possibility is to define the single-parameter form of
to_tsvector like this:create function to_tsvector(text) returns tsvector as $$
select to_tsvector(get_default_text_search_config(), $1)
$$ language sql strict stable;where get_default_text_search_config() is essentially just
current_setting('default_text_search_config') except it is
misleadingly marked immutable. Then, a query withWHERE to_tsvector(document_col) @@ tsquery(...)
will have the SQL function inlined, and the
get_default_text_search_config() call const-folded, and suddenly
it looks likeWHERE to_tsvector('english', document_col) @@ tsquery(...)
This is an interesting idea that would allow queries without a
configuration to match an expression index. The only trick is to train
users not to use such tricks on triggers or expression index, but only
in queries. The idea of 'stable' helps to fix that for expression
indexes, but not for trigger use, I assume.
* I'm not enamored of Heikki's idea about a datatype-based solution,
because I don't think that it will allow columns containing tsvectors
derived with different configurations. It's perfectly clear that
advanced users want to be able to do that, and it's also clear that
as long as the config name is coming from a stored column (or can be
reconstructed somehow from the stored data) that it's perfectly
well-defined.Since I don't think that a datatype solution is the way to go,
I don't feel that we are as far away from an agreement as Bruce
is worried about.
Well, from where I sit, there is one person saying give me the foot gun,
and Heikki saying he wants a bullet-proof type system, and you and I are
in the middle, so the big problem is I don't see a concensus forming,
and we have been discussing this for a while.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
Since I don't think that a datatype solution is the way to go,
I don't feel that we are as far away from an agreement as Bruce
is worried about.
Well, from where I sit, there is one person saying give me the foot gun,
and Heikki saying he wants a bullet-proof type system, and you and I are
in the middle, so the big problem is I don't see a concensus forming,
and we have been discussing this for a while.
The people who actually use tsearch2 seem to all have the same opinion ...
so I think we can't go too far in the bullet-proofing direction.
But I would like a design that is bulletproof in dump/reload scenarios,
and I think it's fair to question that aspect of the tsearch2 design
because we've seen many reports of people having trouble updating
databases that use tsearch2.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
The other point is that we should have a good idea of the API because
if it gets into 8.3 it will be harder to change.Yeah, once it's in core we have a pretty strong backwards-compatibility
restriction to deal with. Someone upthread claimed "we can always
simplify it later" but that's exactly backward --- we can add features
later, but we can't subtract.Maybe we should be looking to implement just the minimum set of features
for 8.3 and leave some of the more controversial stuff for 8.4. I hate
to admit it, but if we take that point of view then triggers are in
and functional-index support is out. We have to support the trigger
approach because it's what is in tsearch2 now, and the existing users
will expect to continue to have that option.
Triggers and expression indexes were both in the documentation Oleg
supplied, so I am sure both are being used. I bet some users don't even
know they are using expression indexes because creating a GIN index on a
column automatically casts to tsvector. (But GIST does not.) I had to
ask Oleg to find out this out.
However, allowing the standard triggers to pay attention to a
configuration GUC variable is simply broken; that bit has to go away.
The only trigger example supplied by Oleg and Teodor is
tsvector_update_trigger(), and that doesn't take a configuration name,
meaning it uses the default GUC configuration.
Uh, how are we going to prevent the auto-casting to tsvector from using
the default GUC config, e.g. CREATE INDEX i ON x USING GIN(col)?
This is where I started to see the need for education and error-prone
nature of the default GUC just wasn't worth having it, though I know
others disagree.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
Since I don't think that a datatype solution is the way to go,
I don't feel that we are as far away from an agreement as Bruce
is worried about.Well, from where I sit, there is one person saying give me the foot gun,
and Heikki saying he wants a bullet-proof type system, and you and I are
in the middle, so the big problem is I don't see a concensus forming,
and we have been discussing this for a while.The people who actually use tsearch2 seem to all have the same opinion ...
so I think we can't go too far in the bullet-proofing direction.But I would like a design that is bulletproof in dump/reload scenarios,
and I think it's fair to question that aspect of the tsearch2 design
because we've seen many reports of people having trouble updating
databases that use tsearch2.
Yea, look at the trouble we are having trying to underestand it all.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:
The people who actually use tsearch2 seem to all have the same opinion ...
so I think we can't go too far in the bullet-proofing direction.
Yeah.
But I would like a design that is bulletproof in dump/reload scenarios,
and I think it's fair to question that aspect of the tsearch2 design
because we've seen many reports of people having trouble updating
databases that use tsearch2.Yea, look at the trouble we are having trying to underestand it all.
True. But I wasn't too concerned about the forecast difficulties with
data only dumps. Those fail in plenty of circumstances. It is important
that there is *some* reliable dump/restore/upgrade path, though.
cheers
andrew
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
Since I don't think that a datatype solution is the way to go,
I don't feel that we are as far away from an agreement as Bruce
is worried about.Well, from where I sit, there is one person saying give me the foot gun,
and Heikki saying he wants a bullet-proof type system, and you and I are
in the middle, so the big problem is I don't see a concensus forming,
and we have been discussing this for a while.The people who actually use tsearch2 seem to all have the same opinion ...
so I think we can't go too far in the bullet-proofing direction.But I would like a design that is bulletproof in dump/reload scenarios,
and I think it's fair to question that aspect of the tsearch2 design
because we've seen many reports of people having trouble updating
databases that use tsearch2.
dump/reload is *the* biggest problem I've had with tsearch2 so far. But
it hasn't been with the actual data - it's been the functions, and only
when migrating between versions. But solving dump/reload reliably is one
of the main things I'm hoping for in 8.3 ;-)
As for a nother use-pointer, I use different configurations in the same
database - but only one per table. I explicitly use the to_tsvector that
specifies a configuration always - to avoid surprising myself.
I don't use the functional index part, but for new users I can see how
that's certainly a *lot* easier. Requiring the specification of the
configuration explicitly when creating this index I don't see as a big
problem at all - compared to the work needed to set up triggers. But
it's nice not to have to do it when querying. But wouldn't that be
solved by having to_tsvector() require the configuration, but
to_tsquery() and plainto_tsquery() not require it?
//Magnus
On Wed, 15 Aug 2007, Magnus Hagander wrote:
Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
Since I don't think that a datatype solution is the way to go,
I don't feel that we are as far away from an agreement as Bruce
is worried about.Well, from where I sit, there is one person saying give me the foot gun,
and Heikki saying he wants a bullet-proof type system, and you and I are
in the middle, so the big problem is I don't see a concensus forming,
and we have been discussing this for a while.The people who actually use tsearch2 seem to all have the same opinion ...
so I think we can't go too far in the bullet-proofing direction.But I would like a design that is bulletproof in dump/reload scenarios,
and I think it's fair to question that aspect of the tsearch2 design
because we've seen many reports of people having trouble updating
databases that use tsearch2.dump/reload is *the* biggest problem I've had with tsearch2 so far. But
it hasn't been with the actual data - it's been the functions, and only
when migrating between versions. But solving dump/reload reliably is one
of the main things I'm hoping for in 8.3 ;-)
The dump/reload problem should be gone once tsearch2 became a part of core.
the problem is an inability to say what is a correct configuration in case
of expressional index when restoring. In any other case there are many
use cases when tsvector could be intentionally obtained using different
configurations.
As for a nother use-pointer, I use different configurations in the same
database - but only one per table. I explicitly use the to_tsvector that
specifies a configuration always - to avoid surprising myself.I don't use the functional index part, but for new users I can see how
that's certainly a *lot* easier. Requiring the specification of the
configuration explicitly when creating this index I don't see as a big
problem at all - compared to the work needed to set up triggers. But
it's nice not to have to do it when querying. But wouldn't that be
solved by having to_tsvector() require the configuration, but
to_tsquery() and plainto_tsquery() not require it?
or better to introduce novice-level interface with configuration name
required and insist on using it with expressional index (don't know
if there is a machinery to do so).
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Tue, 2007-08-14 at 17:41 -0400, Tom Lane wrote:
I've just finished re-reading the prior thread, and here are what seem
to me to be the salient points:* Oleg, Teodor, and all of the old-line users of tsearch2 are
comfortable with setting up a trigger to maintain a materialized
tsvector column for a table. They argue that recomputing the tsvector
(possibly more than once) during a query is more expensive than fetching
it from disk. My suspicion is that the latter argument gets weaker
every year --- CPUs are getting faster lots faster than disks are.* Bruce (and I ... not sure about anyone else) want to support usage of
text search via a functional index. This is argued to be easier to set
up (no fooling with triggers) and possibly faster depending on CPU vs
I/O speeds. I don't think there is any desire here to eliminate the
trigger approach, just to provide an alternative.
ISTM that the functional index would be considerably smaller than the
additional column approach, since tsvectors can be quite long. That
seems like a very desirable thing with larger textbases. However,
without an additional column certain queries would not be possible, such
as IndexScans on a non-text search index with an additional filter on
text search. So each way would be desirable in different situations.
Would it be wrong to allow both approaches? If there is strong
disagreement then it usually means both people are right.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com