BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd
The following bug has been logged on the website:
Bug reference: 8354
Logged by: Alex Hill
Email address: alex@hill.net.au
PostgreSQL version: 9.2.4
Operating system: OS X 10.8.4 Mountain Lion
Description:
Hi all,
The docs for ts_rank_cd state:
"This function requires positional information in its input. Therefore it
will not work on "stripped" tsvector values — it will always return zero."
However if a tsvector contains some stripped lexemes and some non-stripped,
ts_rank_cd will rank extents including the non-stripped values.
For example, this evaluates to zero as expected:
SELECT ts_rank_cd(strip(to_tsvector('text search')),
plainto_tsquery('text search'))
But this doesn't:
SELECT ts_rank_cd(to_tsvector('text') || strip(to_tsvector('search')),
plainto_tsquery('text search'))
I think this is a bug, if not in the code then in the documentation, which
isn't clear on what happens when stripped and positioned lexemes are mixed
in one tsvector.
I would prefer that stripped lexemes were completely ignored by ts_rank_cd:
my use case is using this as a fifth pseudo-weight, which matches a @@ query
but doesn't add to a ts_rank_cd ranking.
What do you think?
Cheers,
Alex
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Would someone please comment on this text search bug report? Thanks.
---------------------------------------------------------------------------
On Fri, Aug 2, 2013 at 07:03:42AM +0000, alex@hill.net.au wrote:
The following bug has been logged on the website:
Bug reference: 8354
Logged by: Alex Hill
Email address: alex@hill.net.au
PostgreSQL version: 9.2.4
Operating system: OS X 10.8.4 Mountain Lion
Description:Hi all,
The docs for ts_rank_cd state:
"This function requires positional information in its input. Therefore it
will not work on "stripped" tsvector values — it will always return zero."However if a tsvector contains some stripped lexemes and some non-stripped,
ts_rank_cd will rank extents including the non-stripped values.For example, this evaluates to zero as expected:
SELECT ts_rank_cd(strip(to_tsvector('text search')),
plainto_tsquery('text search'))But this doesn't:
SELECT ts_rank_cd(to_tsvector('text') || strip(to_tsvector('search')),
plainto_tsquery('text search'))I think this is a bug, if not in the code then in the documentation, which
isn't clear on what happens when stripped and positioned lexemes are mixed
in one tsvector.I would prefer that stripped lexemes were completely ignored by ts_rank_cd:
my use case is using this as a fifth pseudo-weight, which matches a @@ query
but doesn't add to a ts_rank_cd ranking.What do you think?
Cheers,
Alex--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Hi Bruce, all,
I think this can be solved (if it's agreed that it's a bug) in a pretty
straightforward way: when creating the document representation used in
calculating cover density rank, we can just skip lexemes with no position
entirely.
Fix and tests here: https://github.com/AlexHill/postgres/compare/bug_8354
As a patch file here:
https://github.com/AlexHill/postgres/commit/cd522b254d166d569b86803115f0f499864e949b.patch
Cheers,
Alex
On Sat, Feb 1, 2014 at 5:22 AM, Bruce Momjian <bruce@momjian.us> wrote:
Show quoted text
Would someone please comment on this text search bug report? Thanks.
---------------------------------------------------------------------------
On Fri, Aug 2, 2013 at 07:03:42AM +0000, alex@hill.net.au wrote:
The following bug has been logged on the website:
Bug reference: 8354
Logged by: Alex Hill
Email address: alex@hill.net.au
PostgreSQL version: 9.2.4
Operating system: OS X 10.8.4 Mountain Lion
Description:Hi all,
The docs for ts_rank_cd state:
"This function requires positional information in its input. Therefore it
will not work on "stripped" tsvector values -- it will always returnzero."
However if a tsvector contains some stripped lexemes and some
non-stripped,
ts_rank_cd will rank extents including the non-stripped values.
For example, this evaluates to zero as expected:
SELECT ts_rank_cd(strip(to_tsvector('text search')),
plainto_tsquery('text search'))But this doesn't:
SELECT ts_rank_cd(to_tsvector('text') ||
strip(to_tsvector('search')),
plainto_tsquery('text search'))
I think this is a bug, if not in the code then in the documentation,
which
isn't clear on what happens when stripped and positioned lexemes are
mixed
in one tsvector.
I would prefer that stripped lexemes were completely ignored by
ts_rank_cd:
my use case is using this as a fifth pseudo-weight, which matches a @@
query
but doesn't add to a ts_rank_cd ranking.
What do you think?
Cheers,
Alex--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com+ Everyone has their own god. +
On Fri, Feb 7, 2014 at 02:07:59AM +0800, Alexander Hill wrote:
Hi Bruce, all,
I think this can be solved (if it's agreed that it's a bug) in a pretty
straightforward way: when creating the document representation used in
calculating cover density rank, we can just skip lexemes with no position
entirely.Fix and tests here: https://github.com/AlexHill/postgres/compare/bug_8354
As a patch file here: https://github.com/AlexHill/postgres/commit/
cd522b254d166d569b86803115f0f499864e949b.patch
OK, great. I am attaching this patch. Would this change the contents
of any indexes? That would be a problem for pg_upgrade.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Attachments:
tsrank.patchtext/x-diff; charset=us-asciiDownload+22-3
Hi Bruce,
In normal use this won't affect any data that ends up in an index.
The only exception would be where somebody has created an index over the
result of ts_rank_cd - indexing search rank for the most common search
terms, maybe?
But that's true of any function whose result is indexed. What's the upgrade
policy in that case?
Cheers,
Alex
On Tue, Feb 11, 2014 at 11:45 PM, Bruce Momjian <bruce@momjian.us> wrote:
Show quoted text
On Fri, Feb 7, 2014 at 02:07:59AM +0800, Alexander Hill wrote:
Hi Bruce, all,
I think this can be solved (if it's agreed that it's a bug) in a pretty
straightforward way: when creating the document representation used in
calculating cover density rank, we can just skip lexemes with no position
entirely.Fix and tests here:
https://github.com/AlexHill/postgres/compare/bug_8354
As a patch file here: https://github.com/AlexHill/postgres/commit/
cd522b254d166d569b86803115f0f499864e949b.patchOK, great. I am attaching this patch. Would this change the contents
of any indexes? That would be a problem for pg_upgrade.--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com+ Everyone has their own god. +
On Wed, Feb 12, 2014 at 09:49:32AM +0800, Alexander Hill wrote:
Hi Bruce,
In normal use this won't affect any data that ends up in an index.
The only exception would be where somebody has created an index over the result
of ts_rank_cd - indexing search rank for the most common search terms, maybe?
OK, good.
But that's true of any function whose result is indexed. What's the upgrade
policy in that case?
Well, we document that the function output changed, and if anyone is
using the function for an index, they need to rebuild the index, so I
think we are fine fixing this for 9.4.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Hi Bruce,
I've just updated the docs for ts_rank_cd in my fork to reflect this
change. Would be great if that could make it in too.
Patch here:
https://github.com/AlexHill/postgres/commit/3e5d86ea23d82b07188192411cdc5c5a5194a5ae.patch
Is there anything else I need to do to make sure this patch ends up in 9.4?
Cheers,
Alex
On Feb 12, 2014 10:31 PM, "Bruce Momjian" <bruce@momjian.us> wrote:
Show quoted text
On Wed, Feb 12, 2014 at 09:49:32AM +0800, Alexander Hill wrote:
Hi Bruce,
In normal use this won't affect any data that ends up in an index.
The only exception would be where somebody has created an index over the
result
of ts_rank_cd - indexing search rank for the most common search terms,
maybe?
OK, good.
But that's true of any function whose result is indexed. What's the
upgrade
policy in that case?
Well, we document that the function output changed, and if anyone is
using the function for an index, they need to rebuild the index, so I
think we are fine fixing this for 9.4.--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com+ Everyone has their own god. +
On Fri, Mar 21, 2014 at 02:29:26PM +0800, Alexander Hill wrote:
Hi Bruce,
I've just updated the docs for ts_rank_cd in my fork to reflect this change.
Would be great if that could make it in too.Patch here: https://github.com/AlexHill/postgres/commit/
3e5d86ea23d82b07188192411cdc5c5a5194a5ae.patchIs there anything else I need to do to make sure this patch ends up in 9.4?
OK, combined patch attached. Is this correct? I keep hoping someone
else will chime in with an opinion on this, but at this point I think
what you have done is clear enough for me to commit for 9.4.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Attachments:
rank.difftext/x-diff; charset=us-asciiDownload+30-10
Bruce Momjian <bruce@momjian.us> writes:
OK, combined patch attached. Is this correct? I keep hoping someone
else will chime in with an opinion on this, but at this point I think
what you have done is clear enough for me to commit for 9.4.
That whole function is seriously undercommented; I suppose it's not this
patch's charter to fix that, but we could at least write
! /* ignore words without positions */
! entry++;
! continue;
The proposed new documentation text seems pretty badly written. How about
This function requires lexeme positional information to perform its
calculation. Therefore it ignores any <quote>stripped</> lexemes in the
<type>tsvector</>. If there are no unstripped lexemes in the input, the
result will be zero.
The parenthetical "See" text is ok.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Fri, Mar 21, 2014 at 10:45:18AM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
OK, combined patch attached. Is this correct? I keep hoping someone
else will chime in with an opinion on this, but at this point I think
what you have done is clear enough for me to commit for 9.4.That whole function is seriously undercommented; I suppose it's not this
patch's charter to fix that, but we could at least write! /* ignore words without positions */
! entry++;
! continue;The proposed new documentation text seems pretty badly written. How about
This function requires lexeme positional information to perform its
calculation. Therefore it ignores any <quote>stripped</> lexemes in the
<type>tsvector</>. If there are no unstripped lexemes in the input, the
result will be zero.The parenthetical "See" text is ok.
OK, thanks. Updated patch attached. If Alexander Hill would like to
add more C comments to any of the ts*.c files, or docs explaining what
"cover desity" is, I would be glad to include those in the patch as
well.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Attachments:
rank.difftext/x-diff; charset=us-asciiDownload+35-15
On Fri, Mar 21, 2014 at 06:10:15PM -0400, Bruce Momjian wrote:
On Fri, Mar 21, 2014 at 10:45:18AM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
OK, combined patch attached. Is this correct? I keep hoping someone
else will chime in with an opinion on this, but at this point I think
what you have done is clear enough for me to commit for 9.4.That whole function is seriously undercommented; I suppose it's not this
patch's charter to fix that, but we could at least write! /* ignore words without positions */
! entry++;
! continue;The proposed new documentation text seems pretty badly written. How about
This function requires lexeme positional information to perform its
calculation. Therefore it ignores any <quote>stripped</> lexemes in the
<type>tsvector</>. If there are no unstripped lexemes in the input, the
result will be zero.The parenthetical "See" text is ok.
OK, thanks. Updated patch attached. If Alexander Hill would like to
add more C comments to any of the ts*.c files, or docs explaining what
"cover desity" is, I would be glad to include those in the patch as
well.
Patch applied with this commit message:
Fix ts_rank_cd() to ignore stripped lexemes
Previously, stripped lexemes got a default location and could be
considered if mixed with non-stripped lexemes.
BACKWARD INCOMPATIBILITY CHANGE
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Mon, Mar 24, 2014 at 02:37:30PM -0400, Bruce Momjian wrote:
OK, thanks. Updated patch attached. If Alexander Hill would like to
add more C comments to any of the ts*.c files, or docs explaining what
"cover density" is, I would be glad to include those in the patch as
well.Patch applied with this commit message:
Fix ts_rank_cd() to ignore stripped lexemes
Previously, stripped lexemes got a default location and could be
considered if mixed with non-stripped lexemes.BACKWARD INCOMPATIBILITY CHANGE
FYI, I forgot to mention Alex Hill's name in this commit, so I have
added a little docs around cover density and mentioned his name as
author of the previous patch so we can reflect that in the 9.4 release
notes.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs