TS: Limited cover density ranking
Hello,
I have developed a variation of cover density ranking functions that counts only covers that are lesser than a specified limit. It is useful for finding combinations of terms that appear nearby one another. Here is an example of usage:
-- normal cover density ranking : not changed
luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
ts_rank_cd
------------
0.0333333
(1 row)
-- limited to 2
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
ts_rank_cd
------------
0
(1 row)
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
ts_rank_cd
------------
0.1
(1 row)
-- limited to 3
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
ts_rank_cd
------------
0.0333333
(1 row)
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
ts_rank_cd
------------
0.133333
(1 row)
Find attached a path agains 9.1.2 sources. I preferred to make a patch, not a separate extension because it is only 1 statement change in calc_rank_cd function. If I have to make an extension a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.
I have some questions:
1. Is it interesting to develop it further (documentation, cleanup, etc) for inclusion in one of the next versions? If this is the case, there are some further questions:
- should I overload ts_rank_cd (as in examples above and the patch) or should I define new set of functions, for example ts_rank_lcd ?
- should I define define this new sql level functions in core or should I go only with this 2 lines change in calc_rank_cd() and define the new functions as an extension? If we prefer the later, could I overload core functions with functions defined in extensions?
- and finally there is always the possibility to duplicate the code and make an independent extension.
2. If I run the patched version on cluster that was initialized with unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster?
Best regards
luben
--
Luben Karavelov
And here is the patch, that I forgot to attach
Hello,
I have developed a variation of cover density ranking functions that counts only covers that are lesser than a specified limit. It is useful for finding combinations of terms that appear nearby one another. Here is an example of usage:
...
Find attached a path agains 9.1.2 sources. I preferred to make a patch, not a separate extension because it is only 1 statement change in calc_rank_cd function. If I have to make an extension a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.
--
Luben Karavelov
Attachments:
ts_rank_lcd.difftext/x-diff; name=ts_rank_lcd.diffDownload
diff -pur postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c
--- postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c 2011-12-01 23:47:20.000000000 +0200
+++ /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c 2012-01-27 07:45:34.558028176 +0200
@@ -724,7 +724,7 @@ get_docrep(TSVector txt, QueryRepresenta
}
static float4
-calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)
+calc_rank_cd(int limit, float4 *arrdata, TSVector txt, TSQuery query, int method)
{
DocRepresentation *doc;
int len,
@@ -768,6 +768,9 @@ calc_rank_cd(float4 *arrdata, TSVector t
int nNoise;
DocRepresentation *ptr = ext.begin;
+ if (limit > 0 && ext.end->pos - ext.begin->pos > limit)
+ continue;
+
while (ptr <= ext.end)
{
InvSum += invws[ptr->wclass];
@@ -834,7 +837,7 @@ ts_rankcd_wttf(PG_FUNCTION_ARGS)
int method = PG_GETARG_INT32(3);
float res;
- res = calc_rank_cd(getWeights(win), txt, query, method);
+ res = calc_rank_cd(0, getWeights(win), txt, query, method);
PG_FREE_IF_COPY(win, 0);
PG_FREE_IF_COPY(txt, 1);
@@ -850,7 +853,7 @@ ts_rankcd_wtt(PG_FUNCTION_ARGS)
TSQuery query = PG_GETARG_TSQUERY(2);
float res;
- res = calc_rank_cd(getWeights(win), txt, query, DEF_NORM_METHOD);
+ res = calc_rank_cd(0, getWeights(win), txt, query, DEF_NORM_METHOD);
PG_FREE_IF_COPY(win, 0);
PG_FREE_IF_COPY(txt, 1);
@@ -866,7 +869,7 @@ ts_rankcd_ttf(PG_FUNCTION_ARGS)
int method = PG_GETARG_INT32(2);
float res;
- res = calc_rank_cd(getWeights(NULL), txt, query, method);
+ res = calc_rank_cd(0, getWeights(NULL), txt, query, method);
PG_FREE_IF_COPY(txt, 0);
PG_FREE_IF_COPY(query, 1);
@@ -880,9 +883,75 @@ ts_rankcd_tt(PG_FUNCTION_ARGS)
TSQuery query = PG_GETARG_TSQUERY(1);
float res;
- res = calc_rank_cd(getWeights(NULL), txt, query, DEF_NORM_METHOD);
+ res = calc_rank_cd(0, getWeights(NULL), txt, query, DEF_NORM_METHOD);
PG_FREE_IF_COPY(txt, 0);
PG_FREE_IF_COPY(query, 1);
PG_RETURN_FLOAT4(res);
}
+
+Datum
+ts_rankcd_lwttf(PG_FUNCTION_ARGS)
+{
+ int limit = PG_GETARG_INT32(0);
+ ArrayType *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1));
+ TSVector txt = PG_GETARG_TSVECTOR(2);
+ TSQuery query = PG_GETARG_TSQUERY(3);
+ int method = PG_GETARG_INT32(4);
+ float res;
+
+ res = calc_rank_cd(limit, getWeights(win), txt, query, method);
+
+ PG_FREE_IF_COPY(win, 1);
+ PG_FREE_IF_COPY(txt, 2);
+ PG_FREE_IF_COPY(query, 3);
+ PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_lwtt(PG_FUNCTION_ARGS)
+{
+ int limit = PG_GETARG_INT32(0);
+ ArrayType *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1));
+ TSVector txt = PG_GETARG_TSVECTOR(2);
+ TSQuery query = PG_GETARG_TSQUERY(3);
+ float res;
+
+ res = calc_rank_cd(limit, getWeights(win), txt, query, DEF_NORM_METHOD);
+
+ PG_FREE_IF_COPY(win, 1);
+ PG_FREE_IF_COPY(txt, 2);
+ PG_FREE_IF_COPY(query, 3);
+ PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_lttf(PG_FUNCTION_ARGS)
+{
+ int limit = PG_GETARG_INT32(0);
+ TSVector txt = PG_GETARG_TSVECTOR(1);
+ TSQuery query = PG_GETARG_TSQUERY(2);
+ int method = PG_GETARG_INT32(3);
+ float res;
+
+ res = calc_rank_cd(limit, getWeights(NULL), txt, query, method);
+
+ PG_FREE_IF_COPY(txt, 1);
+ PG_FREE_IF_COPY(query, 2);
+ PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_ltt(PG_FUNCTION_ARGS)
+{
+ int limit = PG_GETARG_INT32(0);
+ TSVector txt = PG_GETARG_TSVECTOR(1);
+ TSQuery query = PG_GETARG_TSQUERY(2);
+ float res;
+
+ res = calc_rank_cd(limit, getWeights(NULL), txt, query, DEF_NORM_METHOD);
+
+ PG_FREE_IF_COPY(txt, 1);
+ PG_FREE_IF_COPY(query, 2);
+ PG_RETURN_FLOAT4(res);
+}
diff -pur postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h
--- postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h 2011-12-01 23:47:20.000000000 +0200
+++ /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h 2012-01-27 05:45:53.944979678 +0200
@@ -4159,6 +4159,15 @@ DATA(insert OID = 3709 ( ts_rank_cd PGN
DESCR("relevance");
DATA(insert OID = 3710 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 2 0 700 "3614 3615" _null_ _null_ _null_ _null_ ts_rankcd_tt _null_ _null_ _null_ ));
DESCR("relevance");
+DATA(insert OID = 3675 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 5 0 700 "23 1021 3614 3615 23" _null_ _null_ _null_ _null_ ts_rankcd_lwttf _null_ _null_ _null_ ));
+DESCR("relevance");
+DATA(insert OID = 3676 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 4 0 700 "23 1021 3614 3615" _null_ _null_ _null_ _null_ ts_rankcd_lwtt _null_ _null_ _null_ ));
+DESCR("relevance");
+DATA(insert OID = 3677 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 4 0 700 "23 3614 3615 23" _null_ _null_ _null_ _null_ ts_rankcd_lttf _null_ _null_ _null_ ));
+DESCR("relevance");
+DATA(insert OID = 3678 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 3 0 700 "23 3614 3615" _null_ _null_ _null_ _null_ ts_rankcd_ltt _null_ _null_ _null_ ));
+DESCR("relevance");
+
DATA(insert OID = 3713 ( ts_token_type PGNSP PGUID 12 1 16 0 f f f t t i 1 0 2249 "26" "{26,23,25,25}" "{i,o,o,o}" "{parser_oid,tokid,alias,description}" _null_ ts_token_type_byid _null_ _null_ _null_ ));
DESCR("get parser's token types");
diff -pur postgresql-9.1-9.1.2/src/include/tsearch/ts_type.h /usr/src/postgresql-9.1-9.1.2/src/include/tsearch/ts_type.h
--- postgresql-9.1-9.1.2/src/include/tsearch/ts_type.h 2011-12-01 23:47:20.000000000 +0200
+++ /usr/src/postgresql-9.1-9.1.2/src/include/tsearch/ts_type.h 2012-01-27 05:41:24.741571183 +0200
@@ -152,6 +152,10 @@ extern Datum ts_rankcd_tt(PG_FUNCTION_AR
extern Datum ts_rankcd_wtt(PG_FUNCTION_ARGS);
extern Datum ts_rankcd_ttf(PG_FUNCTION_ARGS);
extern Datum ts_rankcd_wttf(PG_FUNCTION_ARGS);
+extern Datum ts_rankcd_ltt(PG_FUNCTION_ARGS);
+extern Datum ts_rankcd_lwtt(PG_FUNCTION_ARGS);
+extern Datum ts_rankcd_lttf(PG_FUNCTION_ARGS);
+extern Datum ts_rankcd_lwttf(PG_FUNCTION_ARGS);
extern Datum tsmatchsel(PG_FUNCTION_ARGS);
extern Datum tsmatchjoinsel(PG_FUNCTION_ARGS);Import Notes
Resolved by subject fallback
The rank counts 1/coversize. So bigger covers will not have much impact
anyway. What is the need of the patch?
-Sushant.
Show quoted text
On Fri, 2012-01-27 at 18:06 +0200, karavelov@mail.bg wrote:
Hello,
I have developed a variation of cover density ranking functions that
counts only covers that are lesser than a specified limit. It is
useful for finding combinations of terms that appear nearby one
another. Here is an example of usage:-- normal cover density ranking : not changed
luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'),
to_tsquery('a&d'));
ts_rank_cd
------------
0.0333333
(1 row)-- limited to 2
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'),
to_tsquery('a&d'));
ts_rank_cd
------------
0
(1 row)luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'),
to_tsquery('a&d'));
ts_rank_cd
------------
0.1
(1 row)-- limited to 3
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'),
to_tsquery('a&d'));
ts_rank_cd
------------
0.0333333
(1 row)luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'),
to_tsquery('a&d'));
ts_rank_cd
------------
0.133333
(1 row)Find attached a path agains 9.1.2 sources. I preferred to make a
patch, not a separate extension because it is only 1 statement change
in calc_rank_cd function. If I have to make an extension a lot of code
would be duplicated between backend/utils/adt/tsrank.c and the
extension.I have some questions:
1. Is it interesting to develop it further (documentation, cleanup,
etc) for inclusion in one of the next versions? If this is the case,
there are some further questions:- should I overload ts_rank_cd (as in examples above and the patch) or
should I define new set of functions, for example ts_rank_lcd ?
- should I define define this new sql level functions in core or
should I go only with this 2 lines change in calc_rank_cd() and define
the new functions as an extension? If we prefer the later, could I
overload core functions with functions defined in extensions?
- and finally there is always the possibility to duplicate the code
and make an independent extension.2. If I run the patched version on cluster that was initialized with
unpatched server, is there a way to register the new functions in the
system catalog without reinitializing the cluster?Best regards
luben--
Luben Karavelov
----- Цитат от Sushant Sinha (sushant354@gmail.com), на 27.01.2012 в 18:32 -----
The rank counts 1/coversize. So bigger covers will not have much impact
anyway. What is the need of the patch?-Sushant.
If you want to find only combinations of words that are close one to another, with the patch you could use something as:
WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('a&d') AS query)
SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3,vec,query)>0;
I could not find another way to make this type of queries. If there is an alternative, I am open to suggestions
Best regards
--
Luben Karavelov
Import Notes
Resolved by subject fallback
----- Цитат от karavelov@mail.bg, на 27.01.2012 в 18:48 -----
----- Цитат от Sushant Sinha (sushant354@gmail.com), на 27.01.2012 в 18:32 -----
The rank counts 1/coversize. So bigger covers will not have much impact
anyway. What is the need of the patch?-Sushant.
If you want to find only combinations of words that are close one to another, with the patch you could use something as:
WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('a&d') AS query)
SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3,vec,query)>0;
Another example, if you want to match 'b c d' only, you could use:
WITH A AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('b&c&d') AS query)
SELECT * FROM A WHERE vec @@ query AND ts_rank_cd(2,vec,query)>0;
The catch is that it will match also 'b d c', 'd c b', 'd b c', 'c d b' and 'd b d', so it is not a
replacement for exact phrase match but something that I find useful
--
Luben Karavelov
Import Notes
Resolved by subject fallback
I suggest you work on more general approach, see
http://www.sai.msu.su/~megera/wiki/2009-08-12 for example.
btw, I don't like you changed ts_rank_cd arguments.
Oleg
On Fri, 27 Jan 2012, karavelov@mail.bg wrote:
Hello,
I have developed a variation of cover density ranking functions that counts only covers that are lesser than a specified limit. It is useful for finding combinations of terms that appear nearby one another. Here is an example of usage:
-- normal cover density ranking : not changed
luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
ts_rank_cd
------------
0.0333333
(1 row)-- limited to 2
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
ts_rank_cd
------------
0
(1 row)luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
ts_rank_cd
------------
0.1
(1 row)-- limited to 3
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
ts_rank_cd
------------
0.0333333
(1 row)luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
ts_rank_cd
------------
0.133333
(1 row)Find attached a path agains 9.1.2 sources. I preferred to make a patch, not a separate extension because it is only 1 statement change in calc_rank_cd function. If I have to make an extension a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.
I have some questions:
1. Is it interesting to develop it further (documentation, cleanup, etc) for inclusion in one of the next versions? If this is the case, there are some further questions:
- should I overload ts_rank_cd (as in examples above and the patch) or should I define new set of functions, for example ts_rank_lcd ?
- should I define define this new sql level functions in core or should I go only with this 2 lines change in calc_rank_cd() and define the new functions as an extension? If we prefer the later, could I overload core functions with functions defined in extensions?
- and finally there is always the possibility to duplicate the code and make an independent extension.2. If I run the patched version on cluster that was initialized with unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster?
Best regards
luben--
Luben Karavelov
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
----- Цитат от Oleg Bartunov (oleg@sai.msu.su), на 28.01.2012 в 21:04 -----
I suggest you work on more general approach, see
http://www.sai.msu.su/~megera/wiki/2009-08-12 for example.btw, I don't like you changed ts_rank_cd arguments.
Hello Oleg,
Thanks for the feedback.
Is it OK to begin with adding an exta argument and check in calc_rank_cd?
I could change the function names in order not to overload ts_rank_cd
arguments. My proposition is :
at sql level:
ts_rank_lcd([weights], tsvector, tsquery, limit, [method])
at C level:
ts_ranklcd_wttlf
ts_ranklcd_wttl
ts_ranklcd_ttlf
ts_ranklcd_ttl
Adding the functions could be done as an extension but they are just
trampolines into calc_rank_cd().
I agree that what you describe in the wiki page is more general approach. So this :
SELECT ts_rank_lcd(to_tsvector('a b c'), to_tsquery('a&c'),2 )>0;
could be replaced with
SELECT to_tsvector('a b c') @@ to_tsquery('(a ?2 c)|(c ?2 a) ');
but if we need to look for 3 or more nearby terms without order the tsquery
with '?' operator will became quite complicated. For example
SELECT tsvec @@
'(a ? b ? c) | (a ? c ? b) | (b ? a ? c) | (b ? c ? a) | (c ? a ? b) | (c ? b ? a)'::tsquery;
is the same as
SELECT ts_rank_lcd(tsvec, 'a&b&c'::tsquery,2)>0;
So this is the reason to think that the general approach does not exclude the the
usefulness of the approach that I am proposing.
Best regards
--
Luben Karavelov
Import Notes
Resolved by subject fallback