New function for tsquery creation
Dear all,
Postgres currently has a few functions for creating tsqueries for full
text search. The main one is to_tsquery, which can build a query with any
operators, but every operator must be specified explicitly for the query
to be correct. To make this easier, Postgres also provides
plainto_tsquery and phraseto_tsquery, which build tsqueries from plain
strings. However, they are not flexible enough.
Let me introduce a new function for creating full text search queries,
called 'queryto_tsquery'. It takes a 'Google-like' query string and
translates it to a tsquery.
The main features are the following:
* All text inside double quotes is treated as a phrase ("a b c" ->
  'a <-> b <-> c').
* A new operator, AROUND(N). It matches if the distance between the words
  (or possibly phrases) is less than or equal to N.
* An alias for '!' ('-rat' is the same as '!rat').
* An alias for '|' ('dog OR cat' is the same as 'dog | cat').
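To make the AROUND(N) rule concrete, here is a minimal standalone sketch of the distance test. It assumes each operand is represented by a sorted array of word positions and that the distance is the absolute difference between two positions, so the operand order does not matter. The function name around_match is hypothetical and this is not the code from the patch, which extends the existing phrase executor instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the patch: return true if some position
 * in left[] and some position in right[] differ by at most n.
 * Both arrays must be sorted in ascending order.
 */
bool
around_match(const int *left, size_t nleft,
             const int *right, size_t nright, int n)
{
    size_t      i = 0;
    size_t      j = 0;

    assert(n >= 0);

    while (i < nleft && j < nright)
    {
        int         diff = left[i] - right[j];

        if (diff < 0)
            diff = -diff;
        if (diff <= n)
            return true;

        /* Advance the list whose current position is smaller. */
        if (left[i] < right[j])
            i++;
        else
            j++;
    }
    return false;
}
```

With illustrative positions 3 and 9, the test fails for N = 5 and succeeds for N = 6, whichever operand is given first.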
Like plainto_tsquery and phraseto_tsquery, it fills in missing operators
by itself, but operators already present in the input are not ignored.
This makes it possible to combine the two approaches.
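The rewriting rules above can be sketched as a small standalone C function. This is only an illustration under simplifying assumptions: it works on plain text, does no stemming or stop-word removal, ignores AROUND(N) and parentheses, and truncates very long words. The name web_to_tsquery_text and its helpers are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Append src to dst (total capacity cap); text that does not fit is dropped. */
static void
append(char *dst, size_t cap, const char *src)
{
    size_t      used = strlen(dst);

    if (used + 1 < cap)
        strncat(dst, src, cap - used - 1);
}

/* Append the len bytes at w as a quoted lexeme, e.g. 'cat'. */
static void
append_word(char *dst, size_t cap, const char *w, size_t len)
{
    char        buf[64];

    snprintf(buf, sizeof(buf), "'%.*s'", (int) len, w);
    append(dst, cap, buf);
}

/* Hypothetical helper: rewrite a web-style query string as tsquery text. */
void
web_to_tsquery_text(const char *in, char *out, size_t cap)
{
    const char *p = in;
    const char *op = " & ";     /* operator placed before the next operand */

    assert(cap > 0);
    out[0] = '\0';

    while (*p != '\0')
    {
        char        opnd[256] = "";
        const char *end = NULL;
        int         quoted = 0;
        int         nwords = 0;

        if (isspace((unsigned char) *p))
        {
            p++;
            continue;
        }
        if (out[0] != '\0' && strncmp(p, "OR ", 3) == 0)
        {
            op = " | ";         /* explicit OR replaces the implicit AND */
            p += 3;
            continue;
        }
        if (*p == '-')
        {
            append(opnd, sizeof(opnd), "!");    /* '-' is an alias for '!' */
            p++;
        }
        if (*p == '"')
        {
            end = strchr(p + 1, '"');
            quoted = (end != NULL);
            p++;                /* skip the opening (or unmatched) quote */
        }
        if (!quoted)
            for (end = p; *end != '\0' && *end != '"' &&
                 !isspace((unsigned char) *end); end++)
                ;

        /* Join the words of the operand with the phrase operator. */
        while (p < end)
        {
            const char *w;

            while (p < end && isspace((unsigned char) *p))
                p++;
            w = p;
            while (p < end && !isspace((unsigned char) *p))
                p++;
            if (p > w)
            {
                if (nwords++ > 0)
                    append(opnd, sizeof(opnd), " <-> ");
                append_word(opnd, sizeof(opnd), w, (size_t) (p - w));
            }
        }
        if (quoted)
            p = end + 1;        /* step over the closing quote */

        if (nwords > 0)
        {
            if (out[0] != '\0')
                append(out, cap, op);
            append(out, cap, opnd);
            op = " & ";         /* default for the next operand */
        }
    }
}
```

Explicit operators are kept and only the gaps between operands are filled with AND, which is the combination of the two approaches described above.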
The attached patch contains the new features, together with tests and
documentation.
What do you think about it?
Thank you very much for your attention!
--
------
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery.patch (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e073f7b..d6fb4ce 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9494,6 +9494,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<row>
<entry>
<indexterm>
+ <primary>queryto_tsquery</primary>
+ </indexterm>
+ <literal><function>queryto_tsquery(<optional> <replaceable class="PARAMETER">config</> <type>regconfig</> , </optional> <replaceable class="PARAMETER">query</> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</> from a Google-like query</entry>
+ <entry><literal>queryto_tsquery('english', 'The Fat Rats')</literal></entry>
+ <entry><literal>'fat' & 'rat'</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
<primary>querytree</primary>
</indexterm>
<literal><function>querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>)</function></literal>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index fe630a6..999e4ad 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,15 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>queryto_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
<function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ about its input. <function>queryto_tsquery</function> provides a
+ different, Google-like syntax for creating a <type>tsquery</type>.
</para>
<indexterm>
@@ -960,8 +962,68 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
-----------------------------
'fat' <-> 'rat' <-> 'c'
</screen>
+</para>
+
+<synopsis>
+queryto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
+</synopsis>
+
+ <para>
+ <function>queryto_tsquery</> creates a <type>tsquery</type> from unformatted text.
+ Unlike <function>plainto_tsquery</> and <function>phraseto_tsquery</>, it does not
+ ignore operators already present in the input. This function supports the following operators:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ '"some text"' - any text inside double quotes is treated as a phrase and is
+ processed as in <function>phraseto_tsquery</>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ 'OR' - standard logical operator. It is just an alias for the '|' sign.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ 'terma AROUND(N) termb' - this operation matches if the distance between
+ terma and termb is no greater than N.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ '-' - standard logical negation sign. It is an alias for the '!' sign.
+ </para>
+ </listitem>
+ </itemizedlist>
+ Where no operator is given, AND is assumed, as in <function>plainto_tsquery</>.
</para>
+ <para>
+ Examples:
+ <screen>
+ select queryto_tsquery('The fat rats');
+ queryto_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select queryto_tsquery('"supernovae stars" AND -crab');
+ queryto_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select queryto_tsquery('-run AROUND(5) "gnu debugger" OR "I like bananas"');
+ queryto_tsquery
+ -----------------------------------------------------------
+ !'run' AROUND(5) 'gnu' <-> 'debugg' | 'like' <-> 'banana'
+ (1 row)
+ </screen>
+ </para>
+
</sect2>
<sect2 id="textsearch-ranking">
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index 18368d1..10fd8c3 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -414,7 +414,8 @@ add_to_tsvector(void *_state, char *elem_value, int elem_len)
* and different variants are ORed together.
*/
static void
-pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval, int16 weight, bool prefix)
+pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
+ int16 weight, bool prefix, bool isphrase)
{
int32 count = 0;
ParsedText prs;
@@ -447,7 +448,12 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
/* put placeholders for each missing stop word */
pushStop(state);
if (cntpos)
- pushOperator(state, data->qoperator, 1);
+ {
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
+ }
cntpos++;
pos++;
}
@@ -488,7 +494,10 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
if (cntpos)
{
/* distance may be useful */
- pushOperator(state, data->qoperator, 1);
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
}
cntpos++;
@@ -514,6 +523,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
+ false,
false);
PG_RETURN_TSQUERY(query);
@@ -544,7 +554,8 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_POINTER(query);
}
@@ -575,7 +586,8 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_TSQUERY(query);
}
@@ -591,3 +603,36 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+queryto_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ false,
+ true);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+queryto_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(queryto_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+}
+
diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 046f543..375cb5c 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -396,6 +396,7 @@ tsquery_opr_selec(QueryItem *item, char *operand,
break;
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
s1 = tsquery_opr_selec(item + 1, operand,
lookup, length, minfreq);
diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index 83a939d..7dc8f36 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -239,6 +239,7 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
return !result;
case OP_PHRASE:
+ case OP_AROUND:
/*
* GIN doesn't contain any information about positions, so treat
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 3e2fc6e..b9cb7ea 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -29,6 +29,7 @@ const int tsearch_op_priority[OP_COUNT] =
4, /* OP_NOT */
2, /* OP_AND */
1, /* OP_OR */
+ 3, /* OP_AROUND */
3 /* OP_PHRASE */
};
@@ -58,10 +59,11 @@ struct TSQueryParserStateData
};
/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
+#define WAITOPERAND 1
+#define WAITOPERATOR 2
+#define WAITFIRSTOPERAND 3
+#define WAITSINGLEOPERAND 4
+#define INSIDEQUOTES 5
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
@@ -210,6 +212,69 @@ typedef enum
PT_CLOSE = 5
} ts_tokentype;
+
+static bool
+has_prefix(char * str, char * prefix)
+{
+ if (strlen(prefix) > strlen(str))
+ {
+ return false;
+ }
+ while (*prefix != '\0')
+ {
+ if (*(str++) != *(prefix++))
+ {
+ return false;
+ }
+ }
+ return true;
+}
+
+/*
+ * Parse the AROUND operator. The operator
+ * has the following form:
+ *
+ * a AROUND(X) b (distance is no greater than X)
+ *
+ * The buffer should begin with "AROUND(" prefix
+ */
+static char *
+parse_around_operator(char *buf, int16 *distance)
+{
+ char *ptr = buf;
+ char *endptr;
+ long l = 1;
+
+ Assert(has_prefix(ptr, "AROUND("));
+
+ ptr += strlen("AROUND(");
+
+ while (t_isspace(ptr))
+ ptr++;
+ errno = 0;
+ l = strtol(ptr, &endptr, 10);
+ if (ptr == endptr)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid AROUND(X) operator!")));
+ else if (errno == ERANGE || l > MAXENTRYPOS)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("distance in AROUND operator should not be greater than %d",
+ MAXENTRYPOS)));
+
+ ptr = endptr;
+ *distance = l;
+ while (t_isspace(ptr))
+ ptr++;
+
+ if (!t_iseq(ptr, ')'))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Missing ')' in AROUND(X) operator")));
+
+ return ++ptr;
+}
/*
* get token from query string
*
@@ -221,7 +286,8 @@ typedef enum
static ts_tokentype
gettoken_query(TSQueryParserState state,
int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+ int *lenval, char **strval, int16 *weight, bool *prefix,
+ bool isquery)
{
*weight = 0;
*prefix = false;
@@ -232,7 +298,7 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if (t_iseq(state->buf, '!') || (isquery && t_iseq(state->buf, '-')))
{
(state->buf)++; /* can safely ++, t_iseq guarantee
* that pg_mblen()==1 */
@@ -254,6 +320,20 @@ gettoken_query(TSQueryParserState state,
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (isquery && t_iseq(state->buf, '"'))
+ {
+ char *quote = strchr(state->buf + 1, '"');
+ if (quote == NULL)
+ {
+ state->buf++;
+ continue;
+ }
+ *strval = state->buf + 1;
+ *lenval = quote - state->buf - 1;
+ state->buf = quote + 1;
+ state->state = INSIDEQUOTES;
+ return PT_VAL;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -291,6 +371,13 @@ gettoken_query(TSQueryParserState state,
(state->buf)++;
return PT_OPR;
}
+ else if (isquery && has_prefix(state->buf, "OR "))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ (state->buf) += 3;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, '<'))
{
state->state = WAITOPERAND;
@@ -301,14 +388,39 @@ gettoken_query(TSQueryParserState state,
return PT_ERR;
return PT_OPR;
}
+ else if (isquery && has_prefix(state->buf, "AROUND("))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AROUND;
+ /* weight var is used as storage for distance */
+ state->buf = parse_around_operator(state->buf, weight);
+ if (*weight < 0)
+ return PT_ERR;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, ')'))
{
(state->buf)++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
+ else if (t_iseq(state->buf, '('))
+ {
+ *operator = OP_AND;
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
else if (*(state->buf) == '\0')
return (state->count) ? PT_ERR : PT_END;
+ else if (isquery &&
+ (t_isalpha(state->buf) || t_iseq(state->buf, '!')
+ || t_iseq(state->buf, '-')
+ || t_iseq(state->buf, '"')))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
else if (!t_isspace(state->buf))
return PT_ERR;
break;
@@ -320,6 +432,9 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
+ case INSIDEQUOTES:
+ state->state = WAITOPERATOR;
+ continue;
default:
return PT_ERR;
break;
@@ -336,12 +451,12 @@ pushOperator(TSQueryParserState state, int8 oper, int16 distance)
{
QueryOperator *tmp;
- Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR || oper == OP_PHRASE);
+ Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR || oper == OP_PHRASE || oper == OP_AROUND);
tmp = (QueryOperator *) palloc0(sizeof(QueryOperator));
tmp->type = QI_OPR;
tmp->oper = oper;
- tmp->distance = (oper == OP_PHRASE) ? distance : 0;
+ tmp->distance = (oper == OP_PHRASE || oper == OP_AROUND) ? distance : 0;
/* left is filled in later with findoprnd */
state->polstr = lcons(tmp, state->polstr);
@@ -475,7 +590,8 @@ cleanOpStack(TSQueryParserState state,
static void
makepol(TSQueryParserState state,
PushFunction pushval,
- Datum opaque)
+ Datum opaque,
+ bool isquery)
{
int8 operator = 0;
ts_tokentype type;
@@ -489,19 +605,19 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix, isquery)) != PT_END)
{
switch (type)
{
case PT_VAL:
- pushval(opaque, state, strval, lenval, weight, prefix);
+ pushval(opaque, state, strval, lenval, weight, prefix, state->state == INSIDEQUOTES);
break;
case PT_OPR:
cleanOpStack(state, opstack, &lenstack, operator);
pushOpStack(opstack, &lenstack, operator, weight);
break;
case PT_OPEN:
- makepol(state, pushval, opaque);
+ makepol(state, pushval, opaque, isquery);
break;
case PT_CLOSE:
cleanOpStack(state, opstack, &lenstack, OP_OR /* lowest */ );
@@ -555,7 +671,8 @@ findoprnd_recurse(QueryItem *ptr, uint32 *pos, int nnodes, bool *needcleanup)
Assert(curitem->oper == OP_AND ||
curitem->oper == OP_OR ||
- curitem->oper == OP_PHRASE);
+ curitem->oper == OP_PHRASE ||
+ curitem->oper == OP_AROUND);
(*pos)++;
@@ -605,7 +722,8 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ bool isplain,
+ bool isquery)
{
struct TSQueryParserStateData state;
int i;
@@ -632,7 +750,7 @@ parse_tsquery(char *buf,
*(state.curop) = '\0';
/* parse query & make polish notation (postfix, but in reverse order) */
- makepol(&state, pushval, opaque);
+ makepol(&state, pushval, opaque, isquery);
close_tsvector_parser(state.valstate);
@@ -703,7 +821,7 @@ parse_tsquery(char *buf,
static void
pushval_asis(Datum opaque, TSQueryParserState state, char *strval, int lenval,
- int16 weight, bool prefix)
+ int16 weight, bool prefix, bool isphrase)
{
pushValue(state, strval, lenval, weight, prefix);
}
@@ -716,7 +834,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false, false));
}
/*
@@ -884,6 +1002,9 @@ infix(INFIX *in, int parentPriority, bool rightPhraseOp)
else
sprintf(in->cur, " <-> %s", nrm.buf);
break;
+ case OP_AROUND:
+ sprintf(in->cur, " AROUND(%d) %s", distance, nrm.buf);
+ break;
default:
/* OP_NOT is handled in above if-branch */
elog(ERROR, "unrecognized operator type: %d", op);
@@ -966,7 +1087,7 @@ tsquerysend(PG_FUNCTION_ARGS)
break;
case QI_OPR:
pq_sendint(&buf, item->qoperator.oper, sizeof(item->qoperator.oper));
- if (item->qoperator.oper == OP_PHRASE)
+ if (item->qoperator.oper == OP_PHRASE || item->qoperator.oper == OP_AROUND)
pq_sendint(&buf, item->qoperator.distance,
sizeof(item->qoperator.distance));
break;
@@ -1062,14 +1183,14 @@ tsqueryrecv(PG_FUNCTION_ARGS)
int8 oper;
oper = (int8) pq_getmsgint(buf, sizeof(int8));
- if (oper != OP_NOT && oper != OP_OR && oper != OP_AND && oper != OP_PHRASE)
+ if (oper != OP_NOT && oper != OP_OR && oper != OP_AND && oper != OP_PHRASE && oper != OP_AROUND)
elog(ERROR, "invalid tsquery: unrecognized operator type %d",
(int) oper);
if (i == size - 1)
elog(ERROR, "invalid pointer to right operand");
item->qoperator.oper = oper;
- if (oper == OP_PHRASE)
+ if (oper == OP_PHRASE || oper == OP_AROUND)
item->qoperator.distance = (int16) pq_getmsgint(buf, sizeof(int16));
}
else
diff --git a/src/backend/utils/adt/tsquery_cleanup.c b/src/backend/utils/adt/tsquery_cleanup.c
index 350171c..ca99a27 100644
--- a/src/backend/utils/adt/tsquery_cleanup.c
+++ b/src/backend/utils/adt/tsquery_cleanup.c
@@ -161,7 +161,8 @@ clean_NOT_intree(NODE *node)
NODE *res = node;
Assert(node->valnode->qoperator.oper == OP_AND ||
- node->valnode->qoperator.oper == OP_PHRASE);
+ node->valnode->qoperator.oper == OP_PHRASE ||
+ node->valnode->qoperator.oper == OP_AROUND);
node->left = clean_NOT_intree(node->left);
node->right = clean_NOT_intree(node->right);
@@ -277,7 +278,8 @@ clean_stopword_intree(NODE *node, int *ladd, int *radd)
node->right = clean_stopword_intree(node->right, &rladd, &rradd);
/* Check if current node is OP_PHRASE, get its distance */
- isphrase = (node->valnode->qoperator.oper == OP_PHRASE);
+ isphrase = (node->valnode->qoperator.oper == OP_PHRASE
+ || node->valnode->qoperator.oper == OP_AROUND);
ndistance = isphrase ? node->valnode->qoperator.distance : 0;
if (node->left == NULL && node->right == NULL)
diff --git a/src/backend/utils/adt/tsquery_op.c b/src/backend/utils/adt/tsquery_op.c
index 755c3e9..7cf9b8a 100644
--- a/src/backend/utils/adt/tsquery_op.c
+++ b/src/backend/utils/adt/tsquery_op.c
@@ -37,7 +37,7 @@ join_tsqueries(TSQuery a, TSQuery b, int8 operator, uint16 distance)
res->valnode = (QueryItem *) palloc0(sizeof(QueryItem));
res->valnode->type = QI_OPR;
res->valnode->qoperator.oper = operator;
- if (operator == OP_PHRASE)
+ if (operator == OP_PHRASE || operator == OP_AROUND)
res->valnode->qoperator.distance = distance;
res->child = (QTNode **) palloc0(sizeof(QTNode *) * 2);
diff --git a/src/backend/utils/adt/tsquery_util.c b/src/backend/utils/adt/tsquery_util.c
index 971bb81..548a846 100644
--- a/src/backend/utils/adt/tsquery_util.c
+++ b/src/backend/utils/adt/tsquery_util.c
@@ -121,7 +121,7 @@ QTNodeCompare(QTNode *an, QTNode *bn)
return res;
}
- if (ao->oper == OP_PHRASE && ao->distance != bo->distance)
+ if ((ao->oper == OP_PHRASE || ao->oper == OP_AROUND) && ao->distance != bo->distance)
return (ao->distance > bo->distance) ? -1 : 1;
return 0;
@@ -171,7 +171,8 @@ QTNSort(QTNode *in)
for (i = 0; i < in->nchild; i++)
QTNSort(in->child[i]);
- if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE)
+ if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE
+ && in->valnode->qoperator.oper != OP_AROUND)
qsort((void *) in->child, in->nchild, sizeof(QTNode *), cmpQTN);
}
diff --git a/src/backend/utils/adt/tsrank.c b/src/backend/utils/adt/tsrank.c
index 76e5e54..3637ac1 100644
--- a/src/backend/utils/adt/tsrank.c
+++ b/src/backend/utils/adt/tsrank.c
@@ -366,7 +366,8 @@ calc_rank(const float *w, TSVector t, TSQuery q, int32 method)
/* XXX: What about NOT? */
res = (item->type == QI_OPR && (item->qoperator.oper == OP_AND ||
- item->qoperator.oper == OP_PHRASE)) ?
+ item->qoperator.oper == OP_PHRASE ||
+ item->qoperator.oper == OP_AROUND)) ?
calc_rank_and(w, t, q) :
calc_rank_or(w, t, q);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index c694637..0543e95 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1429,6 +1429,7 @@ checkcondition_str(void *checkval, QueryOperand *val, ExecPhraseData *data)
#define TSPO_L_ONLY 0x01 /* emit positions appearing only in L */
#define TSPO_R_ONLY 0x02 /* emit positions appearing only in R */
#define TSPO_BOTH 0x04 /* emit positions appearing in both L&R */
+#define TS_NOT_EXAC 0x08 /* not exact distance for AROUND(X) */
static bool
TS_phrase_output(ExecPhraseData *data,
@@ -1473,8 +1474,18 @@ TS_phrase_output(ExecPhraseData *data,
Rpos = INT_MAX;
}
+ /* Processing OP_AROUND */
+ if ((emit & TS_NOT_EXAC) &&
+ Lpos - Rpos >= 0 &&
+ Lpos - Rpos <= (Loffset + Roffset) * 2 - Rdata->width + Ldata->width)
+ {
+ if (emit & TSPO_BOTH)
+ output_pos = Rpos;
+ Lindex++;
+ Rindex++;
+ }
/* Merge-join the two input lists */
- if (Lpos < Rpos)
+ else if (Lpos < Rpos)
{
/* Lpos is not matched in Rdata, should we output it? */
if (emit & TSPO_L_ONLY)
@@ -1625,6 +1636,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
}
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
memset(&Ldata, 0, sizeof(Ldata));
memset(&Rdata, 0, sizeof(Rdata));
@@ -1647,7 +1659,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
(Rdata.npos == 0 && !Rdata.negate))
return (flags & TS_EXEC_PHRASE_NO_POS) ? true : false;
- if (curitem->qoperator.oper == OP_PHRASE)
+ if (curitem->qoperator.oper == OP_PHRASE || curitem->qoperator.oper == OP_AROUND)
{
/*
* Compute Loffset and Roffset suitable for phrase match, and
@@ -1703,7 +1715,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
{
/* straight AND */
return TS_phrase_output(data, &Ldata, &Rdata,
- TSPO_BOTH,
+ TSPO_BOTH | (curitem->qoperator.oper == OP_AROUND ? TS_NOT_EXAC : 0),
Loffset, Roffset,
Min(Ldata.npos, Rdata.npos));
}
@@ -1843,6 +1855,7 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
return TS_execute(curitem + 1, arg, flags, chkcond);
case OP_PHRASE:
+ case OP_AROUND:
return TS_phrase_execute(curitem, arg, flags, chkcond, NULL);
default:
@@ -1882,6 +1895,7 @@ tsquery_requires_match(QueryItem *curitem)
return false;
case OP_PHRASE:
+ case OP_AROUND:
/*
* Treat OP_PHRASE as OP_AND here
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f22cc4f..fe14ba3 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4857,6 +4857,8 @@ DATA(insert OID = 3746 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2
DESCR("make tsquery");
DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( queryto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ queryto_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
@@ -4865,6 +4867,8 @@ DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1
DESCR("make tsquery");
DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( queryto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ queryto_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h
index 873e2e1..49f8e68 100644
--- a/src/include/tsearch/ts_type.h
+++ b/src/include/tsearch/ts_type.h
@@ -175,8 +175,9 @@ typedef struct
#define OP_NOT 1
#define OP_AND 2
#define OP_OR 3
+#define OP_AROUND 5
#define OP_PHRASE 4 /* highest code, tsquery_cleanup.c */
-#define OP_COUNT 4
+#define OP_COUNT 5
extern const int tsearch_op_priority[OP_COUNT];
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 20d90b1..cbd7c3f 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -45,11 +45,12 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
int16 tokenweights, /* bitmap as described
* in QueryOperand
* struct */
- bool prefix);
+ bool prefix,
+ bool isphrase);
extern TSQuery parse_tsquery(char *buf,
PushFunction pushval,
- Datum opaque, bool isplain);
+ Datum opaque, bool isplain, bool isquery);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index b2fc9e2..bdc3e14 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1661,3 +1661,115 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+--test queryto_tsquery function
+select queryto_tsquery('My brand new smartphone');
+ queryto_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select queryto_tsquery('My brand "new smartphone"');
+ queryto_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just eaten a -rat.');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just eaten OR -rat.');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just (eaten OR -rat)');
+ queryto_tsquery
+----------------------------------------
+ 'fat' <-> 'cat' & ( 'eaten' | !'rat' )
+(1 row)
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(5) runs');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(5) "gnu debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(6) runs');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(6) "gnu debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+ ?column?
+----------
+ t
+(1 row)
+
+select queryto_tsquery('"fat cat AROUND(5) rat"');
+ queryto_tsquery
+------------------------------------------------
+ 'fat' <-> 'cat' <-> 'around' <-> '5' <-> 'rat'
+(1 row)
+
+select queryto_tsquery('simple','"fat cat OR rat"');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select queryto_tsquery('fat*rat');
+ queryto_tsquery
+-----------------
+ 'fat' & 'rat'
+(1 row)
+
+select queryto_tsquery('fat-rat');
+ queryto_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index e4b21f8..75fbeda 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -535,3 +535,34 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+--test queryto_tsquery function
+select queryto_tsquery('My brand new smartphone');
+select queryto_tsquery('My brand "new smartphone"');
+select queryto_tsquery('"A fat cat" has just eaten a -rat.');
+select queryto_tsquery('"A fat cat" has just eaten OR -rat.');
+select queryto_tsquery('"A fat cat" has just (eaten OR -rat)');
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(5) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(5) "gnu debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(6) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(6) "gnu debugger"');
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+
+select queryto_tsquery('"fat cat AROUND(5) rat"');
+select queryto_tsquery('simple','"fat cat OR rat"');
+select queryto_tsquery('fat*rat');
+select queryto_tsquery('fat-rat');
\ No newline at end of file
On Wed, Jul 19, 2017 at 12:43 PM, Victor Drobny <v.drobny@postgrespro.ru> wrote:
Let me introduce new function for full text search query creation(which is
called 'queryto_tsquery'). It takes 'google like' query string and
translates it to tsquery.
I haven't looked at the code, but that sounds like a neat idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 20, 2017 at 4:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 19, 2017 at 12:43 PM, Victor Drobny <v.drobny@postgrespro.ru> wrote:
Let me introduce new function for full text search query creation(which is
called 'queryto_tsquery'). It takes 'google like' query string and
translates it to tsquery.
I haven't looked at the code, but that sounds like a neat idea.
+1
This is a very cool feature that makes tsquery much more accessible. Many
people know the sort of de facto search engine query language that
many websites accept, using quotes, AND, OR, - etc.
Calling this search syntax just "query" seems too general and
overloaded. "Simple search", "simple query", "web search", "web
syntax", "web query", "Google-style query", "Poogle" (kidding!) ...
well I'm not sure, but I feel like it deserves a proper name.
websearch_to_tsquery()?
I see that your AROUND(n) is an undocumented Google search syntax.
That's a good trick to know.
Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.
--
Thomas Munro
http://www.enterprisedb.com
On 2017-09-09 06:03, Thomas Munro wrote:
Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.
Hello,
Thank you for your interest. Attached is a rebased
version (based on commit 69835bc8988812c960f4ed5aeee86b62ac73602a).
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery2.patch (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 641b3b8..a694801 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9523,6 +9523,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<row>
<entry>
<indexterm>
+ <primary>queryto_tsquery</primary>
+ </indexterm>
+ <literal><function>queryto_tsquery(<optional> <replaceable class="PARAMETER">config</> <type>regconfig</> , </optional> <replaceable class="PARAMETER">query</> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</> from Google-like query</entry>
+ <entry><literal>queryto_tsquery('english', 'The Fat Rats')</literal></entry>
+ <entry><literal>'fat' & 'rat'</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
<primary>querytree</primary>
</indexterm>
<literal><function>querytree(<replaceable class="PARAMETER">query</replaceable> <type>tsquery</>)</function></literal>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index fe630a6..999e4ad 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,15 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>queryto_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
<function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ about its input. <function>queryto_tsquery</function> provides a
+ different, Google-like syntax for creating a tsquery.
</para>
<indexterm>
@@ -960,8 +962,68 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
-----------------------------
'fat' <-> 'rat' <-> 'c'
</screen>
+</para>
+
+<synopsis>
+queryto_tsquery(<optional> <replaceable class="PARAMETER">config</replaceable> <type>regconfig</>, </optional> <replaceable class="PARAMETER">querytext</replaceable> <type>text</>) returns <type>tsquery</>
+</synopsis>
+
+ <para>
+ <function>queryto_tsquery</> creates a <type>tsquery</type> from unformatted text.
+ Unlike <function>plainto_tsquery</> and <function>phraseto_tsquery</>, it does not
+ ignore operators already present in the input. This function supports the following operators:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ '"some text"' - any text inside double quotes is treated as a phrase and
+ processed as in <function>phraseto_tsquery</>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ 'OR' - standard logical operator. It is just an alias for the '|' sign.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ 'terma AROUND(N) termb' - this operation matches if the distance between
+ terma and termb is no greater than N.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ '-' - standard logical negation sign. It is an alias for the '!' sign.
+ </para>
+ </listitem>
+ </itemizedlist>
+ Any other missing operators are filled in with AND, as in <function>plainto_tsquery</>.
</para>
+ <para>
+ Examples:
+ <screen>
+ select queryto_tsquery('The fat rats');
+ queryto_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select queryto_tsquery('"supernovae stars" AND -crab');
+ queryto_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select queryto_tsquery('-run AROUND(5) "gnu debugger" OR "I like bananas"');
+ queryto_tsquery
+ -----------------------------------------------------------
+ !'run' AROUND(5) 'gnu' <-> 'debugg' | 'like' <-> 'banana'
+ (1 row)
+ </screen>
+ </para>
+
</sect2>
<sect2 id="textsearch-ranking">
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index 35d9ab2..e820042 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -390,7 +390,8 @@ add_to_tsvector(void *_state, char *elem_value, int elem_len)
* and different variants are ORed together.
*/
static void
-pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval, int16 weight, bool prefix)
+pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
+ int16 weight, bool prefix, bool isphrase)
{
int32 count = 0;
ParsedText prs;
@@ -423,7 +424,12 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
/* put placeholders for each missing stop word */
pushStop(state);
if (cntpos)
- pushOperator(state, data->qoperator, 1);
+ {
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
+ }
cntpos++;
pos++;
}
@@ -464,7 +470,10 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
if (cntpos)
{
/* distance may be useful */
- pushOperator(state, data->qoperator, 1);
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
}
cntpos++;
@@ -490,6 +499,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
+ false,
false);
PG_RETURN_TSQUERY(query);
@@ -520,7 +530,8 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_POINTER(query);
}
@@ -551,7 +562,8 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +579,36 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+queryto_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ false,
+ true);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+queryto_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(queryto_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 046f543..375cb5c 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -396,6 +396,7 @@ tsquery_opr_selec(QueryItem *item, char *operand,
break;
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
s1 = tsquery_opr_selec(item + 1, operand,
lookup, length, minfreq);
diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index 83a939d..7dc8f36 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -239,6 +239,7 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
return !result;
case OP_PHRASE:
+ case OP_AROUND:
/*
* GIN doesn't contain any information about positions, so treat
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index fdb0419..c238055 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -29,6 +29,7 @@ const int tsearch_op_priority[OP_COUNT] =
4, /* OP_NOT */
2, /* OP_AND */
1, /* OP_OR */
+ 3, /* OP_AROUND */
3 /* OP_PHRASE */
};
@@ -58,10 +59,11 @@ struct TSQueryParserStateData
};
/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
+#define WAITOPERAND 1
+#define WAITOPERATOR 2
+#define WAITFIRSTOPERAND 3
+#define WAITSINGLEOPERAND 4
+#define INSIDEQUOTES 5
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
@@ -210,6 +212,69 @@ typedef enum
PT_CLOSE = 5
} ts_tokentype;
+
+static bool
+has_prefix(char * str, char * prefix)
+{
+ if (strlen(prefix) > strlen(str))
+ {
+ return false;
+ }
+ while (*prefix != '\0')
+ {
+ if (*(str++) != *(prefix++))
+ {
+ return false;
+ }
+ }
+ return true;
+}
+
+/*
+ * Parse the AROUND operator. The operator
+ * has the following form:
+ *
+ * a AROUND(X) b (distance is no greater than X)
+ *
+ * The buffer should begin with "AROUND(" prefix
+ */
+static char *
+parse_around_operator(char *buf, int16 *distance)
+{
+ char *ptr = buf;
+ char *endptr;
+ long l = 1;
+
+ Assert(has_prefix(ptr, "AROUND("));
+
+ ptr += strlen("AROUND(");
+
+ while (t_isspace(ptr))
+ ptr++;
+
+ l = strtol(ptr, &endptr, 10);
+ if (ptr == endptr)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid AROUND(X) operator!")));
+ else if (errno == ERANGE || l > MAXENTRYPOS)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("distance in AROUND operator should not be greater than %d",
+ MAXENTRYPOS)));
+
+ ptr = endptr;
+ *distance = l;
+ while (t_isspace(ptr))
+ ptr++;
+
+ if (!t_iseq(ptr, ')'))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Missing ')' in AROUND(X) operator")));
+
+ return ++ptr;
+}
/*
* get token from query string
*
@@ -221,7 +286,8 @@ typedef enum
static ts_tokentype
gettoken_query(TSQueryParserState state,
int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+ int *lenval, char **strval, int16 *weight, bool *prefix,
+ bool isquery)
{
*weight = 0;
*prefix = false;
@@ -232,7 +298,7 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if (t_iseq(state->buf, '!') || (isquery && t_iseq(state->buf, '-')))
{
(state->buf)++; /* can safely ++, t_iseq guarantee that
* pg_mblen()==1 */
@@ -254,6 +320,20 @@ gettoken_query(TSQueryParserState state,
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (isquery && t_iseq(state->buf, '"'))
+ {
+ char *quote = strchr(state->buf + 1, '"');
+ if (quote == NULL)
+ {
+ state->buf++;
+ continue;
+ }
+ *strval = state->buf + 1;
+ *lenval = quote - state->buf - 1;
+ state->buf = quote + 1;
+ state->state = INSIDEQUOTES;
+ return PT_VAL;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -291,6 +371,13 @@ gettoken_query(TSQueryParserState state,
(state->buf)++;
return PT_OPR;
}
+ else if (isquery && has_prefix(state->buf, "OR "))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ (state->buf) += 3;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, '<'))
{
state->state = WAITOPERAND;
@@ -301,14 +388,39 @@ gettoken_query(TSQueryParserState state,
return PT_ERR;
return PT_OPR;
}
+ else if (isquery && has_prefix(state->buf, "AROUND("))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AROUND;
+ /* weight var is used as storage for distance */
+ state->buf = parse_around_operator(state->buf, weight);
+ if (*weight < 0)
+ return PT_ERR;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, ')'))
{
(state->buf)++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
+ else if (t_iseq(state->buf, '('))
+ {
+ *operator = OP_AND;
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
else if (*(state->buf) == '\0')
return (state->count) ? PT_ERR : PT_END;
+ else if (isquery &&
+ (t_isalpha(state->buf) || t_iseq(state->buf, '!')
+ || t_iseq(state->buf, '-')
+ || t_iseq(state->buf, '"')))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
else if (!t_isspace(state->buf))
return PT_ERR;
break;
@@ -320,6 +432,9 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
+ case INSIDEQUOTES:
+ state->state = WAITOPERATOR;
+ continue;
default:
return PT_ERR;
break;
@@ -336,12 +451,12 @@ pushOperator(TSQueryParserState state, int8 oper, int16 distance)
{
QueryOperator *tmp;
- Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR || oper == OP_PHRASE);
+ Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR || oper == OP_PHRASE || oper == OP_AROUND);
tmp = (QueryOperator *) palloc0(sizeof(QueryOperator));
tmp->type = QI_OPR;
tmp->oper = oper;
- tmp->distance = (oper == OP_PHRASE) ? distance : 0;
+ tmp->distance = (oper == OP_PHRASE || oper == OP_AROUND) ? distance : 0;
/* left is filled in later with findoprnd */
state->polstr = lcons(tmp, state->polstr);
@@ -475,7 +590,8 @@ cleanOpStack(TSQueryParserState state,
static void
makepol(TSQueryParserState state,
PushFunction pushval,
- Datum opaque)
+ Datum opaque,
+ bool isquery)
{
int8 operator = 0;
ts_tokentype type;
@@ -489,19 +605,19 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix, isquery)) != PT_END)
{
switch (type)
{
case PT_VAL:
- pushval(opaque, state, strval, lenval, weight, prefix);
+ pushval(opaque, state, strval, lenval, weight, prefix, state->state == INSIDEQUOTES);
break;
case PT_OPR:
cleanOpStack(state, opstack, &lenstack, operator);
pushOpStack(opstack, &lenstack, operator, weight);
break;
case PT_OPEN:
- makepol(state, pushval, opaque);
+ makepol(state, pushval, opaque, isquery);
break;
case PT_CLOSE:
cleanOpStack(state, opstack, &lenstack, OP_OR /* lowest */ );
@@ -555,7 +671,8 @@ findoprnd_recurse(QueryItem *ptr, uint32 *pos, int nnodes, bool *needcleanup)
Assert(curitem->oper == OP_AND ||
curitem->oper == OP_OR ||
- curitem->oper == OP_PHRASE);
+ curitem->oper == OP_PHRASE ||
+ curitem->oper == OP_AROUND);
(*pos)++;
@@ -605,7 +722,8 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ bool isplain,
+ bool isquery)
{
struct TSQueryParserStateData state;
int i;
@@ -632,7 +750,7 @@ parse_tsquery(char *buf,
*(state.curop) = '\0';
/* parse query & make polish notation (postfix, but in reverse order) */
- makepol(&state, pushval, opaque);
+ makepol(&state, pushval, opaque, isquery);
close_tsvector_parser(state.valstate);
@@ -703,7 +821,7 @@ parse_tsquery(char *buf,
static void
pushval_asis(Datum opaque, TSQueryParserState state, char *strval, int lenval,
- int16 weight, bool prefix)
+ int16 weight, bool prefix, bool isphrase)
{
pushValue(state, strval, lenval, weight, prefix);
}
@@ -716,7 +834,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false, false));
}
/*
@@ -884,6 +1002,9 @@ infix(INFIX *in, int parentPriority, bool rightPhraseOp)
else
sprintf(in->cur, " <-> %s", nrm.buf);
break;
+ case OP_AROUND:
+ sprintf(in->cur, " AROUND(%d) %s", distance, nrm.buf);
+ break;
default:
/* OP_NOT is handled in above if-branch */
elog(ERROR, "unrecognized operator type: %d", op);
@@ -966,7 +1087,7 @@ tsquerysend(PG_FUNCTION_ARGS)
break;
case QI_OPR:
pq_sendint(&buf, item->qoperator.oper, sizeof(item->qoperator.oper));
- if (item->qoperator.oper == OP_PHRASE)
+ if (item->qoperator.oper == OP_PHRASE || item->qoperator.oper == OP_AROUND)
pq_sendint(&buf, item->qoperator.distance,
sizeof(item->qoperator.distance));
break;
@@ -1063,14 +1184,14 @@ tsqueryrecv(PG_FUNCTION_ARGS)
int8 oper;
oper = (int8) pq_getmsgint(buf, sizeof(int8));
- if (oper != OP_NOT && oper != OP_OR && oper != OP_AND && oper != OP_PHRASE)
+ if (oper != OP_NOT && oper != OP_OR && oper != OP_AND && oper != OP_PHRASE && oper != OP_AROUND)
elog(ERROR, "invalid tsquery: unrecognized operator type %d",
(int) oper);
if (i == size - 1)
elog(ERROR, "invalid pointer to right operand");
item->qoperator.oper = oper;
- if (oper == OP_PHRASE)
+ if (oper == OP_PHRASE || oper == OP_AROUND)
item->qoperator.distance = (int16) pq_getmsgint(buf, sizeof(int16));
}
else
diff --git a/src/backend/utils/adt/tsquery_cleanup.c b/src/backend/utils/adt/tsquery_cleanup.c
index 350171c..071bfa0 100644
--- a/src/backend/utils/adt/tsquery_cleanup.c
+++ b/src/backend/utils/adt/tsquery_cleanup.c
@@ -161,7 +161,8 @@ clean_NOT_intree(NODE *node)
NODE *res = node;
Assert(node->valnode->qoperator.oper == OP_AND ||
- node->valnode->qoperator.oper == OP_PHRASE);
+ node->valnode->qoperator.oper == OP_PHRASE ||
+ node->valnode->qoperator.oper == OP_AROUND);
node->left = clean_NOT_intree(node->left);
node->right = clean_NOT_intree(node->right);
@@ -277,7 +278,8 @@ clean_stopword_intree(NODE *node, int *ladd, int *radd)
node->right = clean_stopword_intree(node->right, &rladd, &rradd);
/* Check if current node is OP_PHRASE, get its distance */
- isphrase = (node->valnode->qoperator.oper == OP_PHRASE);
+ isphrase = (node->valnode->qoperator.oper == OP_PHRASE
+ || node->valnode->qoperator.oper == OP_AROUND);
ndistance = isphrase ? node->valnode->qoperator.distance : 0;
if (node->left == NULL && node->right == NULL)
diff --git a/src/backend/utils/adt/tsquery_op.c b/src/backend/utils/adt/tsquery_op.c
index 755c3e9..7cf9b8a 100644
--- a/src/backend/utils/adt/tsquery_op.c
+++ b/src/backend/utils/adt/tsquery_op.c
@@ -37,7 +37,7 @@ join_tsqueries(TSQuery a, TSQuery b, int8 operator, uint16 distance)
res->valnode = (QueryItem *) palloc0(sizeof(QueryItem));
res->valnode->type = QI_OPR;
res->valnode->qoperator.oper = operator;
- if (operator == OP_PHRASE)
+ if (operator == OP_PHRASE || operator == OP_AROUND)
res->valnode->qoperator.distance = distance;
res->child = (QTNode **) palloc0(sizeof(QTNode *) * 2);
diff --git a/src/backend/utils/adt/tsquery_util.c b/src/backend/utils/adt/tsquery_util.c
index 971bb81..548a846 100644
--- a/src/backend/utils/adt/tsquery_util.c
+++ b/src/backend/utils/adt/tsquery_util.c
@@ -121,7 +121,7 @@ QTNodeCompare(QTNode *an, QTNode *bn)
return res;
}
- if (ao->oper == OP_PHRASE && ao->distance != bo->distance)
+ if ((ao->oper == OP_PHRASE || ao->oper == OP_AROUND) && ao->distance != bo->distance)
return (ao->distance > bo->distance) ? -1 : 1;
return 0;
@@ -171,7 +171,8 @@ QTNSort(QTNode *in)
for (i = 0; i < in->nchild; i++)
QTNSort(in->child[i]);
- if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE)
+ if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE
+ && in->valnode->qoperator.oper != OP_AROUND)
qsort((void *) in->child, in->nchild, sizeof(QTNode *), cmpQTN);
}
diff --git a/src/backend/utils/adt/tsrank.c b/src/backend/utils/adt/tsrank.c
index 4577bcc..b62b14d 100644
--- a/src/backend/utils/adt/tsrank.c
+++ b/src/backend/utils/adt/tsrank.c
@@ -366,7 +366,8 @@ calc_rank(const float *w, TSVector t, TSQuery q, int32 method)
/* XXX: What about NOT? */
res = (item->type == QI_OPR && (item->qoperator.oper == OP_AND ||
- item->qoperator.oper == OP_PHRASE)) ?
+ item->qoperator.oper == OP_PHRASE ||
+ item->qoperator.oper == OP_AROUND)) ?
calc_rank_and(w, t, q) :
calc_rank_or(w, t, q);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 8225202..92d267d 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1429,6 +1429,7 @@ checkcondition_str(void *checkval, QueryOperand *val, ExecPhraseData *data)
#define TSPO_L_ONLY 0x01 /* emit positions appearing only in L */
#define TSPO_R_ONLY 0x02 /* emit positions appearing only in R */
#define TSPO_BOTH 0x04 /* emit positions appearing in both L&R */
+#define TS_NOT_EXAC 0x08 /* not exact distance for AROUND(X) */
static bool
TS_phrase_output(ExecPhraseData *data,
@@ -1473,8 +1474,18 @@ TS_phrase_output(ExecPhraseData *data,
Rpos = INT_MAX;
}
+ /* Processing OP_AROUND */
+ if ((emit & TS_NOT_EXAC) &&
+ Lpos - Rpos >= 0 &&
+ Lpos - Rpos <= (Loffset + Roffset) * 2 - Rdata->width + Ldata->width)
+ {
+ if (emit & TSPO_BOTH)
+ output_pos = Rpos;
+ Lindex++;
+ Rindex++;
+ }
/* Merge-join the two input lists */
- if (Lpos < Rpos)
+ else if (Lpos < Rpos)
{
/* Lpos is not matched in Rdata, should we output it? */
if (emit & TSPO_L_ONLY)
@@ -1625,6 +1636,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
}
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
memset(&Ldata, 0, sizeof(Ldata));
memset(&Rdata, 0, sizeof(Rdata));
@@ -1647,7 +1659,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
(Rdata.npos == 0 && !Rdata.negate))
return (flags & TS_EXEC_PHRASE_NO_POS) ? true : false;
- if (curitem->qoperator.oper == OP_PHRASE)
+ if (curitem->qoperator.oper == OP_PHRASE || curitem->qoperator.oper == OP_AROUND)
{
/*
* Compute Loffset and Roffset suitable for phrase match, and
@@ -1703,7 +1715,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
{
/* straight AND */
return TS_phrase_output(data, &Ldata, &Rdata,
- TSPO_BOTH,
+ TSPO_BOTH | (curitem->qoperator.oper == OP_AROUND ? TS_NOT_EXAC : 0),
Loffset, Roffset,
Min(Ldata.npos, Rdata.npos));
}
@@ -1843,6 +1855,7 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
return TS_execute(curitem + 1, arg, flags, chkcond);
case OP_PHRASE:
+ case OP_AROUND:
return TS_phrase_execute(curitem, arg, flags, chkcond, NULL);
default:
@@ -1882,6 +1895,7 @@ tsquery_requires_match(QueryItem *curitem)
return false;
case OP_PHRASE:
+ case OP_AROUND:
/*
* Treat OP_PHRASE as OP_AND here
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d820b56..79d0c43 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4907,6 +4907,8 @@ DATA(insert OID = 3746 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2
DESCR("make tsquery");
DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( queryto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ queryto_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
@@ -4915,6 +4917,8 @@ DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1
DESCR("make tsquery");
DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( queryto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ queryto_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h
index 30d7c4b..2f4c374 100644
--- a/src/include/tsearch/ts_type.h
+++ b/src/include/tsearch/ts_type.h
@@ -166,8 +166,9 @@ typedef struct
#define OP_NOT 1
#define OP_AND 2
#define OP_OR 3
+#define OP_AROUND 5
#define OP_PHRASE 4 /* highest code, tsquery_cleanup.c */
-#define OP_COUNT 4
+#define OP_COUNT 5
extern const int tsearch_op_priority[OP_COUNT];
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 3312353..034f36c 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -44,11 +44,12 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
char *token, int tokenlen,
int16 tokenweights, /* bitmap as described in
* QueryOperand struct */
- bool prefix);
+ bool prefix,
+ bool isphrase);
extern TSQuery parse_tsquery(char *buf,
PushFunction pushval,
- Datum opaque, bool isplain);
+ Datum opaque, bool isplain, bool isquery);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index b2fc9e2..93db7b8 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1661,3 +1661,115 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+--test queryto_tsquery function
+select queryto_tsquery('My brand new smartphone');
+ queryto_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select queryto_tsquery('My brand "new smartphone"');
+ queryto_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just eaten a -rat.');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just eaten OR -rat.');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just (eaten OR -rat)');
+ queryto_tsquery
+----------------------------------------
+ 'fat' <-> 'cat' & ( 'eaten' | !'rat' )
+(1 row)
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(5) runs');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(5) "gnu debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(6) runs');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(6) "gnu debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+ ?column?
+----------
+ t
+(1 row)
+
+select queryto_tsquery('"fat cat AROUND(5) rat"');
+ queryto_tsquery
+------------------------------------------------
+ 'fat' <-> 'cat' <-> 'around' <-> '5' <-> 'rat'
+(1 row)
+
+select queryto_tsquery('simple','"fat cat OR rat"');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select queryto_tsquery('fat*rat');
+ queryto_tsquery
+-----------------
+ 'fat' & 'rat'
+(1 row)
+
+select queryto_tsquery('fat-rat');
+ queryto_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index e4b21f8..12da75a 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -535,3 +535,34 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+--test queryto_tsquery function
+select queryto_tsquery('My brand new smartphone');
+select queryto_tsquery('My brand "new smartphone"');
+select queryto_tsquery('"A fat cat" has just eaten a -rat.');
+select queryto_tsquery('"A fat cat" has just eaten OR -rat.');
+select queryto_tsquery('"A fat cat" has just (eaten OR -rat)');
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(5) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(5) "gnu debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(6) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(6) "gnu debugger"');
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+
+select queryto_tsquery('"fat cat AROUND(5) rat"');
+select queryto_tsquery('simple','"fat cat OR rat"');
+select queryto_tsquery('fat*rat');
+select queryto_tsquery('fat-rat');
\ No newline at end of file
Hi all,
I am extending the phrase operator <n> so that it also accepts a <n,m>
syntax meaning "from n to m words", and I will use that syntax (<n,m>)
below. I found that a AROUND(N) b is exactly the same as a <-N,N> b,
so it can be replaced during parsing. What do you think of this
idea? In this patch I have noticed some non-obvious behavior:
# select to_tsvector('Hello, cat world!') @@ queryto_tsquery('cat AROUND(1) cat') as match;
match
-------
t
cat AROUND(1) cat is the same as "cat <1> cat || cat <0> cat" and:
# select to_tsvector('Hello, cat world!') @@ to_tsquery('cat <0> cat');
?column?
-------
t
This seems logically correct, but it is a possible pitfall;
maybe it should be documented?
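The pitfall can be reproduced with a small model of the matching rule. This is only a sketch, not the patch's TS_phrase_output logic: it assumes that a AROUND(N) b matches whenever some position of a and some position of b differ by at most N, with a difference of 0 allowed. The function name around_matches and the position arrays are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical model of "a AROUND(n) b": true if any position of a and any
 * position of b are at most n words apart. A difference of 0 is accepted,
 * which is what lets a single word match itself.
 */
bool
around_matches(const int *apos, size_t na,
               const int *bpos, size_t nb, int n)
{
    for (size_t i = 0; i < na; i++)
    {
        for (size_t j = 0; j < nb; j++)
        {
            if (abs(apos[i] - bpos[j]) <= n)
                return true;
        }
    }
    return false;
}
```

In 'Hello, cat world!' the lexeme cat occurs only at position 2, so passing that one position as both operands with n = 1 still returns true, although the document contains a single "cat".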
But a more important question is how the AROUND() operator should handle
stop words. Currently it works like this:
# select queryto_tsquery('cat <2> (a AROUND(10) rat)');
queryto_tsquery
------------------
'cat' <12> 'rat'
(1 row)
# select queryto_tsquery('cat <2> a AROUND(10) rat');
queryto_tsquery
------------------------
'cat' AROUND(12) 'rat'
(1 row)
In my opinion it should be like:
cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-8,12> rat
cat <2> a AROUND(10) rat == cat <2,2> a <-10,10> rat == cat <-8,12> rat
Currently the <n,m> operator can be emulated with a combination of the
phrase operator <n>, AROUND(), and logical operators, but with a native
<n,m> operator it would be much less painful. Please correct me if I am wrong.
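The arithmetic behind <-8,12> can be sketched as interval addition. This is a minimal illustration, assuming that when the stop word between two operators is dropped, their distance ranges simply add; the DistRange type and both helper functions are hypothetical and do not exist in the patch.

```c
#include <assert.h>

/* Hypothetical representation of a <lo,hi> distance range. */
typedef struct
{
    int lo;
    int hi;
} DistRange;

/*
 * Combine the ranges of two operators that become adjacent once the operand
 * between them (a stop word) is removed: the bounds are added pairwise.
 */
DistRange
range_add(DistRange outer, DistRange inner)
{
    DistRange r = {outer.lo + inner.lo, outer.hi + inner.hi};

    return r;
}

/* AROUND(n) expressed as the symmetric range <-n,n>. */
DistRange
range_from_around(int n)
{
    DistRange r = {-n, n};

    return r;
}
```

Under this assumption, adding the range <2,2> to range_from_around(10) gives lo = -8 and hi = 12.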
--
Alexey Chernyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-10-13 16:37, Alexey Chernyshov wrote:
> Hi all,
> I am extending the phrase operator <n> so that it also accepts a <n,m>
> syntax meaning "from n to m words", and I will use that syntax (<n,m>)
> below. I found that a AROUND(N) b is exactly the same as a <-N,N> b,
> so it can be replaced during parsing. What do you think of this
> idea? In this patch I have noticed some non-obvious behavior:
Thank you for the interest and review!
> # select to_tsvector('Hello, cat world!') @@ queryto_tsquery('cat AROUND(1) cat') as match;
> match
> -------
> t
> cat AROUND(1) cat is the same as "cat <1> cat || cat <0> cat" and:
> # select to_tsvector('Hello, cat world!') @@ to_tsquery('cat <0> cat');
> ?column?
> -------
> t
> This seems logically correct, but it is a possible pitfall;
> maybe it should be documented?
It is a tricky question. I think that this interpretation is confusing,
so it would be better to define it as <-N,-1> and <1,N>.
> But a more important question is how the AROUND() operator should handle
> stop words. Currently it works like this:
> # select queryto_tsquery('cat <2> (a AROUND(10) rat)');
> queryto_tsquery
> ------------------
> 'cat' <12> 'rat'
> (1 row)
> # select queryto_tsquery('cat <2> a AROUND(10) rat');
> queryto_tsquery
> ------------------------
> 'cat' AROUND(12) 'rat'
> (1 row)
> In my opinion it should be like:
> cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-8,12> rat
I think that the correct version is:
cat <2> (a AROUND(10) rat) == cat <2,2> (a <-10,10> rat) == cat <-2,12> rat.
> cat <2> a AROUND(10) rat == cat <2,2> a <-10,10> rat == cat <-8,12> rat
It is a problem indeed. I did not catch it during implementation. Thank
you for pointing it out.
> Currently the <n,m> operator can be emulated with a combination of the
> phrase operator <n>, AROUND(), and logical operators, but with a native
> <n,m> operator it would be much less painful. Please correct me if I am wrong.
I think that the <n,m> operator is more general than AROUND(n), so the
latter should be based on yours. However, I think that accepting negative
parameters is not the best idea because it is confusing. On top of that, it
is not really necessary, and I think it won't be popular among users.
It seems to me that the AROUND operator can easily be implemented with
<n,m>, which also helps to avoid the problems that you showed above.
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 09/13/2017 10:57 AM, Victor Drobny wrote:
> On 2017-09-09 06:03, Thomas Munro wrote:
>> Please send a rebased version of the patch for people to review and
>> test as that one has bit-rotted.
> Hello,
> Thank you for your interest. In the attachment you can find a rebased
> version (based on commit 69835bc8988812c960f4ed5aeee86b62ac73602a).
I did a quick review of the patch today. The patch unfortunately no
longer applies, so I had to use an older commit from September. Please
rebase to current master.
I've only looked at the diff at this point, will do more testing once
the rebase happens.
Some comments:
1) This seems to mix multiple improvements into one patch. There's the
support for alternative query syntax, and then there are the new
operators (AROUND and <m,n>). I propose to split the patch into two or
more parts, each addressing one of those bits.
I guess there will be two or three parts - first adding the syntax,
second adding <m,n> and third adding the AROUND(n). Seems reasonable?
2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...
3) In the SGML docs, please use <literal></literal> instead of just
quoting the values. So it should be <literal>|</literal> instead of '|'
etc. Just like in the parts describing plainto_tsquery, for example.
4) Also, I recommend adding a brief explanation what the examples do.
Right now there's just a bunch of queryto_tsquery, and the reader is
expected to understand the output. I suggest adding a sentence or two,
explaining what's happening (just like for plainto_tsquery examples).
5) I'm not sure about negative values in the <n,m> operator. I don't
find it particularly confusing - once you understand that (a <n,m> b)
means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then
negative values seem like a fairly straightforward extension.
But I guess the main question is "Is there really a demand for the new
<n,m> operator, or have we just invented it because we can?"
6) There seem to be some new constants defined, with names not following
the naming convention. I mean this
#define WAITOPERAND 1
#define WAITOPERATOR 2
#define WAITFIRSTOPERAND 3
#define WAITSINGLEOPERAND 4
#define INSIDEQUOTES 5 <-- the new one
and
#define TSPO_L_ONLY 0x01
#define TSPO_R_ONLY 0x02
#define TSPO_BOTH 0x04
#define TS_NOT_EXAC 0x08 <-- the new one
Perhaps that's OK, but it seems a bit inconsistent.
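Point 5 can be made concrete with a small model. The sketch below assumes that k is the signed position difference pos(b) - pos(a), following the convention of the existing <N> phrase operator, rather than a literal count of words in between. The function name range_matches and the position arrays are hypothetical, since <n,m> is only a proposal in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of "a <n,m> b": true if, for some occurrence of a and
 * some occurrence of b, the signed offset k = pos(b) - pos(a) satisfies
 * n <= k <= m. A negative k means that b appears before a.
 */
bool
range_matches(const int *apos, size_t na,
              const int *bpos, size_t nb, int n, int m)
{
    for (size_t i = 0; i < na; i++)
    {
        for (size_t j = 0; j < nb; j++)
        {
            int k = bpos[j] - apos[i];

            if (k >= n && k <= m)
                return true;
        }
    }
    return false;
}
```

With a at position 3 and b at position 1 the offset is -2, so the range <-2,2> matches while <1,2> does not; swapping the operands makes <1,2> match.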
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-11-19 04:30, Tomas Vondra wrote:
Hello,
> Hi,
> On 09/13/2017 10:57 AM, Victor Drobny wrote:
>> On 2017-09-09 06:03, Thomas Munro wrote:
>>> Please send a rebased version of the patch for people to review and
>>> test as that one has bit-rotted.
>> Hello,
>> Thank you for your interest. In the attachment you can find a rebased
>> version (based on commit 69835bc8988812c960f4ed5aeee86b62ac73602a).
> I did a quick review of the patch today. The patch unfortunately no
> longer applies, so I had to use an older commit from September. Please
> rebase to current master.
Thank you for your time. In the attachment you can find a rebased version
(based on commit e842791b0).
> I've only looked at the diff at this point, will do more testing once
> the rebase happens.
> Some comments:
> 1) This seems to mix multiple improvements into one patch. There's the
> support for alternative query syntax, and then there are the new
> operators (AROUND and <m,n>). I propose to split the patch into two or
> more parts, each addressing one of those bits.
I agree. I have split it into 3 parts: support for the AROUND operator,
the queryto_tsquery function, and documentation.
> I guess there will be two or three parts - first adding the syntax,
> second adding <m,n> and third adding the AROUND(n). Seems reasonable?
> 2) I don't think we should mention Google in the docs explicitly. Not
> that I'm somehow anti-google, but this syntax was certainly not invented
> by Google - I vividly remember using something like that on Altavista
> (yeah, I'm old). And it's used by pretty much every other web search
> engine out there ...
Yes, that syntax was not introduced by Google but, for me, it was the
easiest way to give a brief description of it. Of course it can be changed,
I just don't know how. Any suggestions are welcome! ;)
> 3) In the SGML docs, please use <literal></literal> instead of just
> quoting the values. So it should be <literal>|</literal> instead of '|'
> etc. Just like in the parts describing plainto_tsquery, for example.
Fixed. I hope that I didn't miss anything.
> 4) Also, I recommend adding a brief explanation what the examples do.
> Right now there's just a bunch of queryto_tsquery, and the reader is
> expected to understand the output. I suggest adding a sentence or two,
> explaining what's happening (just like for plainto_tsquery examples).
> 5) I'm not sure about negative values in the <n,m> operator. I don't
> find it particularly confusing - once you understand that (a <n,m> b)
> means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then
> negative values seem like a fairly straightforward extension.
> But I guess the main question is "Is there really a demand for the new
> <n,m> operator, or have we just invented it because we can?"
The <n,m> operator is not introduced yet. It is just a concept: these were
our thoughts about implementing the AROUND operator through <n,m> in the
future.
> 6) There seem to be some new constants defined, with names not following
> the naming convention. I mean this
> #define WAITOPERAND 1
> #define WAITOPERATOR 2
> #define WAITFIRSTOPERAND 3
> #define WAITSINGLEOPERAND 4
> #define INSIDEQUOTES 5 <-- the new one
> and
> #define TSPO_L_ONLY 0x01
> #define TSPO_R_ONLY 0x02
> #define TSPO_BOTH 0x04
> #define TS_NOT_EXAC 0x08 <-- the new one
> Perhaps that's OK, but it seems a bit inconsistent.
I agree. I have fixed it.
regards
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery3_01_around.patch (text/x-diff)
diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 046f543..375cb5c 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -396,6 +396,7 @@ tsquery_opr_selec(QueryItem *item, char *operand,
break;
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
s1 = tsquery_opr_selec(item + 1, operand,
lookup, length, minfreq);
diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index aba456e..1c852bd 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -239,10 +239,11 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
return !result;
case OP_PHRASE:
+ case OP_AROUND:
/*
* GIN doesn't contain any information about positions, so treat
- * OP_PHRASE as OP_AND with recheck requirement
+ * OP_PHRASE and OP_AROUND as OP_AND with recheck requirement
*/
*(gcv->need_recheck) = true;
/* Pass down in_phrase == true in case there's a NOT below */
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 5cdfe4d..07d56b0 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -29,7 +29,8 @@ const int tsearch_op_priority[OP_COUNT] =
4, /* OP_NOT */
2, /* OP_AND */
1, /* OP_OR */
- 3 /* OP_PHRASE */
+ 3, /* OP_PHRASE */
+ 3 /* OP_AROUND */
};
struct TSQueryParserStateData
@@ -58,10 +59,10 @@ struct TSQueryParserStateData
};
/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
+#define WAITOPERAND 1
+#define WAITOPERATOR 2
+#define WAITFIRSTOPERAND 3
+#define WAITSINGLEOPERAND 4
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
@@ -210,6 +211,69 @@ typedef enum
PT_CLOSE = 5
} ts_tokentype;
+static bool
+has_prefix(char * str, char * prefix)
+{
+ if (strlen(prefix) > strlen(str))
+ {
+ return false;
+ }
+ while (*prefix != '\0')
+ {
+ if (*(str++) != *(prefix++))
+ {
+ return false;
+ }
+ }
+ return true;
+}
+
+/*
+ * Parse the AROUND operator. The operator
+ * has the following form:
+ *
+ * a AROUND(X) b (distance is no greater than X)
+ *
+ * The buffer should begin with the "AROUND(" prefix
+ */
+static char *
+parse_around_operator(char *buf, int16 *distance)
+{
+ char *ptr = buf;
+ char *endptr;
+ long l = 1;
+
+ Assert(has_prefix(ptr, "AROUND("));
+
+ ptr += strlen("AROUND(");
+
+ while (t_isspace(ptr))
+ ptr++;
+
+ l = strtol(ptr, &endptr, 10);
+ if (ptr == endptr)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid AROUND(X) operator!")));
+ else if (errno == ERANGE || l > MAXENTRYPOS)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("distance in AROUND operator should not be greater than %d",
+ MAXENTRYPOS)));
+
+ ptr = endptr;
+ *distance = l;
+ while (t_isspace(ptr))
+ ptr++;
+
+ if (!t_iseq(ptr, ')'))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Missing ')' in AROUND(X) operator")));
+
+ return ++ptr;
+}
+
/*
* get token from query string
*
@@ -301,6 +365,16 @@ gettoken_query(TSQueryParserState state,
return PT_ERR;
return PT_OPR;
}
+ else if (has_prefix(state->buf, "AROUND("))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AROUND;
+ /* weight var is used as storage for distance */
+ state->buf = parse_around_operator(state->buf, weight);
+ if (*weight < 0)
+ return PT_ERR;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, ')'))
{
(state->buf)++;
@@ -336,12 +410,13 @@ pushOperator(TSQueryParserState state, int8 oper, int16 distance)
{
QueryOperator *tmp;
- Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR || oper == OP_PHRASE);
+ Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR
+ || oper == OP_PHRASE || oper == OP_AROUND);
tmp = (QueryOperator *) palloc0(sizeof(QueryOperator));
tmp->type = QI_OPR;
tmp->oper = oper;
- tmp->distance = (oper == OP_PHRASE) ? distance : 0;
+ tmp->distance = (oper == OP_PHRASE || oper == OP_AROUND) ? distance : 0;
/* left is filled in later with findoprnd */
state->polstr = lcons(tmp, state->polstr);
@@ -555,7 +630,8 @@ findoprnd_recurse(QueryItem *ptr, uint32 *pos, int nnodes, bool *needcleanup)
Assert(curitem->oper == OP_AND ||
curitem->oper == OP_OR ||
- curitem->oper == OP_PHRASE);
+ curitem->oper == OP_PHRASE ||
+ curitem->oper == OP_AROUND);
(*pos)++;
@@ -884,6 +960,9 @@ infix(INFIX *in, int parentPriority, bool rightPhraseOp)
else
sprintf(in->cur, " <-> %s", nrm.buf);
break;
+ case OP_AROUND:
+ sprintf(in->cur, " AROUND(%d) %s", distance, nrm.buf);
+ break;
default:
/* OP_NOT is handled in above if-branch */
elog(ERROR, "unrecognized operator type: %d", op);
@@ -966,7 +1045,8 @@ tsquerysend(PG_FUNCTION_ARGS)
break;
case QI_OPR:
pq_sendint8(&buf, item->qoperator.oper);
- if (item->qoperator.oper == OP_PHRASE)
+ if (item->qoperator.oper == OP_PHRASE
+ || item->qoperator.oper == OP_AROUND)
pq_sendint16(&buf, item->qoperator.distance);
break;
default:
@@ -1062,14 +1142,15 @@ tsqueryrecv(PG_FUNCTION_ARGS)
int8 oper;
oper = (int8) pq_getmsgint(buf, sizeof(int8));
- if (oper != OP_NOT && oper != OP_OR && oper != OP_AND && oper != OP_PHRASE)
+ if (oper != OP_NOT && oper != OP_OR && oper != OP_AND
+ && oper != OP_PHRASE && oper != OP_AROUND)
elog(ERROR, "invalid tsquery: unrecognized operator type %d",
(int) oper);
if (i == size - 1)
elog(ERROR, "invalid pointer to right operand");
item->qoperator.oper = oper;
- if (oper == OP_PHRASE)
+ if (oper == OP_PHRASE || oper == OP_AROUND)
item->qoperator.distance = (int16) pq_getmsgint(buf, sizeof(int16));
}
else
diff --git a/src/backend/utils/adt/tsquery_cleanup.c b/src/backend/utils/adt/tsquery_cleanup.c
index 350171c..bb7fbd5 100644
--- a/src/backend/utils/adt/tsquery_cleanup.c
+++ b/src/backend/utils/adt/tsquery_cleanup.c
@@ -161,7 +161,8 @@ clean_NOT_intree(NODE *node)
NODE *res = node;
Assert(node->valnode->qoperator.oper == OP_AND ||
- node->valnode->qoperator.oper == OP_PHRASE);
+ node->valnode->qoperator.oper == OP_PHRASE ||
+ node->valnode->qoperator.oper == OP_AROUND);
node->left = clean_NOT_intree(node->left);
node->right = clean_NOT_intree(node->right);
@@ -277,7 +278,8 @@ clean_stopword_intree(NODE *node, int *ladd, int *radd)
node->right = clean_stopword_intree(node->right, &rladd, &rradd);
/* Check if current node is OP_PHRASE, get its distance */
- isphrase = (node->valnode->qoperator.oper == OP_PHRASE);
+ isphrase = (node->valnode->qoperator.oper == OP_PHRASE ||
+ node->valnode->qoperator.oper == OP_AROUND);
ndistance = isphrase ? node->valnode->qoperator.distance : 0;
if (node->left == NULL && node->right == NULL)
diff --git a/src/backend/utils/adt/tsquery_op.c b/src/backend/utils/adt/tsquery_op.c
index 755c3e9..7cf9b8a 100644
--- a/src/backend/utils/adt/tsquery_op.c
+++ b/src/backend/utils/adt/tsquery_op.c
@@ -37,7 +37,7 @@ join_tsqueries(TSQuery a, TSQuery b, int8 operator, uint16 distance)
res->valnode = (QueryItem *) palloc0(sizeof(QueryItem));
res->valnode->type = QI_OPR;
res->valnode->qoperator.oper = operator;
- if (operator == OP_PHRASE)
+ if (operator == OP_PHRASE || operator == OP_AROUND)
res->valnode->qoperator.distance = distance;
res->child = (QTNode **) palloc0(sizeof(QTNode *) * 2);
diff --git a/src/backend/utils/adt/tsquery_util.c b/src/backend/utils/adt/tsquery_util.c
index 971bb81..548a846 100644
--- a/src/backend/utils/adt/tsquery_util.c
+++ b/src/backend/utils/adt/tsquery_util.c
@@ -121,7 +121,7 @@ QTNodeCompare(QTNode *an, QTNode *bn)
return res;
}
- if (ao->oper == OP_PHRASE && ao->distance != bo->distance)
+ if ((ao->oper == OP_PHRASE || ao->oper == OP_AROUND) && ao->distance != bo->distance)
return (ao->distance > bo->distance) ? -1 : 1;
return 0;
@@ -171,7 +171,8 @@ QTNSort(QTNode *in)
for (i = 0; i < in->nchild; i++)
QTNSort(in->child[i]);
- if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE)
+ if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE
+ && in->valnode->qoperator.oper != OP_AROUND)
qsort((void *) in->child, in->nchild, sizeof(QTNode *), cmpQTN);
}
diff --git a/src/backend/utils/adt/tsrank.c b/src/backend/utils/adt/tsrank.c
index 4577bcc..b62b14d 100644
--- a/src/backend/utils/adt/tsrank.c
+++ b/src/backend/utils/adt/tsrank.c
@@ -366,7 +366,8 @@ calc_rank(const float *w, TSVector t, TSQuery q, int32 method)
/* XXX: What about NOT? */
res = (item->type == QI_OPR && (item->qoperator.oper == OP_AND ||
- item->qoperator.oper == OP_PHRASE)) ?
+ item->qoperator.oper == OP_PHRASE ||
+ item->qoperator.oper == OP_AROUND)) ?
calc_rank_and(w, t, q) :
calc_rank_or(w, t, q);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 8225202..e8bc9eb 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1429,6 +1429,7 @@ checkcondition_str(void *checkval, QueryOperand *val, ExecPhraseData *data)
#define TSPO_L_ONLY 0x01 /* emit positions appearing only in L */
#define TSPO_R_ONLY 0x02 /* emit positions appearing only in R */
#define TSPO_BOTH 0x04 /* emit positions appearing in both L&R */
+#define TSPO_NOT_EXAC 0x08 /* not exact distance for AROUND(X) */
static bool
TS_phrase_output(ExecPhraseData *data,
@@ -1473,8 +1474,18 @@ TS_phrase_output(ExecPhraseData *data,
Rpos = INT_MAX;
}
+ /* Processing OP_AROUND */
+ if ((emit & TSPO_NOT_EXAC) &&
+ Lpos - Rpos >= 0 &&
+ Lpos - Rpos <= (Loffset + Roffset) * 2 - Rdata->width + Ldata->width)
+ {
+ if (emit & TSPO_BOTH)
+ output_pos = Rpos;
+ Lindex++;
+ Rindex++;
+ }
/* Merge-join the two input lists */
- if (Lpos < Rpos)
+ else if (Lpos < Rpos)
{
/* Lpos is not matched in Rdata, should we output it? */
if (emit & TSPO_L_ONLY)
@@ -1625,6 +1636,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
}
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
memset(&Ldata, 0, sizeof(Ldata));
memset(&Rdata, 0, sizeof(Rdata));
@@ -1647,7 +1659,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
(Rdata.npos == 0 && !Rdata.negate))
return (flags & TS_EXEC_PHRASE_NO_POS) ? true : false;
- if (curitem->qoperator.oper == OP_PHRASE)
+ if (curitem->qoperator.oper == OP_PHRASE || curitem->qoperator.oper == OP_AROUND)
{
/*
* Compute Loffset and Roffset suitable for phrase match, and
@@ -1703,7 +1715,8 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
{
/* straight AND */
return TS_phrase_output(data, &Ldata, &Rdata,
- TSPO_BOTH,
+ TSPO_BOTH |
+ (curitem->qoperator.oper == OP_AROUND ? TSPO_NOT_EXAC : 0),
Loffset, Roffset,
Min(Ldata.npos, Rdata.npos));
}
@@ -1843,6 +1856,7 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
return TS_execute(curitem + 1, arg, flags, chkcond);
case OP_PHRASE:
+ case OP_AROUND:
return TS_phrase_execute(curitem, arg, flags, chkcond, NULL);
default:
@@ -1882,9 +1896,10 @@ tsquery_requires_match(QueryItem *curitem)
return false;
case OP_PHRASE:
+ case OP_AROUND:
/*
- * Treat OP_PHRASE as OP_AND here
+ * Treat OP_PHRASE and OP_AROUND as OP_AND here
*/
case OP_AND:
/* If either side requires a match, we're good */
diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h
index 30d7c4b..ee1a184 100644
--- a/src/include/tsearch/ts_type.h
+++ b/src/include/tsearch/ts_type.h
@@ -166,8 +166,9 @@ typedef struct
#define OP_NOT 1
#define OP_AND 2
#define OP_OR 3
-#define OP_PHRASE 4 /* highest code, tsquery_cleanup.c */
-#define OP_COUNT 4
+#define OP_PHRASE 4
+#define OP_AROUND 5 /* highest code, tsquery_cleanup.c */
+#define OP_COUNT 5
extern const int tsearch_op_priority[OP_COUNT];
queryto_tsquery3_02_queryto_tsquery.patch (text/x-diff)
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index cf55e39..2972540 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -390,7 +390,8 @@ add_to_tsvector(void *_state, char *elem_value, int elem_len)
* and different variants are ORed together.
*/
static void
-pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval, int16 weight, bool prefix)
+pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
+ int16 weight, bool prefix, bool isphrase)
{
int32 count = 0;
ParsedText prs;
@@ -423,7 +424,12 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
/* put placeholders for each missing stop word */
pushStop(state);
if (cntpos)
- pushOperator(state, data->qoperator, 1);
+ {
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
+ }
cntpos++;
pos++;
}
@@ -464,7 +470,10 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
if (cntpos)
{
/* distance may be useful */
- pushOperator(state, data->qoperator, 1);
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
}
cntpos++;
@@ -490,6 +499,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
+ false,
false);
PG_RETURN_TSQUERY(query);
@@ -520,7 +530,8 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_POINTER(query);
}
@@ -551,7 +562,8 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +579,36 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+queryto_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ false,
+ true);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+queryto_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(queryto_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
\ No newline at end of file
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 07d56b0..53c9fc7 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -63,6 +63,7 @@ struct TSQueryParserStateData
#define WAITOPERATOR 2
#define WAITFIRSTOPERAND 3
#define WAITSINGLEOPERAND 4
+#define WAITQUOTE 5
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
@@ -285,7 +286,8 @@ parse_around_operator(char *buf, int16 *distance)
static ts_tokentype
gettoken_query(TSQueryParserState state,
int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+ int *lenval, char **strval, int16 *weight, bool *prefix,
+ bool isquery)
{
*weight = 0;
*prefix = false;
@@ -296,7 +298,8 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if (t_iseq(state->buf, '!') ||
+ (isquery && t_iseq(state->buf, '-')))
{
(state->buf)++; /* can safely ++, t_iseq guarantee that
* pg_mblen()==1 */
@@ -318,6 +321,20 @@ gettoken_query(TSQueryParserState state,
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (isquery && t_iseq(state->buf, '"'))
+ {
+ char *quote = strchr(state->buf + 1, '"');
+ if (quote == NULL)
+ {
+ state->buf++;
+ continue;
+ }
+ *strval = state->buf + 1;
+ *lenval = quote - state->buf - 1;
+ state->buf = quote + 1;
+ state->state = WAITQUOTE;
+ return PT_VAL;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -355,6 +372,13 @@ gettoken_query(TSQueryParserState state,
(state->buf)++;
return PT_OPR;
}
+ else if (isquery && has_prefix(state->buf, "OR "))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ (state->buf) += 3;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, '<'))
{
state->state = WAITOPERAND;
@@ -365,7 +389,7 @@ gettoken_query(TSQueryParserState state,
return PT_ERR;
return PT_OPR;
}
- else if (has_prefix(state->buf, "AROUND("))
+ else if (isquery && has_prefix(state->buf, "AROUND("))
{
state->state = WAITOPERAND;
*operator = OP_AROUND;
@@ -381,8 +405,23 @@ gettoken_query(TSQueryParserState state,
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
+ else if (t_iseq(state->buf, '('))
+ {
+ *operator = OP_AND;
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
else if (*(state->buf) == '\0')
return (state->count) ? PT_ERR : PT_END;
+ else if (isquery &&
+ (t_isalpha(state->buf) || t_iseq(state->buf, '!')
+ || t_iseq(state->buf, '-')
+ || t_iseq(state->buf, '"')))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
else if (!t_isspace(state->buf))
return PT_ERR;
break;
@@ -394,6 +433,9 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
+ case WAITQUOTE:
+ state->state = WAITOPERATOR;
+ continue;
default:
return PT_ERR;
break;
@@ -550,7 +592,8 @@ cleanOpStack(TSQueryParserState state,
static void
makepol(TSQueryParserState state,
PushFunction pushval,
- Datum opaque)
+ Datum opaque,
+ bool isquery)
{
int8 operator = 0;
ts_tokentype type;
@@ -564,19 +607,20 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight,
+ &prefix, isquery)) != PT_END)
{
switch (type)
{
case PT_VAL:
- pushval(opaque, state, strval, lenval, weight, prefix);
+ pushval(opaque, state, strval, lenval, weight, prefix, state->state == WAITQUOTE);
break;
case PT_OPR:
cleanOpStack(state, opstack, &lenstack, operator);
pushOpStack(opstack, &lenstack, operator, weight);
break;
case PT_OPEN:
- makepol(state, pushval, opaque);
+ makepol(state, pushval, opaque, isquery);
break;
case PT_CLOSE:
cleanOpStack(state, opstack, &lenstack, OP_OR /* lowest */ );
@@ -681,7 +725,8 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ bool isplain,
+ bool isquery)
{
struct TSQueryParserStateData state;
int i;
@@ -708,7 +753,7 @@ parse_tsquery(char *buf,
*(state.curop) = '\0';
/* parse query & make polish notation (postfix, but in reverse order) */
- makepol(&state, pushval, opaque);
+ makepol(&state, pushval, opaque, isquery);
close_tsvector_parser(state.valstate);
@@ -779,7 +824,7 @@ parse_tsquery(char *buf,
static void
pushval_asis(Datum opaque, TSQueryParserState state, char *strval, int lenval,
- int16 weight, bool prefix)
+ int16 weight, bool prefix, bool isphrase)
{
pushValue(state, strval, lenval, weight, prefix);
}
@@ -792,7 +837,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false, false));
}
/*
diff --git a/src/backend/utils/adt/tsquery_cleanup.c b/src/backend/utils/adt/tsquery_cleanup.c
index e679bb5..bb7fbd5 100644
--- a/src/backend/utils/adt/tsquery_cleanup.c
+++ b/src/backend/utils/adt/tsquery_cleanup.c
@@ -278,7 +278,7 @@ clean_stopword_intree(NODE *node, int *ladd, int *radd)
node->right = clean_stopword_intree(node->right, &rladd, &rradd);
/* Check if current node is OP_PHRASE, get its distance */
- isphrase = (node->valnode->qoperator.oper == OP_PHRASE ||
+ isphrase = (node->valnode->qoperator.oper == OP_PHRASE ||
node->valnode->qoperator.oper == OP_AROUND);
ndistance = isphrase ? node->valnode->qoperator.distance : 0;
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index ce7ee25..e8bc9eb 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1715,7 +1715,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
{
/* straight AND */
return TS_phrase_output(data, &Ldata, &Rdata,
- TSPO_BOTH |
+ TSPO_BOTH |
(curitem->qoperator.oper == OP_AROUND ? TSPO_NOT_EXAC : 0),
Loffset, Roffset,
Min(Ldata.npos, Rdata.npos));
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c969375..9abde15 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4906,6 +4906,8 @@ DATA(insert OID = 3746 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2
DESCR("make tsquery");
DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( queryto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ queryto_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
@@ -4914,6 +4916,8 @@ DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1
DESCR("make tsquery");
DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( queryto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ queryto_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 782548c..ffb1762 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -44,11 +44,12 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
char *token, int tokenlen,
int16 tokenweights, /* bitmap as described in
* QueryOperand struct */
- bool prefix);
+ bool prefix,
+ bool isphrase);
extern TSQuery parse_tsquery(char *buf,
PushFunction pushval,
- Datum opaque, bool isplain);
+ Datum opaque, bool isplain, bool isquery);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12..2eb410b 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,115 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+
+--test queryto_tsquery function
+select queryto_tsquery('My brand new smartphone');
+ queryto_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select queryto_tsquery('My brand "new smartphone"');
+ queryto_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just eaten a -rat.');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just eaten OR -rat.');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select queryto_tsquery('"A fat cat" has just (eaten OR -rat)');
+ queryto_tsquery
+----------------------------------------
+ 'fat' <-> 'cat' & ( 'eaten' | !'rat' )
+(1 row)
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(5) runs');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(5) "gnu debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(6) runs');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(6) "gnu debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+ ?column?
+----------
+ t
+(1 row)
+
+select queryto_tsquery('"fat cat AROUND(5) rat"');
+ queryto_tsquery
+------------------------------------------------
+ 'fat' <-> 'cat' <-> 'around' <-> '5' <-> 'rat'
+(1 row)
+
+select queryto_tsquery('simple','"fat cat OR rat"');
+ queryto_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select queryto_tsquery('fat*rat');
+ queryto_tsquery
+-----------------
+ 'fat' & 'rat'
+(1 row)
+
+select queryto_tsquery('fat-rat');
+ queryto_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b..65e71da 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,34 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+--test queryto_tsquery function
+select queryto_tsquery('My brand new smartphone');
+select queryto_tsquery('My brand "new smartphone"');
+select queryto_tsquery('"A fat cat" has just eaten a -rat.');
+select queryto_tsquery('"A fat cat" has just eaten OR -rat.');
+select queryto_tsquery('"A fat cat" has just (eaten OR -rat)');
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(5) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(5) "gnu debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"gnu debugger" AROUND(6) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('run AROUND(6) "gnu debugger"');
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+queryto_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+
+select queryto_tsquery('"fat cat AROUND(5) rat"');
+select queryto_tsquery('simple','"fat cat OR rat"');
+select queryto_tsquery('fat*rat');
+select queryto_tsquery('fat-rat');
\ No newline at end of file
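The expected results above suggest how AROUND(N) measures distance. Counting 1-based word positions (stop words included), "debugger" in "gnu debugger" is word 3 and "runs" is word 9, so the gap is 6: AROUND(5) gives f and AROUND(6) gives t, in either operand order. For phrases the tests are consistent with measuring between the nearest ends of the two phrases, but that is an inference from the outputs, not something the patch states. A minimal sketch of the check for two single positions, assuming this reading of the tests; `within_distance` is a hypothetical helper, not a function from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): returns true when two
 * 1-based word positions are at most n positions apart, in either order.
 */
static bool
within_distance(int pos_a, int pos_b, int n)
{
    assert(n >= 0);
    return abs(pos_a - pos_b) <= n;
}
```

With positions 3 and 9 this returns false for n = 5 and true for n = 6, matching the first four AROUND test cases.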
queryto_tsquery3_03_documentation.patch (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4dd9d02..b4d95ed 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9535,6 +9535,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<row>
<entry>
<indexterm>
+ <primary>queryto_tsquery</primary>
+ </indexterm>
+ <literal><function>queryto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type> , </optional> <replaceable class="parameter">query</replaceable> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</type> from google like query</entry>
+ <entry><literal>queryto_tsquery('english', 'The Fat Rats')</literal></entry>
+ <entry><literal>'fat' & 'rat'</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
<primary>querytree</primary>
</indexterm>
<literal><function>querytree(<replaceable class="parameter">query</replaceable> <type>tsquery</type>)</function></literal>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 4dc52ec..5e05985 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,15 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>queryto_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
<function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ about its input. <function>queryto_tsquery</function> provides a
+ different, Google like syntax to create tsquery.
</para>
<indexterm>
@@ -960,8 +962,72 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
-----------------------------
'fat' <-> 'rat' <-> 'c'
</screen>
- </para>
+</para>
+
+<synopsis>
+queryto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
+</synopsis>
+ <para>
+ <function>queryto_tsquery</function> creates a <type>tsquery</type> from unformatted text.
+ Unlike <function>plainto_tsquery</function> and <function>phraseto_tsquery</function>, it does not
+ ignore operators already present in the input. This function supports the following operators:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <literal>some text</literal> - any text inside quote signs will be treated as a phrase and will be
+ performed like in <function>phraseto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>OR</literal> - standard logical operator. It is just an alias for <literal>|</literal> sign.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>terma AROUND(N) termb</literal> - this operation will match if the distance between
+ terma and termb is less than or equal to N.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>-</literal> - standard logical negation sign. It is an alias for <literal>!</literal> sign.
+ </para>
+ </listitem>
+ </itemizedlist>
+ Other missing operators will be replaced by AND like in <function>plainto_tsquery</function>.
+ </para>
+ <para>
+ Examples:
+ <screen>
+ select queryto_tsquery('The fat rats');
+ queryto_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select queryto_tsquery('"supernovae stars" AND -crab');
+ queryto_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select queryto_tsquery('-run AROUND(5) "gnu debugger" OR "I like bananas"');
+ queryto_tsquery
+ -----------------------------------------------------------
+ !'run' AROUND(5) 'gnu' <-> 'debugg' | 'like' <-> 'banana'
+ (1 row)
+ </screen>
+ </para>
+ <para>
+ Note that in these examples <function>queryto_tsquery</function> kept the
+ operators that were given explicitly and inserted the required ones
+ where they were omitted: the phrase operator inside quoted text
+ and & everywhere else.
+ </para>
</sect2>
<sect2 id="textsearch-ranking">
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Hi Victor,
I like the idea and I think it's a great patch. However, in its current shape it
requires some rework to meet PostgreSQL standards of code quality.
Particularly:
1. Many new procedures don't have a comment with a brief description. Ideally
every procedure should have not only a brief description but also a description
of every argument, return value and changes of global state if applicable.
2. I believe you could implement the has_prefix procedure just as a wrapper of
strstr().
3. I suggest using snprintf instead of sprintf in new code whenever
possible, especially if you are using %s - just to be on the safe side.
4. I noticed that your code affects the catalog. Did you check that your
changes will not cause problems during migration from an older version of
PostgreSQL to the newer one?
5. Tests for queryto_tsquery use only ASCII strings. I suggest adding a few
tests that use non-ASCII characters as well, and a few corner cases like an empty
string, a string that contains only stop words, etc.
6. `make check-world` doesn't pass:
```
***************
*** 1672,1678 ****
(1 row)
set enable_seqscan = on;
-
--test queryto_tsquery function
select queryto_tsquery('My brand new smartphone');
queryto_tsquery
--- 1672,1677 ----
***************
*** 1784,1786 ****
--- 1783,1786 ----
---------------------------
'fat-rat' & 'fat' & 'rat'
(1 row)
+
```
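Points 2 and 3 of the review can be illustrated with a short sketch. Two caveats: the review suggests strstr(), while this sketch swaps in strncmp(), which only looks at the first strlen(prefix) bytes; and `format_around` is a hypothetical helper with an illustrative buffer interface, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true if 'str' starts with 'prefix' (strncmp variant). */
static bool
has_prefix(const char *str, const char *prefix)
{
    assert(str != NULL && prefix != NULL);
    return strncmp(str, prefix, strlen(prefix)) == 0;
}

/*
 * Hypothetical helper: writes " AROUND(<distance>) <operand>" into buf.
 * Returns true only if the whole string fit, so truncation is detected.
 */
static bool
format_around(char *buf, size_t buflen, int distance, const char *operand)
{
    int n;

    assert(buf != NULL && operand != NULL);
    n = snprintf(buf, buflen, " AROUND(%d) %s", distance, operand);
    return n >= 0 && (size_t) n < buflen;
}
```

Checking the return value of snprintf against the buffer size is what makes the bounded call safer than sprintf: output is never written past buflen, and the caller can tell when it was cut short.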
Hi Victor,
I like the idea and I think it's a great patch. However, in its current shape it
requires some rework to meet PostgreSQL standards of code quality.
Also I would like to add that I agree with Thomas Munro:
Calling this search syntax just "query" seems too general and
overloaded. "Simple search", "simple query", "web search", "web
syntax", "web query", "Google-style query", "Poogle" (kidding!) ...
well I'm not sure, but I feel like it deserves a proper name.
websearch_to_tsquery()?
websearch_to_tsquery() sounds much better than query_to_tsquery().
Also I agree with Tomas Vondra that:
2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...
I suggest rephrasing:
```
+ about its input. <function>queryto_tsquery</function> provides a
+ different, Google like syntax to create tsquery.
```
.. to something more like "provides a different syntax, similar to the one
used in web search engines, to create tsquery". And maybe give a few
examples right in the next sentence.
--
Best regards,
Aleksander Alekseev
On Tue, Nov 28, 2017 at 11:57 PM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
The patch got a review less than a day ago, so I am moving it to the next
CF with the same status, waiting on author.
--
Michael
On 2017-11-28 17:57, Aleksander Alekseev wrote:
Hi Aleksander,
Thank you for the review. I have tried to address all of your comments.
However, I want to mention that the lack of comments on the new functions
in to_tsany.c matches the other similar functions there, which have no
comments either.
Best,
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2017-11-29 17:56, Victor Drobny wrote:
Sorry, I forgot to attach the new version of the patch.
--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
queryto_tsquery4_01_around.patch (text/x-diff)
diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 046f543..375cb5c 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -396,6 +396,7 @@ tsquery_opr_selec(QueryItem *item, char *operand,
break;
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
s1 = tsquery_opr_selec(item + 1, operand,
lookup, length, minfreq);
diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index aba456e..1c852bd 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -239,10 +239,11 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
return !result;
case OP_PHRASE:
+ case OP_AROUND:
/*
* GIN doesn't contain any information about positions, so treat
- * OP_PHRASE as OP_AND with recheck requirement
+ * OP_PHRASE and OP_AROUND as OP_AND with recheck requirement
*/
*(gcv->need_recheck) = true;
/* Pass down in_phrase == true in case there's a NOT below */
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 5cdfe4d..9c04391 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -29,7 +29,8 @@ const int tsearch_op_priority[OP_COUNT] =
4, /* OP_NOT */
2, /* OP_AND */
1, /* OP_OR */
- 3 /* OP_PHRASE */
+ 3, /* OP_PHRASE */
+ 3 /* OP_AROUND */
};
struct TSQueryParserStateData
@@ -58,10 +59,10 @@ struct TSQueryParserStateData
};
/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
+#define WAITOPERAND 1
+#define WAITOPERATOR 2
+#define WAITFIRSTOPERAND 3
+#define WAITSINGLEOPERAND 4
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
@@ -211,6 +212,65 @@ typedef enum
} ts_tokentype;
/*
+ * Checks if 'str' starts with a 'prefix'
+ */
+static bool
+has_prefix(char * str, char * prefix)
+{
+ if (strlen(prefix) > strlen(str))
+ {
+ return false;
+ }
+ return strstr(str, prefix) == str;
+}
+
+/*
+ * Parse around operator. The operator
+ * have the following form:
+ *
+ * a AROUND(X) b (distance is no greater than X)
+ *
+ * The buffer should begins with "AROUND(" prefix
+ */
+static char *
+parse_around_operator(char *buf, int16 *distance)
+{
+ char *ptr = buf;
+ char *endptr;
+ long l = 1;
+
+ Assert(has_prefix(ptr, "AROUND("));
+
+ ptr += strlen("AROUND(");
+
+ while (t_isspace(ptr))
+ ptr++;
+
+ l = strtol(ptr, &endptr, 10);
+ if (ptr == endptr)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Invalid AROUND(X) operator!")));
+ else if (errno == ERANGE || l > MAXENTRYPOS)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("distance in AROUND operator should not be greater than %d",
+ MAXENTRYPOS)));
+
+ ptr = endptr;
+ *distance = l;
+ while (t_isspace(ptr))
+ ptr++;
+
+ if (!t_iseq(ptr, ')'))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("Missing ')' in AROUND(X) operator")));
+
+ return ++ptr;
+}
+
+/*
* get token from query string
*
* *operator is filled in with OP_* when return values is PT_OPR,
@@ -301,6 +361,16 @@ gettoken_query(TSQueryParserState state,
return PT_ERR;
return PT_OPR;
}
+ else if (has_prefix(state->buf, "AROUND("))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AROUND;
+ /* weight var is used as storage for distance */
+ state->buf = parse_around_operator(state->buf, weight);
+ if (*weight < 0)
+ return PT_ERR;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, ')'))
{
(state->buf)++;
@@ -336,12 +406,13 @@ pushOperator(TSQueryParserState state, int8 oper, int16 distance)
{
QueryOperator *tmp;
- Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR || oper == OP_PHRASE);
+ Assert(oper == OP_NOT || oper == OP_AND || oper == OP_OR
+ || oper == OP_PHRASE || oper == OP_AROUND);
tmp = (QueryOperator *) palloc0(sizeof(QueryOperator));
tmp->type = QI_OPR;
tmp->oper = oper;
- tmp->distance = (oper == OP_PHRASE) ? distance : 0;
+ tmp->distance = (oper == OP_PHRASE || oper == OP_AROUND) ? distance : 0;
/* left is filled in later with findoprnd */
state->polstr = lcons(tmp, state->polstr);
@@ -555,7 +626,8 @@ findoprnd_recurse(QueryItem *ptr, uint32 *pos, int nnodes, bool *needcleanup)
Assert(curitem->oper == OP_AND ||
curitem->oper == OP_OR ||
- curitem->oper == OP_PHRASE);
+ curitem->oper == OP_PHRASE ||
+ curitem->oper == OP_AROUND);
(*pos)++;
@@ -884,6 +956,9 @@ infix(INFIX *in, int parentPriority, bool rightPhraseOp)
else
sprintf(in->cur, " <-> %s", nrm.buf);
break;
+ case OP_AROUND:
+ snprintf(in->cur, 256, " AROUND(%d) %s", distance, nrm.buf);
+ break;
default:
/* OP_NOT is handled in above if-branch */
elog(ERROR, "unrecognized operator type: %d", op);
@@ -966,7 +1041,8 @@ tsquerysend(PG_FUNCTION_ARGS)
break;
case QI_OPR:
pq_sendint8(&buf, item->qoperator.oper);
- if (item->qoperator.oper == OP_PHRASE)
+ if (item->qoperator.oper == OP_PHRASE
+ || item->qoperator.oper == OP_AROUND)
pq_sendint16(&buf, item->qoperator.distance);
break;
default:
@@ -1062,14 +1138,15 @@ tsqueryrecv(PG_FUNCTION_ARGS)
int8 oper;
oper = (int8) pq_getmsgint(buf, sizeof(int8));
- if (oper != OP_NOT && oper != OP_OR && oper != OP_AND && oper != OP_PHRASE)
+ if (oper != OP_NOT && oper != OP_OR && oper != OP_AND
+ && oper != OP_PHRASE && oper != OP_AROUND)
elog(ERROR, "invalid tsquery: unrecognized operator type %d",
(int) oper);
if (i == size - 1)
elog(ERROR, "invalid pointer to right operand");
item->qoperator.oper = oper;
- if (oper == OP_PHRASE)
+ if (oper == OP_PHRASE || oper == OP_AROUND)
item->qoperator.distance = (int16) pq_getmsgint(buf, sizeof(int16));
}
else
diff --git a/src/backend/utils/adt/tsquery_cleanup.c b/src/backend/utils/adt/tsquery_cleanup.c
index 350171c..e679bb5 100644
--- a/src/backend/utils/adt/tsquery_cleanup.c
+++ b/src/backend/utils/adt/tsquery_cleanup.c
@@ -161,7 +161,8 @@ clean_NOT_intree(NODE *node)
NODE *res = node;
Assert(node->valnode->qoperator.oper == OP_AND ||
- node->valnode->qoperator.oper == OP_PHRASE);
+ node->valnode->qoperator.oper == OP_PHRASE ||
+ node->valnode->qoperator.oper == OP_AROUND);
node->left = clean_NOT_intree(node->left);
node->right = clean_NOT_intree(node->right);
@@ -277,7 +278,8 @@ clean_stopword_intree(NODE *node, int *ladd, int *radd)
node->right = clean_stopword_intree(node->right, &rladd, &rradd);
/* Check if current node is OP_PHRASE, get its distance */
- isphrase = (node->valnode->qoperator.oper == OP_PHRASE);
+ isphrase = (node->valnode->qoperator.oper == OP_PHRASE ||
+ node->valnode->qoperator.oper == OP_AROUND);
ndistance = isphrase ? node->valnode->qoperator.distance : 0;
if (node->left == NULL && node->right == NULL)
diff --git a/src/backend/utils/adt/tsquery_op.c b/src/backend/utils/adt/tsquery_op.c
index 755c3e9..7cf9b8a 100644
--- a/src/backend/utils/adt/tsquery_op.c
+++ b/src/backend/utils/adt/tsquery_op.c
@@ -37,7 +37,7 @@ join_tsqueries(TSQuery a, TSQuery b, int8 operator, uint16 distance)
res->valnode = (QueryItem *) palloc0(sizeof(QueryItem));
res->valnode->type = QI_OPR;
res->valnode->qoperator.oper = operator;
- if (operator == OP_PHRASE)
+ if (operator == OP_PHRASE || operator == OP_AROUND)
res->valnode->qoperator.distance = distance;
res->child = (QTNode **) palloc0(sizeof(QTNode *) * 2);
diff --git a/src/backend/utils/adt/tsquery_util.c b/src/backend/utils/adt/tsquery_util.c
index 971bb81..548a846 100644
--- a/src/backend/utils/adt/tsquery_util.c
+++ b/src/backend/utils/adt/tsquery_util.c
@@ -121,7 +121,7 @@ QTNodeCompare(QTNode *an, QTNode *bn)
return res;
}
- if (ao->oper == OP_PHRASE && ao->distance != bo->distance)
+ if ((ao->oper == OP_PHRASE || ao->oper == OP_AROUND) && ao->distance != bo->distance)
return (ao->distance > bo->distance) ? -1 : 1;
return 0;
@@ -171,7 +171,8 @@ QTNSort(QTNode *in)
for (i = 0; i < in->nchild; i++)
QTNSort(in->child[i]);
- if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE)
+ if (in->nchild > 1 && in->valnode->qoperator.oper != OP_PHRASE
+ && in->valnode->qoperator.oper != OP_AROUND)
qsort((void *) in->child, in->nchild, sizeof(QTNode *), cmpQTN);
}
diff --git a/src/backend/utils/adt/tsrank.c b/src/backend/utils/adt/tsrank.c
index 4577bcc..b62b14d 100644
--- a/src/backend/utils/adt/tsrank.c
+++ b/src/backend/utils/adt/tsrank.c
@@ -366,7 +366,8 @@ calc_rank(const float *w, TSVector t, TSQuery q, int32 method)
/* XXX: What about NOT? */
res = (item->type == QI_OPR && (item->qoperator.oper == OP_AND ||
- item->qoperator.oper == OP_PHRASE)) ?
+ item->qoperator.oper == OP_PHRASE ||
+ item->qoperator.oper == OP_AROUND)) ?
calc_rank_and(w, t, q) :
calc_rank_or(w, t, q);
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 8225202..ce7ee25 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -1429,6 +1429,7 @@ checkcondition_str(void *checkval, QueryOperand *val, ExecPhraseData *data)
#define TSPO_L_ONLY 0x01 /* emit positions appearing only in L */
#define TSPO_R_ONLY 0x02 /* emit positions appearing only in R */
#define TSPO_BOTH 0x04 /* emit positions appearing in both L&R */
+#define TSPO_NOT_EXAC 0x08 /* not exact distance for AROUND(X) */
static bool
TS_phrase_output(ExecPhraseData *data,
@@ -1473,8 +1474,18 @@ TS_phrase_output(ExecPhraseData *data,
Rpos = INT_MAX;
}
+ /* Processing OP_AROUND */
+ if ((emit & TSPO_NOT_EXAC) &&
+ Lpos - Rpos >= 0 &&
+ Lpos - Rpos <= (Loffset + Roffset) * 2 - Rdata->width + Ldata->width)
+ {
+ if (emit & TSPO_BOTH)
+ output_pos = Rpos;
+ Lindex++;
+ Rindex++;
+ }
/* Merge-join the two input lists */
- if (Lpos < Rpos)
+ else if (Lpos < Rpos)
{
/* Lpos is not matched in Rdata, should we output it? */
if (emit & TSPO_L_ONLY)
@@ -1625,6 +1636,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
}
case OP_PHRASE:
+ case OP_AROUND:
case OP_AND:
memset(&Ldata, 0, sizeof(Ldata));
memset(&Rdata, 0, sizeof(Rdata));
@@ -1647,7 +1659,7 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
(Rdata.npos == 0 && !Rdata.negate))
return (flags & TS_EXEC_PHRASE_NO_POS) ? true : false;
- if (curitem->qoperator.oper == OP_PHRASE)
+ if (curitem->qoperator.oper == OP_PHRASE || curitem->qoperator.oper == OP_AROUND)
{
/*
* Compute Loffset and Roffset suitable for phrase match, and
@@ -1703,7 +1715,8 @@ TS_phrase_execute(QueryItem *curitem, void *arg, uint32 flags,
{
/* straight AND */
return TS_phrase_output(data, &Ldata, &Rdata,
- TSPO_BOTH,
+ TSPO_BOTH |
+ (curitem->qoperator.oper == OP_AROUND ? TSPO_NOT_EXAC : 0),
Loffset, Roffset,
Min(Ldata.npos, Rdata.npos));
}
@@ -1843,6 +1856,7 @@ TS_execute(QueryItem *curitem, void *arg, uint32 flags,
return TS_execute(curitem + 1, arg, flags, chkcond);
case OP_PHRASE:
+ case OP_AROUND:
return TS_phrase_execute(curitem, arg, flags, chkcond, NULL);
default:
@@ -1882,9 +1896,10 @@ tsquery_requires_match(QueryItem *curitem)
return false;
case OP_PHRASE:
+ case OP_AROUND:
/*
- * Treat OP_PHRASE as OP_AND here
+ * Treat OP_PHRASE and OP_AROUND as OP_AND here
*/
case OP_AND:
/* If either side requires a match, we're good */
diff --git a/src/include/tsearch/ts_type.h b/src/include/tsearch/ts_type.h
index 30d7c4b..ee1a184 100644
--- a/src/include/tsearch/ts_type.h
+++ b/src/include/tsearch/ts_type.h
@@ -166,8 +166,9 @@ typedef struct
#define OP_NOT 1
#define OP_AND 2
#define OP_OR 3
-#define OP_PHRASE 4 /* highest code, tsquery_cleanup.c */
-#define OP_COUNT 4
+#define OP_PHRASE 4
+#define OP_AROUND 5 /* highest code, tsquery_cleanup.c */
+#define OP_COUNT 5
extern const int tsearch_op_priority[OP_COUNT];
queryto_tsquery4_02_queryto_tsquery.patch (text/x-diff)
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index cf55e39..0f26d2d 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -390,7 +390,8 @@ add_to_tsvector(void *_state, char *elem_value, int elem_len)
* and different variants are ORed together.
*/
static void
-pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval, int16 weight, bool prefix)
+pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
+ int16 weight, bool prefix, bool isphrase)
{
int32 count = 0;
ParsedText prs;
@@ -423,7 +424,12 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
/* put placeholders for each missing stop word */
pushStop(state);
if (cntpos)
- pushOperator(state, data->qoperator, 1);
+ {
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
+ }
cntpos++;
pos++;
}
@@ -464,7 +470,10 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
if (cntpos)
{
/* distance may be useful */
- pushOperator(state, data->qoperator, 1);
+ if (isphrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
}
cntpos++;
@@ -490,6 +499,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
+ false,
false);
PG_RETURN_TSQUERY(query);
@@ -520,7 +530,8 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_POINTER(query);
}
@@ -551,7 +562,8 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +579,36 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ false,
+ true);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
\ No newline at end of file
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 9c04391..a4df90d 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -63,6 +63,7 @@ struct TSQueryParserStateData
#define WAITOPERATOR 2
#define WAITFIRSTOPERAND 3
#define WAITSINGLEOPERAND 4
+#define WAITQUOTE 5
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
@@ -281,7 +282,8 @@ parse_around_operator(char *buf, int16 *distance)
static ts_tokentype
gettoken_query(TSQueryParserState state,
int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+ int *lenval, char **strval, int16 *weight, bool *prefix,
+ bool isquery)
{
*weight = 0;
*prefix = false;
@@ -292,7 +294,8 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if (t_iseq(state->buf, '!') ||
+ (isquery && t_iseq(state->buf, '-')))
{
(state->buf)++; /* can safely ++, t_iseq guarantee that
* pg_mblen()==1 */
@@ -314,6 +317,20 @@ gettoken_query(TSQueryParserState state,
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (isquery && t_iseq(state->buf, '"'))
+ {
+ char *quote = strchr(state->buf + 1, '"');
+ if (quote == NULL)
+ {
+ state->buf++;
+ continue;
+ }
+ *strval = state->buf + 1;
+ *lenval = quote - state->buf - 1;
+ state->buf = quote + 1;
+ state->state = WAITQUOTE;
+ return PT_VAL;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -351,6 +368,13 @@ gettoken_query(TSQueryParserState state,
(state->buf)++;
return PT_OPR;
}
+ else if (isquery && has_prefix(state->buf, "OR "))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ (state->buf) += 3;
+ return PT_OPR;
+ }
else if (t_iseq(state->buf, '<'))
{
state->state = WAITOPERAND;
@@ -361,7 +385,7 @@ gettoken_query(TSQueryParserState state,
return PT_ERR;
return PT_OPR;
}
- else if (has_prefix(state->buf, "AROUND("))
+ else if (isquery && has_prefix(state->buf, "AROUND("))
{
state->state = WAITOPERAND;
*operator = OP_AROUND;
@@ -377,8 +401,23 @@ gettoken_query(TSQueryParserState state,
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
+ else if (t_iseq(state->buf, '('))
+ {
+ *operator = OP_AND;
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
else if (*(state->buf) == '\0')
return (state->count) ? PT_ERR : PT_END;
+ else if (isquery &&
+ (t_isalpha(state->buf) || t_iseq(state->buf, '!')
+ || t_iseq(state->buf, '-')
+ || t_iseq(state->buf, '"')))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
else if (!t_isspace(state->buf))
return PT_ERR;
break;
@@ -390,6 +429,9 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
+ case WAITQUOTE:
+ state->state = WAITOPERATOR;
+ continue;
default:
return PT_ERR;
break;
@@ -546,7 +588,8 @@ cleanOpStack(TSQueryParserState state,
static void
makepol(TSQueryParserState state,
PushFunction pushval,
- Datum opaque)
+ Datum opaque,
+ bool isquery)
{
int8 operator = 0;
ts_tokentype type;
@@ -560,19 +603,20 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight,
+ &prefix, isquery)) != PT_END)
{
switch (type)
{
case PT_VAL:
- pushval(opaque, state, strval, lenval, weight, prefix);
+ pushval(opaque, state, strval, lenval, weight, prefix, state->state == WAITQUOTE);
break;
case PT_OPR:
cleanOpStack(state, opstack, &lenstack, operator);
pushOpStack(opstack, &lenstack, operator, weight);
break;
case PT_OPEN:
- makepol(state, pushval, opaque);
+ makepol(state, pushval, opaque, isquery);
break;
case PT_CLOSE:
cleanOpStack(state, opstack, &lenstack, OP_OR /* lowest */ );
@@ -677,7 +721,8 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ bool isplain,
+ bool isquery)
{
struct TSQueryParserStateData state;
int i;
@@ -704,7 +749,7 @@ parse_tsquery(char *buf,
*(state.curop) = '\0';
/* parse query & make polish notation (postfix, but in reverse order) */
- makepol(&state, pushval, opaque);
+ makepol(&state, pushval, opaque, isquery);
close_tsvector_parser(state.valstate);
@@ -775,7 +820,7 @@ parse_tsquery(char *buf,
static void
pushval_asis(Datum opaque, TSQueryParserState state, char *strval, int lenval,
- int16 weight, bool prefix)
+ int16 weight, bool prefix, bool isphrase)
{
pushValue(state, strval, lenval, weight, prefix);
}
@@ -788,7 +833,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false, false));
}
/*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c969375..3110503 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4906,6 +4906,8 @@ DATA(insert OID = 3746 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2
DESCR("make tsquery");
DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
@@ -4914,6 +4916,8 @@ DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1
DESCR("make tsquery");
DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ plainto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index 782548c..ffb1762 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -44,11 +44,12 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
char *token, int tokenlen,
int16 tokenweights, /* bitmap as described in
* QueryOperand struct */
- bool prefix);
+ bool prefix,
+ bool isphrase);
extern TSQuery parse_tsquery(char *buf,
PushFunction pushval,
- Datum opaque, bool isplain);
+ Datum opaque, bool isplain, bool isquery);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12..c16cfca 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,142 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+--test websearch_to_tsquery function
+select websearch_to_tsquery('My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('"A fat cat" has just eaten OR -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('"A fat cat" has just (eaten OR -rat)');
+ websearch_to_tsquery
+----------------------------------------
+ 'fat' <-> 'cat' & ( 'eaten' | !'rat' )
+(1 row)
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"gnu debugger" AROUND(5) runs');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('run AROUND(5) "gnu debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"gnu debugger" AROUND(6) runs');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('run AROUND(6) "gnu debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+ ?column?
+----------
+ f
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+ ?column?
+----------
+ t
+(1 row)
+
+select websearch_to_tsquery('"fat cat AROUND(5) rat"');
+ websearch_to_tsquery
+------------------------------------------------
+ 'fat' <-> 'cat' <-> 'around' <-> '5' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple','"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b..0ef6e55 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,43 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+--test websearch_to_tsquery function
+select websearch_to_tsquery('My brand new smartphone');
+select websearch_to_tsquery('My brand "new smartphone"');
+select websearch_to_tsquery('"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('"A fat cat" has just eaten OR -rat.');
+select websearch_to_tsquery('"A fat cat" has just (eaten OR -rat)');
+
+-- testing AROUND operator evaluation
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"gnu debugger" AROUND(5) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('run AROUND(5) "gnu debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"gnu debugger" AROUND(6) runs');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('run AROUND(6) "gnu debugger"');
+
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"many programming languages" AROUND(10) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"portable debugger" AROUND(10) "many programming languages"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"many programming languages" AROUND(11) "portable debugger"');
+select to_tsvector('The GNU Debugger is a portable debugger that runs on many Unix like systems and works for many programming languages') @@
+websearch_to_tsquery('"portable debugger" AROUND(11) "many programming languages"');
+
+select websearch_to_tsquery('"fat cat AROUND(5) rat"');
+select websearch_to_tsquery('simple','"fat cat OR rat"');
+select websearch_to_tsquery('fat*rat');
+select websearch_to_tsquery('fat-rat');
+
+select websearch_to_tsquery('"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
\ No newline at end of file
queryto_tsquery4_03_documentation.patch (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4dd9d02..e3479b0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9535,6 +9535,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<row>
<entry>
<indexterm>
+ <primary>websearch_to_tsquery</primary>
+ </indexterm>
+ <literal><function>websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type> , </optional> <replaceable class="parameter">query</replaceable> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</type> from websearch like query</entry>
+ <entry><literal>websearch_to_tsquery('english', 'The Fat Rats')</literal></entry>
+ <entry><literal>'fat' & 'rat'</literal></entry>
+ </row>
+ <row>
+ <entry>
+ <indexterm>
<primary>querytree</primary>
</indexterm>
<literal><function>querytree(<replaceable class="parameter">query</replaceable> <type>tsquery</type>)</function></literal>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 4dc52ec..6f9c040 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,15 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
<function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ about its input. <function>websearch_to_tsquery</function> provides a different syntax,
+ similar to one used in web search engines, to create tsqeury.
</para>
<indexterm>
@@ -960,8 +962,72 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
-----------------------------
'fat' <-> 'rat' <-> 'c'
</screen>
- </para>
+</para>
+
+<synopsis>
+websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
+</synopsis>
+ <para>
+ <function>websearch_to_tsquery</function> creates a <type>tsquery</type> from a unformated text.
+ But instead of <function>plainto_tsquery</function> and <function>phraseto_tsquery</function> it won't
+ ignore already placed operations. This function supports following operators:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <literal>some text</literal> - any text inside quote signs will be treated as a phrase and will be
+ performed like in <function>phraseto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>OR</literal> - standard logical operator. It is just an alias for <literal>|</literal> sign.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>terma AROUND(N) termb</literal> - this operation will match if the distance between
+ terma and termb is less than N.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>-</literal> - standard logical negation sign. It is an alias for <literal>!</literal> sign.
+ </para>
+ </listitem>
+ </itemizedlist>
+ Other missing operators will be replaced by AND like in <function>plainto_tsquery</function>.
+ </para>
+ <para>
+ Examples:
+ <screen>
+ select websearch_to_tsquery('The fat rats');
+ websearch_to_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('"supernovae stars" AND -crab');
+ websearch_to_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('-run AROUND(5) "gnu debugger" OR "I like bananas"');
+ websearch_to_tsquery
+ -----------------------------------------------------------
+ !'run' AROUND(5) 'gnu' <-> 'debugg' | 'like' <-> 'banana'
+ (1 row)
+ </screen>
+ </para>
+ <para>
+ Note that in the examples <function>websearch_to_tsquery</function> didn't ignore
+ operators if they were placed and put required operations
+ if they were skipped. In case of inquote text <function>websearch_to_tsquery</function>
+ has placed phrase operator and & in other cases.
+ </para>
</sect2>
<sect2 id="textsearch-ranking">
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Here are a few minor issues:
```
+/*
+ * Checks if 'str' starts with a 'prefix'
+ */
+static bool
+has_prefix(char * str, char * prefix)
+{
+ if (strlen(prefix) > strlen(str))
+ {
+ return false;
+ }
+ return strstr(str, prefix) == str;
+}
```
strlen() check is redundant.
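For illustration, a prefix test that needs no separate length check can be written with strncmp(), which stops at the first mismatch or at the terminating NUL of the shorter string. This is only a sketch: the name has_prefix_sketch is invented here and is not part of the patch.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical variant of has_prefix() (name invented for this sketch).
 * strncmp() compares at most strlen(prefix) bytes and stops at the NUL of
 * a shorter 'str', so no explicit length comparison is required.
 */
static bool
has_prefix_sketch(const char *str, const char *prefix)
{
	return strncmp(str, prefix, strlen(prefix)) == 0;
}
```

Unlike the strstr() version, this never scans past the first strlen(prefix) bytes of the input, so a long query string with no match is not searched to the end.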
```
+ case OP_AROUND:
+ snprintf(in->cur, 256, " AROUND(%d) %s", distance, nrm.buf);
+ break;
```
Instead of the constant 256 it's better to use sizeof().
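A sketch of what the sizeof() suggestion might look like. It assumes the destination is a true array whose size is visible at the call site; the quoted fragment does not show how in->cur is declared, and if it is a pointer into a larger buffer, sizeof would yield the pointer size and the remaining space would have to be passed explicitly instead. The helper name and parameters below are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format " AROUND(n) operand" into dst without ever
 * writing more than dstlen bytes (including the terminating NUL).
 * Returns the snprintf() result, i.e. the length the full string needs,
 * so the caller can detect truncation by comparing it with dstlen.
 */
static int
format_around(char *dst, size_t dstlen, int distance, const char *operand)
{
	return snprintf(dst, dstlen, " AROUND(%d) %s", distance, operand);
}
```

With a local array such as `char buf[64];` the call becomes `format_around(buf, sizeof(buf), 5, "'rat'")`, so the bound follows the declaration automatically if the array size ever changes.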
Apart from these issues, the patch looks good.
The new status of this patch is: Ready for Committer
On 2017-11-29 17:56:30 +0300, Victor Drobny wrote:
Thank you for the review. I have tried to address all of your comments.
However, I want to mention that the absence of comments for functions
in to_tsany.c is justified by the absence of comments for other
similar functions.
That's not a justification. Tsquery-related code is notorious for being
badly commented; we do not want to continue that.
Greetings,
Andres Freund
It seems that this patch doesn't apply anymore, see http://commitfest.cputube.org/
The new status of this patch is: Waiting on Author
Hi Victor,
On 3/5/18 7:52 AM, Aleksander Alekseev wrote:
It seems that this patch doesn't apply anymore, see http://commitfest.cputube.org/
The new status of this patch is: Waiting on Author
This patch needs a rebase and should address the comments from
Aleksander and Andres. We are now three weeks into the CF with no new
patch.
Are you planning to provide a new patch? If not, I think it should be
marked as Returned with Feedback and submitted to the next CF once it
has been updated.
Regards,
--
-David
david@pgmasters.net
Hi David,
I'd like to take over from Victor. I'll post a revised version of the
patch in a couple of days.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
I am extending the phrase operator <n> in such a way that it will have
<n,m> syntax, meaning from n to m words, so I will use that syntax (<n,m>)
below. I found that a AROUND(N) b is exactly the same as a <-N,N> b,
so it can be replaced during parsing. What do you think of this
idea? I have also noticed some non-obvious behavior in this patch.
I think the new operator should be the subject of a separate patch, and I
prefer the idea of a range phrase operator.
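To make the claimed equivalence concrete, here is a small sketch over single word positions. It assumes that the proposed <n,m> operator means "the right operand is between n and m positions after the left one", with negative values meaning "before"; that syntax is only a proposal in this thread, and both function names are invented for illustration.

```c
#include <stdbool.h>

/* a <lo,hi> b: b's position minus a's position lies in [lo, hi]. */
static bool
range_phrase_match(int lpos, int rpos, int lo, int hi)
{
	int			delta = rpos - lpos;

	return delta >= lo && delta <= hi;
}

/* a AROUND(n) b: the operands are at most n positions apart, either order. */
static bool
around_match(int lpos, int rpos, int n)
{
	int			delta = rpos - lpos;

	if (delta < 0)
		delta = -delta;
	return delta <= n;
}
```

Under these assumptions around_match(l, r, n) and range_phrase_match(l, r, -n, n) agree for every pair of positions, which is what would allow the rewrite to happen at parse time.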
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Thu, 22 Mar 2018 16:53:15 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
Hi David,
I'd like to take over from Victor. I'll post a revised version of the
patch in a couple of days.
Hi Dmitry,
Recently I worked with the old version of the patch and found a bug,
so I think it is better to notify you right away so that you can fix it
in the rebased/revised version.
I noticed that the AROUND(N) operator works only when neither operand
is negated. If either operand is negated, it behaves like the phrase
operator <N>. This is caused by the missing TSPO_NOT_EXAC flag and the
missing AROUND(N) operator check in the branches of TS_phrase_execute
that handle negated operands.
--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Recently I worked with the old version of the patch and found a bug,
so I think it is better to notify you right away so that you can fix it
in the rebased/revised version.
I noticed that the AROUND(N) operator works only when neither operand
is negated. If either operand is negated, it behaves like the phrase
operator <N>. This is caused by the missing TSPO_NOT_EXAC flag and the
missing AROUND(N) operator check in the branches of TS_phrase_execute
that handle negated operands.
Good to know, thanks! To be honest, I'm sure that Teodor is right: it's
better to implement the AROUND(N) operator using <N, M> once that is
committed. The next version of the patch won't support AROUND(N). I have
to fix a few more questionable things, though.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Mar 26, 2018 at 9:51 PM, Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
Recently I worked with the old version of the patch and found a bug,
so I think it is better to notify you right away so that you can fix it
in the rebased/revised version.
I noticed that the AROUND(N) operator works only when neither operand
is negated. If either operand is negated, it behaves like the phrase
operator <N>. This is caused by the missing TSPO_NOT_EXAC flag and the
missing AROUND(N) operator check in the branches of TS_phrase_execute
that handle negated operands.
Good to know, thanks! To be honest, I'm sure that Teodor is right: it's
better to implement the AROUND(N) operator using <N, M> once that is
committed. The next version of the patch won't support AROUND(N). I have
to fix a few more questionable things, though.
Hi,
I took a quick look at the language in the last version of the patches.
Patch 01:
+ errmsg("Invalid AROUND(X) operator!")));
s/I/i/;s/!//
+ errmsg("Missing ')' in AROUND(X) operator")));
s/M/m/
Patch 03 (the documentation) needed some proof-reading. I've attached
a new version of that patch with some small suggested improvements.
Questions I had while reading the documentation without looking at the
code: Is there anything to_tsquery() can do that
websearch_to_tsquery() can't? The documentation doesn't mention
parentheses, but I can see that they are in fact supported from the
regression test. Would it be OK to use user-supplied websearch
strings? I.e., can it produce a syntax error? Well, clearly it can, see
above -- perhaps that should be explicitly documented? Is there any
way to write OR as a term (that's a valuable non-stopword in French)?
It seems like AROUND(x) should be documented also more generally for
tsquery, but I see there is some discussion about how that should
look.
By the way, not this patch's fault, but I noticed that commit
f5f1355dc4d did this:
- (errmsg("query contains only stopword(s) or doesn't contain lexeme(s), ignored")));
+ (errmsg("text-search query contains only stop words or doesn't contain lexemes, ignored")));
But the old text still appears in an example in doc/src/sgml/textsearch.sgml.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
docs.patch (application/octet-stream)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7b1a85fc71..f79297bb63 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry>
<entry><literal>'fat' <-> 'rat'</literal></entry>
</row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>websearch_to_tsquery</primary>
+ </indexterm>
+ <literal><function>websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type> , </optional> <replaceable class="parameter">query</replaceable> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</type> from a web search style query</entry>
+ <entry><literal>websearch_to_tsquery('english', 'The Fat Rats')</literal></entry>
+ <entry><literal>'fat' & 'rat'</literal></entry>
+ </row>
<row>
<entry>
<indexterm>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..4786f362d2 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
- <function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ <function>phraseto_tsquery</function>, but it is less forgiving about its
+ input. <function>websearch_to_tsquery</function> offers the same features
+ as <function>to_tsquery</function> but uses an alternative syntax, similar
+ to the one used by web search engines.
</para>
<indexterm>
@@ -962,6 +965,84 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
</screen>
</para>
+<synopsis>
+websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
+</synopsis>
+
+ <para>
+ <function>websearch_to_tsquery</function> creates a <type>tsquery</type>
+ value from <replaceable>querytext</replaceable> using an alternative
+ syntax in which simple unformatted text is a valid query.
+ Unlike <function>plainto_tsquery</function>
+ and <function>phraseto_tsquery</function>, it also recognizes certain
+ operators. The following syntax is supported:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <literal>unquoted text</literal>: text not inside quote marks will be
+ converted to terms separated by <literal>&</literal> operators, as
+ if processed by
+ <function>plainto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>"quoted text"</literal>: text inside quote marks will be
+ converted to terms separated by <literal><-></literal>
+ operators, as if processed by <function>phraseto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>AND</literal>: logical and will be converted to
+ the <literal>&</literal> operator. Not normally needed because
+ it's implicit between terms (except in quotes).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>OR</literal>: logical or will be converted to
+ the <literal>|</literal> operator.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>a AROUND(n) b</literal>: will match if the distance between a
+ and b is less than n.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>-</literal>: the logical not operator, converted to
+ the <literal>!</literal> operator.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Examples:
+ <screen>
+ select websearch_to_tsquery('The fat rats');
+ websearch_to_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('"supernovae stars" AND -crab');
+ websearch_to_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('-run AROUND(5) "gnu debugger" OR "I like bananas"');
+ websearch_to_tsquery
+ -----------------------------------------------------------
+ !'run' AROUND(5) 'gnu' <-> 'debugg' | 'like' <-> 'banana'
+ (1 row)
+ </screen>
+ </para>
</sect2>
<sect2 id="textsearch-ranking">
Patch 03 (the documentation) needed some proof-reading. I've attached
a new version of that patch with some small suggested improvements.
Thanks, I'm definitely going to use this.
Is there anything to_tsquery() can do that websearch_to_tsquery()
can't?
Currently, no.
Would it be OK to use user-supplied websearch strings?
I.e., can it produce a syntax error?
I believe that's the most important question. After a private discussion
with Theodor I came to the conclusion that the most beneficial outcome
would be to suppress all syntax errors and give the user some result,
dropping all misused operators along the way. This requires some changes,
though.
Is there any way to write OR as a term (that's a valuable non-stopword
in French)?
You could quote it like this: websearch_to_tsquery('"or"');
Moreover, it's still possible to use & and |.
It seems like AROUND(x) should be documented also more generally for
tsquery, but I see there is some discussion about how that should
look.
Personally, I like the <N, M> operator better. It would instantly deprecate
AROUND(N), which is why I'm going to drop it.
By the way, not this patch's fault, but I noticed that commit
f5f1355dc4d did this:
- (errmsg("query contains only stopword(s) or doesn't contain lexeme(s), ignored")));
+ (errmsg("text-search query contains only stop words or doesn't contain lexemes, ignored")));
But the old text still appears in an example in
doc/src/sgml/textsearch.sgml.
Will fix this.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi everyone,
I'd like to share some intermediate results. Here's what has changed:
1. OR operator is now case-insensitive. Moreover, trailing whitespace is
no longer used to identify it:
select websearch_to_tsquery('simple', 'abc or');
websearch_to_tsquery
----------------------
'abc' & 'or'
(1 row)
select websearch_to_tsquery('simple', 'abc or(def)');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
select websearch_to_tsquery('simple', 'abc or!def');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
2. AROUND(N) has been dropped. I hope that the <N, M> operator will allow us
to implement it with a few lines of code.
3. websearch_to_tsquery() now tolerates various syntax errors, for
instance:
Misused operators:
'abc &'
'| abc'
'<- def'
Missing parentheses:
'abc & (def <-> (cat or rat'
Other sorts of nonsense:
'abc &--|| def' => 'abc' & !!'def'
'abc:def' => 'abc':D & 'ef'
This, however, doesn't mean that the result will always be adequate (who
would have thought?). Overall, the current implementation follows the GIGO
principle. In theory, this would allow us to use user-supplied websearch
strings (but see the gotchas below), even if they don't make much sense.
Better than nothing, right?
4. A small refactoring: I've replaced all WAIT* macros with an enum for
better debugging (names look much nicer in GDB). Hope this is
acceptable.
5. Finally, I've added a few more comments and tests. I haven't checked
the code coverage, though.
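For readers skimming item 1, the rule for when "or" counts as an operator can be restated as a small standalone C sketch. This is not the patch's code: looks_like_or_operator() is a hypothetical name, and the sketch is ASCII-only, whereas the patch's parse_or_operator() goes through t_iseq(), t_isalpha() and t_isdigit(), so multibyte input such as 'orтест' would be classified differently here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, ASCII-only restatement of the "is this OR an operator?" rule.
 * buf points at the candidate "or" inside a NUL-terminated query string.
 */
static bool
looks_like_or_operator(const char *buf)
{
    unsigned char next;

    assert(buf != NULL);

    /* "or" in any letter case; short-circuiting avoids reading past the NUL */
    if (tolower((unsigned char) buf[0]) != 'o' ||
        tolower((unsigned char) buf[1]) != 'r')
        return false;

    /* a trailing "or" at the very end of the string is treated as a term */
    next = (unsigned char) buf[2];
    if (next == '\0')
        return false;

    /* "orange", "OR1234", "or-abc" and "OR_abc" are words, not operators */
    return !(isalnum(next) || next == '-' || next == '_');
}
```

With this rule, 'abc or' keeps 'or' as a plain term, while 'abc or(def)' treats it as an operator, matching the examples in item 1.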
A few gotchas:
I haven't touched gettoken_tsvector() yet. As a result, the following
queries produce errors:
select websearch_to_tsquery('simple', '''');
ERROR: syntax error in tsquery: "'"
select websearch_to_tsquery('simple', '\');
ERROR: there is no escaped character: "\"
Maybe there's more. The question is: should we fix those, or is it fine
as it is? I don't have a strong opinion about this.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
websearch_to_tsquery_v1.diff (text/x-diff)
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..bdf05236cf 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -390,7 +390,8 @@ add_to_tsvector(void *_state, char *elem_value, int elem_len)
* and different variants are ORed together.
*/
static void
-pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval, int16 weight, bool prefix)
+pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
+ int16 weight, bool prefix, bool force_phrase)
{
int32 count = 0;
ParsedText prs;
@@ -423,7 +424,12 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
/* put placeholders for each missing stop word */
pushStop(state);
if (cntpos)
- pushOperator(state, data->qoperator, 1);
+ {
+ if (force_phrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
+ }
cntpos++;
pos++;
}
@@ -464,7 +470,10 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
if (cntpos)
{
/* distance may be useful */
- pushOperator(state, data->qoperator, 1);
+ if (force_phrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
}
cntpos++;
@@ -490,6 +499,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
+ false,
false);
PG_RETURN_TSQUERY(query);
@@ -520,7 +530,8 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_POINTER(query);
}
@@ -551,7 +562,8 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +579,36 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ false,
+ true);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..4b7460e5b9 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,12 +32,24 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3,
+ WAITSINGLEOPERAND = 4,
+ INQUOTES = 5 /* for quoted phrases in web search */
+} ts_parserstate;
+
struct TSQueryParserStateData
{
/* State for gettoken_query */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
+ ts_parserstate state;
int count; /* nesting count, incremented by (,
* decremented by ) */
@@ -57,12 +69,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -197,6 +203,25 @@ err:
return buf;
}
+/*
+ * Parse OR operator used in websearch_to_tsquery().
+ */
+static bool
+parse_or_operator(char *buf, int *len)
+{
+ bool is_or = (t_iseq(&buf[0], 'o') || t_iseq(&buf[0], 'O')) &&
+ (t_iseq(&buf[1], 'r') || t_iseq(&buf[1], 'R')) &&
+ (buf[2] != '\0' &&
+ !t_iseq(&buf[2], '-') &&
+ !t_iseq(&buf[2], '_') &&
+ !t_isalpha(&buf[2]) &&
+ !t_isdigit(&buf[2]));
+
+ *len = 2 + pg_mblen(&buf[2]);
+
+ return is_or;
+}
+
/*
* token types for parsing
*/
@@ -220,19 +245,22 @@ typedef enum
*/
static ts_tokentype
gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+ int8 *operator, int *lenval, char **strval,
+ int16 *weight, bool *prefix, bool isweb)
{
*weight = 0;
*prefix = false;
while (1)
{
+ int oplen = 0;
+
switch (state->state)
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if (t_iseq(state->buf, '!') ||
+ (isweb && t_iseq(state->buf, '-')))
{
(state->buf)++; /* can safely ++, t_iseq guarantee that
* pg_mblen()==1 */
@@ -249,11 +277,55 @@ gettoken_query(TSQueryParserState state,
}
else if (t_iseq(state->buf, ':'))
{
+ if (isweb)
+ {
+ /* it doesn't mean anything */
+ (state->buf)++;
+ continue;
+ }
+
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (isweb && t_iseq(state->buf, ')'))
+ {
+ if (state->count == 0)
+ {
+ /* web search tolerates useless closing parentheses */
+ (state->buf)++;
+ continue;
+ }
+ (state->buf)++;
+ state->count--;
+ return PT_CLOSE;
+ }
+ else if (isweb &&
+ (t_iseq(state->buf, '&') ||
+ t_iseq(state->buf, '|') ||
+ t_iseq(state->buf, '<')))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ (state->buf)++;
+ continue;
+ }
+ else if (isweb && t_iseq(state->buf, '"'))
+ {
+ /* quoted text should be ordered (<->) */
+ char *quote = strchr(state->buf + 1, '"');
+ if (quote == NULL)
+ {
+ /* web search tolerates missing quotes too */
+ state->buf++;
+ continue;
+ }
+ *strval = state->buf + 1;
+ *lenval = quote - *strval;
+ state->buf = quote + 1;
+ state->state = INQUOTES;
+ return PT_VAL;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -269,6 +341,16 @@ gettoken_query(TSQueryParserState state,
}
else if (state->state == WAITFIRSTOPERAND)
return PT_END;
+ else if (isweb)
+ {
+ if (state->count > 0)
+ /* decrement per each parentheses level (see PT_OPEN) */
+ state->count--;
+ else
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -291,26 +373,61 @@ gettoken_query(TSQueryParserState state,
(state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '<'))
+ else if (isweb && parse_or_operator(state->buf, &oplen))
{
state->state = WAITOPERAND;
- *operator = OP_PHRASE;
- /* weight var is used as storage for distance */
- state->buf = parse_phrase_operator(state->buf, weight);
- if (*weight < 0)
+ *operator = OP_OR;
+ (state->buf) += oplen;
+ return PT_OPR;
+ }
+ else if (t_iseq(state->buf, '<'))
+ {
+ int16 distance;
+ char *phrase = parse_phrase_operator(state->buf, &distance);
+ if (distance < 0)
+ {
+ if (isweb)
+ {
+ /* web search tolerates broken phrase operator */
+ (state->buf)++;
+ continue;
+ }
return PT_ERR;
+ }
+ state->buf = phrase;
+ *operator = OP_PHRASE;
+ *weight = distance; /* weight var is used as storage for distance */
+ state->state = WAITOPERAND;
return PT_OPR;
}
else if (t_iseq(state->buf, ')'))
{
+ if (isweb && state->count == 0)
+ {
+ /* web search tolerates useless closing parentheses */
+ (state->buf)++;
+ continue;
+ }
(state->buf)++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
else if (*(state->buf) == '\0')
- return (state->count) ? PT_ERR : PT_END;
+ {
+ /* web search tolerates unexpected end of line */
+ return (!isweb && state->count) ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
+ if (isweb)
+ {
+ /* put implicit AND if there's no operator */
+ *operator = OP_AND;
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
return PT_ERR;
+ }
break;
case WAITSINGLEOPERAND:
if (*(state->buf) == '\0')
@@ -320,9 +437,10 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
- default:
- return PT_ERR;
- break;
+ case INQUOTES:
+ /* phrase should be followed by an operator */
+ state->state = WAITOPERATOR;
+ continue;
}
state->buf += pg_mblen(state->buf);
}
@@ -475,7 +593,8 @@ cleanOpStack(TSQueryParserState state,
static void
makepol(TSQueryParserState state,
PushFunction pushval,
- Datum opaque)
+ Datum opaque,
+ bool isweb)
{
int8 operator = 0;
ts_tokentype type;
@@ -489,19 +608,21 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = gettoken_query(state, &operator, &lenval, &strval,
+ &weight, &prefix, isweb)) != PT_END)
{
switch (type)
{
case PT_VAL:
- pushval(opaque, state, strval, lenval, weight, prefix);
+ pushval(opaque, state, strval, lenval, weight, prefix,
+ state->state == INQUOTES /* force phrase operator */);
break;
case PT_OPR:
cleanOpStack(state, opstack, &lenstack, operator);
pushOpStack(opstack, &lenstack, operator, weight);
break;
case PT_OPEN:
- makepol(state, pushval, opaque);
+ makepol(state, pushval, opaque, isweb);
break;
case PT_CLOSE:
cleanOpStack(state, opstack, &lenstack, OP_OR /* lowest */ );
@@ -605,7 +726,8 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ bool isplain,
+ bool isweb)
{
struct TSQueryParserStateData state;
int i;
@@ -632,7 +754,7 @@ parse_tsquery(char *buf,
*(state.curop) = '\0';
/* parse query & make polish notation (postfix, but in reverse order) */
- makepol(&state, pushval, opaque);
+ makepol(&state, pushval, opaque, isweb);
close_tsvector_parser(state.valstate);
@@ -703,7 +825,7 @@ parse_tsquery(char *buf,
static void
pushval_asis(Datum opaque, TSQueryParserState state, char *strval, int lenval,
- int16 weight, bool prefix)
+ int16 weight, bool prefix, bool isphrase)
{
pushValue(state, strval, lenval, weight, prefix);
}
@@ -716,7 +838,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false, false));
}
/*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index bfc90098f8..00f1a85ae7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4950,6 +4950,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4958,6 +4960,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..098c9b9091 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -44,11 +44,12 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
char *token, int tokenlen,
int16 tokenweights, /* bitmap as described in
* QueryOperand struct */
- bool prefix);
+ bool prefix,
+ bool isphrase);
extern TSQuery parse_tsquery(char *buf,
PushFunction pushval,
- Datum opaque, bool isplain);
+ Datum opaque, bool isplain, bool isweb);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..265fd55fcc 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,408 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('()');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('(())');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('()()()');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('abc ()');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('() abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc & ()');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('() & abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('(');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('((');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('((( )) abc or def');
+ websearch_to_tsquery
+----------------------
+ 'abc' | 'def'
+(1 row)
+
+select websearch_to_tsquery('))');
+NOTICE: text-search query doesn't contain lexemes: "))"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery(')');
+NOTICE: text-search query doesn't contain lexemes: ")"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery(')(');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('& )( |');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('abc )( def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('abc | )( & def');
+ websearch_to_tsquery
+----------------------
+ 'abc' | 'def'
+(1 row)
+
+select websearch_to_tsquery('& abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc &');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('| abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc |');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('< abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc <');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('a:::b');
+ websearch_to_tsquery
+----------------------
+ 'b'
+(1 row)
+
+select websearch_to_tsquery('My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand:B "new -smartphone"');
+ websearch_to_tsquery
+-----------------------------------
+ 'brand':B & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand:Z "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------------
+ 'brand' & 'z' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My & (brand ("new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My & (brand ("new) -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & ( 'cat' | 'rat' )
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc orтест');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orтест'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+----------------------------------------
+ 'fat' <-> 'cat' & ( 'eaten' | !'rat' )
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -!-fine, "dear friend" OR good');
+ websearch_to_tsquery
+------------------------------------------
+ !!!'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+--------------------------
+ 'old' <-> 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..1bf9f80a9c 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,88 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('()');
+select websearch_to_tsquery('(())');
+select websearch_to_tsquery('()()()');
+select websearch_to_tsquery('abc ()');
+select websearch_to_tsquery('() abc');
+select websearch_to_tsquery('abc & ()');
+select websearch_to_tsquery('() & abc');
+
+select websearch_to_tsquery('(');
+select websearch_to_tsquery('((');
+select websearch_to_tsquery('((( )) abc or def');
+select websearch_to_tsquery('))');
+select websearch_to_tsquery(')');
+
+select websearch_to_tsquery(')(');
+select websearch_to_tsquery('& )( |');
+select websearch_to_tsquery('abc )( def');
+select websearch_to_tsquery('abc | )( & def');
+
+select websearch_to_tsquery('& abc');
+select websearch_to_tsquery('abc &');
+select websearch_to_tsquery('| abc');
+select websearch_to_tsquery('abc |');
+select websearch_to_tsquery('< abc');
+select websearch_to_tsquery('abc <');
+select websearch_to_tsquery('a:::b');
+
+select websearch_to_tsquery('My brand new smartphone');
+select websearch_to_tsquery('My brand "new smartphone"');
+select websearch_to_tsquery('My brand "new -smartphone"');
+select websearch_to_tsquery('My brand:B "new -smartphone"');
+select websearch_to_tsquery('My brand:Z "new -smartphone"');
+select websearch_to_tsquery('My & (brand ("new -smartphone"');
+select websearch_to_tsquery('My & (brand ("new) -smartphone"');
+
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc orтест');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+select websearch_to_tsquery('simple', 'abc or');
+
+select websearch_to_tsquery('simple', 'or OR or');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -!-fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
select websearch_to_tsquery('simple', 'abc or!def');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
This is wrong, of course; I've attached the fixed version.
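For reference, the rule that decides whether "or" is treated as an operator (parse_or_operator() in the attached patch) can be looked at in isolation. The sketch below is a simplified, ASCII-only stand-in and not the patch code itself: looks_like_or_operator is a hypothetical name, and it uses isalnum() where the patch uses the multibyte-aware t_isalpha() and t_isdigit(), so it would not classify input such as 'orтест' correctly.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical, ASCII-only approximation of parse_or_operator().
 * "or"/"OR" counts as an operator only when the character that follows
 * is not end-of-string, '-', '_', a letter, or a digit.
 */
bool
looks_like_or_operator(const char *buf)
{
    if (buf[0] != 'o' && buf[0] != 'O')
        return false;
    if (buf[1] != 'r' && buf[1] != 'R')
        return false;

    /* the character right after "or" decides */
    return buf[2] != '\0' &&
           buf[2] != '-' &&
           buf[2] != '_' &&
           !isalnum((unsigned char) buf[2]);
}
```

With this rule, "or!def" is recognized as an operator followed by "!def", so the negation stays with the next operand, which matches the regression test in the patch where 'fat or!rat' is expected to give 'fat' | !'rat'.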
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
websearch_to_tsquery_v2.diff (text/x-diff)
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..bdf05236cf 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -390,7 +390,8 @@ add_to_tsvector(void *_state, char *elem_value, int elem_len)
* and different variants are ORed together.
*/
static void
-pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval, int16 weight, bool prefix)
+pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
+ int16 weight, bool prefix, bool force_phrase)
{
int32 count = 0;
ParsedText prs;
@@ -423,7 +424,12 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
/* put placeholders for each missing stop word */
pushStop(state);
if (cntpos)
- pushOperator(state, data->qoperator, 1);
+ {
+ if (force_phrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
+ }
cntpos++;
pos++;
}
@@ -464,7 +470,10 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
if (cntpos)
{
/* distance may be useful */
- pushOperator(state, data->qoperator, 1);
+ if (force_phrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
}
cntpos++;
@@ -490,6 +499,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
+ false,
false);
PG_RETURN_TSQUERY(query);
@@ -520,7 +530,8 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_POINTER(query);
}
@@ -551,7 +562,8 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +579,36 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ false,
+ true);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..a5bec3f332 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,12 +32,24 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3,
+ WAITSINGLEOPERAND = 4,
+ INQUOTES = 5 /* for quoted phrases in web search */
+} ts_parserstate;
+
struct TSQueryParserStateData
{
/* State for gettoken_query */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
+ ts_parserstate state;
int count; /* nesting count, incremented by (,
* decremented by ) */
@@ -57,12 +69,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -197,6 +203,21 @@ err:
return buf;
}
+/*
+ * Parse OR operator used in websearch_to_tsquery().
+ */
+static bool
+parse_or_operator(char *buf)
+{
+ return (t_iseq(&buf[0], 'o') || t_iseq(&buf[0], 'O')) &&
+ (t_iseq(&buf[1], 'r') || t_iseq(&buf[1], 'R')) &&
+ (buf[2] != '\0' &&
+ !t_iseq(&buf[2], '-') &&
+ !t_iseq(&buf[2], '_') &&
+ !t_isalpha(&buf[2]) &&
+ !t_isdigit(&buf[2]));
+}
+
/*
* token types for parsing
*/
@@ -220,8 +241,8 @@ typedef enum
*/
static ts_tokentype
gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+ int8 *operator, int *lenval, char **strval,
+ int16 *weight, bool *prefix, bool isweb)
{
*weight = 0;
*prefix = false;
@@ -232,7 +253,8 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if (t_iseq(state->buf, '!') ||
+ (isweb && t_iseq(state->buf, '-')))
{
(state->buf)++; /* can safely ++, t_iseq guarantee that
* pg_mblen()==1 */
@@ -249,11 +271,55 @@ gettoken_query(TSQueryParserState state,
}
else if (t_iseq(state->buf, ':'))
{
+ if (isweb)
+ {
+ /* it doesn't mean anything */
+ (state->buf)++;
+ continue;
+ }
+
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (isweb && t_iseq(state->buf, ')'))
+ {
+ if (state->count == 0)
+ {
+ /* web search tolerates useless closing parentheses */
+ (state->buf)++;
+ continue;
+ }
+ (state->buf)++;
+ state->count--;
+ return PT_CLOSE;
+ }
+ else if (isweb &&
+ (t_iseq(state->buf, '&') ||
+ t_iseq(state->buf, '|') ||
+ t_iseq(state->buf, '<')))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ (state->buf)++;
+ continue;
+ }
+ else if (isweb && t_iseq(state->buf, '"'))
+ {
+ /* quoted text should be ordered (<->) */
+ char *quote = strchr(state->buf + 1, '"');
+ if (quote == NULL)
+ {
+ /* web search tolerates missing quotes too */
+ state->buf++;
+ continue;
+ }
+ *strval = state->buf + 1;
+ *lenval = quote - *strval;
+ state->buf = quote + 1;
+ state->state = INQUOTES;
+ return PT_VAL;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -269,6 +335,16 @@ gettoken_query(TSQueryParserState state,
}
else if (state->state == WAITFIRSTOPERAND)
return PT_END;
+ else if (isweb)
+ {
+ if (state->count > 0)
+ /* decrement per each parentheses level (see PT_OPEN) */
+ state->count--;
+ else
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -291,26 +367,61 @@ gettoken_query(TSQueryParserState state,
(state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '<'))
+ else if (isweb && parse_or_operator(state->buf))
{
state->state = WAITOPERAND;
- *operator = OP_PHRASE;
- /* weight var is used as storage for distance */
- state->buf = parse_phrase_operator(state->buf, weight);
- if (*weight < 0)
+ *operator = OP_OR;
+ (state->buf) += 2; /* strlen("OR") */
+ return PT_OPR;
+ }
+ else if (t_iseq(state->buf, '<'))
+ {
+ int16 distance;
+ char *phrase = parse_phrase_operator(state->buf, &distance);
+ if (distance < 0)
+ {
+ if (isweb)
+ {
+ /* web search tolerates broken phrase operator */
+ (state->buf)++;
+ continue;
+ }
return PT_ERR;
+ }
+ state->buf = phrase;
+ *operator = OP_PHRASE;
+ *weight = distance; /* weight var is used as storage for distance */
+ state->state = WAITOPERAND;
return PT_OPR;
}
else if (t_iseq(state->buf, ')'))
{
+ if (isweb && state->count == 0)
+ {
+ /* web search tolerates useless closing parentheses */
+ (state->buf)++;
+ continue;
+ }
(state->buf)++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
else if (*(state->buf) == '\0')
- return (state->count) ? PT_ERR : PT_END;
+ {
+ /* web search tolerates unexpected end of line */
+ return (!isweb && state->count) ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
+ if (isweb)
+ {
+ /* put implicit AND if there's no operator */
+ *operator = OP_AND;
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
return PT_ERR;
+ }
break;
case WAITSINGLEOPERAND:
if (*(state->buf) == '\0')
@@ -320,9 +431,10 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
- default:
- return PT_ERR;
- break;
+ case INQUOTES:
+ /* phrase should be followed by an operator */
+ state->state = WAITOPERATOR;
+ continue;
}
state->buf += pg_mblen(state->buf);
}
@@ -475,7 +587,8 @@ cleanOpStack(TSQueryParserState state,
static void
makepol(TSQueryParserState state,
PushFunction pushval,
- Datum opaque)
+ Datum opaque,
+ bool isweb)
{
int8 operator = 0;
ts_tokentype type;
@@ -489,19 +602,21 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = gettoken_query(state, &operator, &lenval, &strval,
+ &weight, &prefix, isweb)) != PT_END)
{
switch (type)
{
case PT_VAL:
- pushval(opaque, state, strval, lenval, weight, prefix);
+ pushval(opaque, state, strval, lenval, weight, prefix,
+ state->state == INQUOTES /* force phrase operator */);
break;
case PT_OPR:
cleanOpStack(state, opstack, &lenstack, operator);
pushOpStack(opstack, &lenstack, operator, weight);
break;
case PT_OPEN:
- makepol(state, pushval, opaque);
+ makepol(state, pushval, opaque, isweb);
break;
case PT_CLOSE:
cleanOpStack(state, opstack, &lenstack, OP_OR /* lowest */ );
@@ -605,7 +720,8 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ bool isplain,
+ bool isweb)
{
struct TSQueryParserStateData state;
int i;
@@ -632,7 +748,7 @@ parse_tsquery(char *buf,
*(state.curop) = '\0';
/* parse query & make polish notation (postfix, but in reverse order) */
- makepol(&state, pushval, opaque);
+ makepol(&state, pushval, opaque, isweb);
close_tsvector_parser(state.valstate);
@@ -703,7 +819,7 @@ parse_tsquery(char *buf,
static void
pushval_asis(Datum opaque, TSQueryParserState state, char *strval, int lenval,
- int16 weight, bool prefix)
+ int16 weight, bool prefix, bool isphrase)
{
pushValue(state, strval, lenval, weight, prefix);
}
@@ -716,7 +832,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false, false));
}
/*
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index bfc90098f8..00f1a85ae7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4950,6 +4950,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4958,6 +4960,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..098c9b9091 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -44,11 +44,12 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
char *token, int tokenlen,
int16 tokenweights, /* bitmap as described in
* QueryOperand struct */
- bool prefix);
+ bool prefix,
+ bool isphrase);
extern TSQuery parse_tsquery(char *buf,
PushFunction pushval,
- Datum opaque, bool isplain);
+ Datum opaque, bool isplain, bool isweb);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..06100768a7 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,408 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('()');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('(())');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('()()()');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('abc ()');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('() abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc & ()');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('() & abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('(');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('((');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('((( )) abc or def');
+ websearch_to_tsquery
+----------------------
+ 'abc' | 'def'
+(1 row)
+
+select websearch_to_tsquery('))');
+NOTICE: text-search query doesn't contain lexemes: "))"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery(')');
+NOTICE: text-search query doesn't contain lexemes: ")"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery(')(');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('& )( |');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('abc )( def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('abc | )( & def');
+ websearch_to_tsquery
+----------------------
+ 'abc' | 'def'
+(1 row)
+
+select websearch_to_tsquery('& abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc &');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('| abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc |');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('< abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc <');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('a:::b');
+ websearch_to_tsquery
+----------------------
+ 'b'
+(1 row)
+
+select websearch_to_tsquery('My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand:B "new -smartphone"');
+ websearch_to_tsquery
+-----------------------------------
+ 'brand':B & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand:Z "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------------
+ 'brand' & 'z' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My & (brand ("new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My & (brand ("new) -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & ( 'cat' | 'rat' )
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc orтест');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orтест'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+----------------------------------------
+ 'fat' <-> 'cat' & ( 'eaten' | !'rat' )
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -!-fine, "dear friend" OR good');
+ websearch_to_tsquery
+------------------------------------------
+ !!!'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+--------------------------
+ 'old' <-> 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..1bf9f80a9c 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,88 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('()');
+select websearch_to_tsquery('(())');
+select websearch_to_tsquery('()()()');
+select websearch_to_tsquery('abc ()');
+select websearch_to_tsquery('() abc');
+select websearch_to_tsquery('abc & ()');
+select websearch_to_tsquery('() & abc');
+
+select websearch_to_tsquery('(');
+select websearch_to_tsquery('((');
+select websearch_to_tsquery('((( )) abc or def');
+select websearch_to_tsquery('))');
+select websearch_to_tsquery(')');
+
+select websearch_to_tsquery(')(');
+select websearch_to_tsquery('& )( |');
+select websearch_to_tsquery('abc )( def');
+select websearch_to_tsquery('abc | )( & def');
+
+select websearch_to_tsquery('& abc');
+select websearch_to_tsquery('abc &');
+select websearch_to_tsquery('| abc');
+select websearch_to_tsquery('abc |');
+select websearch_to_tsquery('< abc');
+select websearch_to_tsquery('abc <');
+select websearch_to_tsquery('a:::b');
+
+select websearch_to_tsquery('My brand new smartphone');
+select websearch_to_tsquery('My brand "new smartphone"');
+select websearch_to_tsquery('My brand "new -smartphone"');
+select websearch_to_tsquery('My brand:B "new -smartphone"');
+select websearch_to_tsquery('My brand:Z "new -smartphone"');
+select websearch_to_tsquery('My & (brand ("new -smartphone"');
+select websearch_to_tsquery('My & (brand ("new) -smartphone"');
+
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc orтест');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+select websearch_to_tsquery('simple', 'abc or');
+
+select websearch_to_tsquery('simple', 'or OR or');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -!-fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
Hello Dmitry,
> A few gotchas:
> I haven't touched gettoken_tsvector() yet. As a result, the following
> queries produce errors:
> select websearch_to_tsquery('simple', '''');
> ERROR: syntax error in tsquery: "'"
> select websearch_to_tsquery('simple', '\');
> ERROR: there is no escaped character: "\"
> Maybe there's more. The question is: should we fix those, or it's fine as it
> is? I don't have a strong opinion about this.
It doesn't sound right to me to accept any input as a general rule but
sometimes return errors nevertheless. That API would be complicated for
the users. Thus I suggest to accept any garbage and try our best to
interpret it.
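The "accept any garbage" approach can be illustrated with unbalanced parentheses, which the patch already tolerates in web search mode. The sketch below is a standalone illustration under my own assumptions, not code from the patch: drop_unmatched_closers is a hypothetical helper that pre-filters a string, whereas the patch handles the same case inside gettoken_query() while tokenizing.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy src to dst, dropping every ')' that has no
 * matching '('. Returns the number of '(' still open at end of input,
 * which a tolerant parser can simply treat as closed.
 * dst must have room for strlen(src) + 1 bytes.
 */
int
drop_unmatched_closers(const char *src, char *dst)
{
    int depth = 0;

    for (; *src != '\0'; src++)
    {
        if (*src == '(')
            depth++;
        else if (*src == ')')
        {
            if (depth == 0)
                continue;   /* useless closing parenthesis: skip it */
            depth--;
        }
        *dst++ = *src;
    }
    *dst = '\0';
    return depth;
}
```

The point of the design is that no input makes the function fail: a stray ')' is dropped, and a missing ')' is only reported through the return value instead of raising an error.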
--
Best regards,
Aleksander Alekseev
Hello hackers,
On 2018-03-28 12:21, Aleksander Alekseev wrote:
> It doesn't sound right to me to accept any input as a general rule but
> sometimes return errors nevertheless. That API would be complicated for
> the users. Thus I suggest to accept any garbage and try our best to
> interpret it.
I agree with Aleksander about silencing all errors in
websearch_to_tsquery().
Attached is a revised patch that attempts to introduce the
ability to ignore syntax errors in gettoken_tsvector().
I have also read through the patch, and all the code looks good to me except
for one thing.
The name of the enum ts_parserstate looks more like the name of a function
than the name of a type.
In my version it is renamed to QueryParserState, but you can change it back
if I'm wrong.
--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
websearch_to_tsquery_v3.diff (text/x-diff; charset=us-ascii)
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..bdf05236cf 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -390,7 +390,8 @@ add_to_tsvector(void *_state, char *elem_value, int elem_len)
* and different variants are ORed together.
*/
static void
-pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval, int16 weight, bool prefix)
+pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
+ int16 weight, bool prefix, bool force_phrase)
{
int32 count = 0;
ParsedText prs;
@@ -423,7 +424,12 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
/* put placeholders for each missing stop word */
pushStop(state);
if (cntpos)
- pushOperator(state, data->qoperator, 1);
+ {
+ if (force_phrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
+ }
cntpos++;
pos++;
}
@@ -464,7 +470,10 @@ pushval_morph(Datum opaque, TSQueryParserState state, char *strval, int lenval,
if (cntpos)
{
/* distance may be useful */
- pushOperator(state, data->qoperator, 1);
+ if (force_phrase)
+ pushOperator(state, OP_PHRASE, 1);
+ else
+ pushOperator(state, data->qoperator, 1);
}
cntpos++;
@@ -490,6 +499,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
+ false,
false);
PG_RETURN_TSQUERY(query);
@@ -520,7 +530,8 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_POINTER(query);
}
@@ -551,7 +562,8 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ true,
+ false);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +579,36 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ false,
+ true);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..00e6218691 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,12 +32,24 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3,
+ WAITSINGLEOPERAND = 4,
+ INQUOTES = 5 /* for quoted phrases in web search */
+} QueryParserState;
+
struct TSQueryParserStateData
{
/* State for gettoken_query */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
+ QueryParserState state;
int count; /* nesting count, incremented by (,
* decremented by ) */
@@ -57,12 +69,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -197,6 +203,21 @@ err:
return buf;
}
+/*
+ * Parse OR operator used in websearch_to_tsquery().
+ */
+static bool
+parse_or_operator(char *buf)
+{
+ return (t_iseq(&buf[0], 'o') || t_iseq(&buf[0], 'O')) &&
+ (t_iseq(&buf[1], 'r') || t_iseq(&buf[1], 'R')) &&
+ (buf[2] != '\0' &&
+ !t_iseq(&buf[2], '-') &&
+ !t_iseq(&buf[2], '_') &&
+ !t_isalpha(&buf[2]) &&
+ !t_isdigit(&buf[2]));
+}
+
/*
* token types for parsing
*/
@@ -220,8 +241,8 @@ typedef enum
*/
static ts_tokentype
gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+ int8 *operator, int *lenval, char **strval,
+ int16 *weight, bool *prefix, bool isweb)
{
*weight = 0;
*prefix = false;
@@ -232,7 +253,8 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if (t_iseq(state->buf, '!') ||
+ (isweb && t_iseq(state->buf, '-')))
{
(state->buf)++; /* can safely ++, t_iseq guarantee that
* pg_mblen()==1 */
@@ -249,11 +271,55 @@ gettoken_query(TSQueryParserState state,
}
else if (t_iseq(state->buf, ':'))
{
+ if (isweb)
+ {
+ /* it doesn't mean anything */
+ (state->buf)++;
+ continue;
+ }
+
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (isweb && t_iseq(state->buf, ')'))
+ {
+ if (state->count == 0)
+ {
+ /* web search tolerates useless closing parentheses */
+ (state->buf)++;
+ continue;
+ }
+ (state->buf)++;
+ state->count--;
+ return PT_CLOSE;
+ }
+ else if (isweb &&
+ (t_iseq(state->buf, '&') ||
+ t_iseq(state->buf, '|') ||
+ t_iseq(state->buf, '<')))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ (state->buf)++;
+ continue;
+ }
+ else if (isweb && t_iseq(state->buf, '"'))
+ {
+ /* quoted text should be ordered (<->) */
+ char *quote = strchr(state->buf + 1, '"');
+ if (quote == NULL)
+ {
+ /* web search tolerates missing quotes too */
+ state->buf++;
+ continue;
+ }
+ *strval = state->buf + 1;
+ *lenval = quote - *strval;
+ state->buf = quote + 1;
+ state->state = INQUOTES;
+ return PT_VAL;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -269,6 +335,16 @@ gettoken_query(TSQueryParserState state,
}
else if (state->state == WAITFIRSTOPERAND)
return PT_END;
+ else if (isweb)
+ {
+ if (state->count > 0)
+ /* decrement per each parentheses level (see PT_OPEN) */
+ state->count--;
+ else
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -291,26 +367,61 @@ gettoken_query(TSQueryParserState state,
(state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '<'))
+ else if (isweb && parse_or_operator(state->buf))
{
state->state = WAITOPERAND;
- *operator = OP_PHRASE;
- /* weight var is used as storage for distance */
- state->buf = parse_phrase_operator(state->buf, weight);
- if (*weight < 0)
+ *operator = OP_OR;
+ (state->buf) += 2; /* strlen("OR") */
+ return PT_OPR;
+ }
+ else if (t_iseq(state->buf, '<'))
+ {
+ int16 distance;
+ char *phrase = parse_phrase_operator(state->buf, &distance);
+ if (distance < 0)
+ {
+ if (isweb)
+ {
+ /* web search tolerates broken phrase operator */
+ (state->buf)++;
+ continue;
+ }
return PT_ERR;
+ }
+ state->buf = phrase;
+ *operator = OP_PHRASE;
+ *weight = distance; /* weight var is used as storage for distance */
+ state->state = WAITOPERAND;
return PT_OPR;
}
else if (t_iseq(state->buf, ')'))
{
+ if (isweb && state->count == 0)
+ {
+ /* web search tolerates useless closing parentheses */
+ (state->buf)++;
+ continue;
+ }
(state->buf)++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
else if (*(state->buf) == '\0')
- return (state->count) ? PT_ERR : PT_END;
+ {
+ /* web search tolerates unexpected end of line */
+ return (!isweb && state->count) ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
+ if (isweb)
+ {
+ /* put implicit AND if there's no operator */
+ *operator = OP_AND;
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
return PT_ERR;
+ }
break;
case WAITSINGLEOPERAND:
if (*(state->buf) == '\0')
@@ -320,9 +431,10 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
- default:
- return PT_ERR;
- break;
+ case INQUOTES:
+ /* phrase should be followed by an operator */
+ state->state = WAITOPERATOR;
+ continue;
}
state->buf += pg_mblen(state->buf);
}
@@ -475,7 +587,8 @@ cleanOpStack(TSQueryParserState state,
static void
makepol(TSQueryParserState state,
PushFunction pushval,
- Datum opaque)
+ Datum opaque,
+ bool isweb)
{
int8 operator = 0;
ts_tokentype type;
@@ -489,19 +602,21 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = gettoken_query(state, &operator, &lenval, &strval,
+ &weight, &prefix, isweb)) != PT_END)
{
switch (type)
{
case PT_VAL:
- pushval(opaque, state, strval, lenval, weight, prefix);
+ pushval(opaque, state, strval, lenval, weight, prefix,
+ state->state == INQUOTES /* force phrase operator */);
break;
case PT_OPR:
cleanOpStack(state, opstack, &lenstack, operator);
pushOpStack(opstack, &lenstack, operator, weight);
break;
case PT_OPEN:
- makepol(state, pushval, opaque);
+ makepol(state, pushval, opaque, isweb);
break;
case PT_CLOSE:
cleanOpStack(state, opstack, &lenstack, OP_OR /* lowest */ );
@@ -605,7 +720,8 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ bool isplain,
+ bool isweb)
{
struct TSQueryParserStateData state;
int i;
@@ -623,7 +739,7 @@ parse_tsquery(char *buf,
state.polstr = NIL;
/* init value parser's state */
- state.valstate = init_tsvector_parser(state.buffer, true, true);
+ state.valstate = init_tsvector_parser(state.buffer, true, true, isweb);
/* init list of operand */
state.sumlen = 0;
@@ -632,7 +748,7 @@ parse_tsquery(char *buf,
*(state.curop) = '\0';
/* parse query & make polish notation (postfix, but in reverse order) */
- makepol(&state, pushval, opaque);
+ makepol(&state, pushval, opaque, isweb);
close_tsvector_parser(state.valstate);
@@ -703,7 +819,7 @@ parse_tsquery(char *buf,
static void
pushval_asis(Datum opaque, TSQueryParserState state, char *strval, int lenval,
- int16 weight, bool prefix)
+ int16 weight, bool prefix, bool isphrase)
{
pushValue(state, strval, lenval, weight, prefix);
}
@@ -716,7 +832,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false, false));
}
/*
diff --git a/src/backend/utils/adt/tsvector.c b/src/backend/utils/adt/tsvector.c
index 64e02ef434..395a326513 100644
--- a/src/backend/utils/adt/tsvector.c
+++ b/src/backend/utils/adt/tsvector.c
@@ -200,7 +200,7 @@ tsvectorin(PG_FUNCTION_ARGS)
char *cur;
int buflen = 256; /* allocated size of tmpbuf */
- state = init_tsvector_parser(buf, false, false);
+ state = init_tsvector_parser(buf, false, false, false);
arrlen = 64;
arr = (WordEntryIN *) palloc(sizeof(WordEntryIN) * arrlen);
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 7367ba6a40..6994d832d1 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -33,8 +33,22 @@ struct TSVectorParseStateData
int eml; /* max bytes per character */
bool oprisdelim; /* treat ! | * ( ) as delimiters? */
bool is_tsquery; /* say "tsquery" not "tsvector" in errors? */
+ bool ignore_errors; /* ignore errors and log them as warnings */
};
+/* State codes used in gettoken_tsvector */
+typedef enum
+{
+ WAITWORD = 1,
+ WAITENDWORD = 2,
+ WAITNEXTCHAR = 3,
+ WAITENDCMPLX = 4,
+ WAITPOSINFO = 5,
+ INPOSINFO = 6,
+ WAITPOSDELIM = 7,
+ WAITCHARCMPLX = 8
+} VectorParseStates;
+
/*
* Initializes parser for the input string. If oprisdelim is set, the
@@ -42,7 +56,7 @@ struct TSVectorParseStateData
* ! | & ( )
*/
TSVectorParseState
-init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
+init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery, bool ignore_errors)
{
TSVectorParseState state;
@@ -54,6 +68,7 @@ init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
state->eml = pg_database_encoding_max_length();
state->oprisdelim = oprisdelim;
state->is_tsquery = is_tsquery;
+ state->ignore_errors = ignore_errors;
return state;
}
@@ -119,27 +134,27 @@ do { \
return true; \
} while(0)
-
-/* State codes used in gettoken_tsvector */
-#define WAITWORD 1
-#define WAITENDWORD 2
-#define WAITNEXTCHAR 3
-#define WAITENDCMPLX 4
-#define WAITPOSINFO 5
-#define INPOSINFO 6
-#define WAITPOSDELIM 7
-#define WAITCHARCMPLX 8
-
#define PRSSYNTAXERROR prssyntaxerror(state)
static void
prssyntaxerror(TSVectorParseState state)
{
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- state->is_tsquery ?
- errmsg("syntax error in tsquery: \"%s\"", state->bufstart) :
- errmsg("syntax error in tsvector: \"%s\"", state->bufstart)));
+ if (state->ignore_errors)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ state->is_tsquery ?
+ errmsg("syntax error in tsquery: \"%s\"", state->bufstart) :
+ errmsg("syntax error in tsvector: \"%s\"", state->bufstart)));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ state->is_tsquery ?
+ errmsg("syntax error in tsquery: \"%s\"", state->bufstart) :
+ errmsg("syntax error in tsvector: \"%s\"", state->bufstart)));
+ }
}
@@ -165,9 +180,9 @@ gettoken_tsvector(TSVectorParseState state,
WordEntryPos **pos_ptr, int *poslen,
char **endptr)
{
- int oldstate = 0;
+ VectorParseStates oldstate = 0;
char *curpos = state->word;
- int statecode = WAITWORD;
+ VectorParseStates statecode = WAITWORD;
/*
* pos is for collecting the comma delimited list of positions followed by
@@ -202,10 +217,23 @@ gettoken_tsvector(TSVectorParseState state,
else if (statecode == WAITNEXTCHAR)
{
if (*(state->prsbuf) == '\0')
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("there is no escaped character: \"%s\"",
- state->bufstart)));
+ {
+ if (state->ignore_errors)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("there is no escaped character: \"%s\"",
+ state->bufstart)));
+ return false;
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("there is no escaped character: \"%s\"",
+ state->bufstart)));
+ }
+ }
else
{
RESIZEPRSBUF;
@@ -260,7 +288,15 @@ gettoken_tsvector(TSVectorParseState state,
oldstate = WAITENDCMPLX;
}
else if (*(state->prsbuf) == '\0')
+ {
+ if (state->ignore_errors)
+ {
+ /* Parse as there is a closing quote character in the end */
+ statecode = WAITCHARCMPLX;
+ continue;
+ }
PRSSYNTAXERROR;
+ }
else
{
RESIZEPRSBUF;
@@ -319,10 +355,22 @@ gettoken_tsvector(TSVectorParseState state,
WEP_SETPOS(pos[npos - 1], LIMITPOS(atoi(state->prsbuf)));
/* we cannot get here in tsquery, so no need for 2 errmsgs */
if (WEP_GETPOS(pos[npos - 1]) == 0)
- ereport(ERROR,
- (errcode(ERRCODE_SYNTAX_ERROR),
- errmsg("wrong position info in tsvector: \"%s\"",
- state->bufstart)));
+ {
+ if (state->ignore_errors)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong position info in tsvector: \"%s\"",
+ state->bufstart)));
+ }
+ else
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("wrong position info in tsvector: \"%s\"",
+ state->bufstart)));
+ }
+ }
WEP_SETWEIGHT(pos[npos - 1], 0);
statecode = WAITPOSDELIM;
}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 90d994c71a..560416636b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4966,6 +4966,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4974,6 +4976,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..2e805bfeb0 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -27,7 +27,8 @@ typedef struct TSVectorParseStateData *TSVectorParseState;
extern TSVectorParseState init_tsvector_parser(char *input,
bool oprisdelim,
- bool is_tsquery);
+ bool is_tsquery,
+ bool ignore_errors);
extern void reset_tsvector_parser(TSVectorParseState state, char *input);
extern bool gettoken_tsvector(TSVectorParseState state,
char **token, int *len,
@@ -44,11 +45,12 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
char *token, int tokenlen,
int16 tokenweights, /* bitmap as described in
* QueryOperand struct */
- bool prefix);
+ bool prefix,
+ bool isphrase);
extern TSQuery parse_tsquery(char *buf,
PushFunction pushval,
- Datum opaque, bool isplain);
+ Datum opaque, bool isplain, bool isweb);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..d5d1bdda7f 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,455 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('()');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('(())');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('()()()');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('abc ()');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('() abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc & ()');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('() & abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('''');
+WARNING: syntax error in tsquery: "'"
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('\');
+WARNING: there is no escaped character: "\"
+NOTICE: text-search query doesn't contain lexemes: "\"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('\\');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('ab '' abc');
+ websearch_to_tsquery
+----------------------
+ 'ab' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('ab '' abc''');
+ websearch_to_tsquery
+----------------------
+ 'ab' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('ab \ abc');
+ websearch_to_tsquery
+----------------------
+ 'ab' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('ab \\ abc');
+ websearch_to_tsquery
+----------------------
+ 'ab' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('(');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('((');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('((( )) abc or def');
+ websearch_to_tsquery
+----------------------
+ 'abc' | 'def'
+(1 row)
+
+select websearch_to_tsquery('))');
+NOTICE: text-search query doesn't contain lexemes: "))"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery(')');
+NOTICE: text-search query doesn't contain lexemes: ")"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery(')(');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('& )( |');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('abc )( def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('abc | )( & def');
+ websearch_to_tsquery
+----------------------
+ 'abc' | 'def'
+(1 row)
+
+select websearch_to_tsquery('& abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc &');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('| abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc |');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('< abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('abc <');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('a:::b');
+ websearch_to_tsquery
+----------------------
+ 'b'
+(1 row)
+
+select websearch_to_tsquery('My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand:B "new -smartphone"');
+ websearch_to_tsquery
+-----------------------------------
+ 'brand':B & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My brand:Z "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------------
+ 'brand' & 'z' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My & (brand ("new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('My & (brand ("new) -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & ( 'cat' | 'rat' )
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or��������');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or��������'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+----------------------------------------
+ 'fat' <-> 'cat' & ( 'eaten' | !'rat' )
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -!-fine, "dear friend" OR good');
+ websearch_to_tsquery
+------------------------------------------
+ !!!'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+--------------------------
+ 'old' <-> 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"�������������� ����������" ���������� ����������');
+ websearch_to_tsquery
+--------------------------------------
+ '����������' <-> '��������' & '��������' & '��������'
+(1 row)
+
+select to_tsvector('russian', '���������� �������������� ���������� ����������') @@
+websearch_to_tsquery('russian', '"�������������� ����������" ���������� ����������');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', '���������� �������������� ���������� ���������� ����������') @@
+websearch_to_tsquery('russian', '"�������������� ����������" ���������� ����������');
+ ?column?
+----------
+ f
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..c7976eb87c 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,95 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('()');
+select websearch_to_tsquery('(())');
+select websearch_to_tsquery('()()()');
+select websearch_to_tsquery('abc ()');
+select websearch_to_tsquery('() abc');
+select websearch_to_tsquery('abc & ()');
+select websearch_to_tsquery('() & abc');
+select websearch_to_tsquery('''');
+select websearch_to_tsquery('\');
+select websearch_to_tsquery('\\');
+select websearch_to_tsquery('ab '' abc');
+select websearch_to_tsquery('ab '' abc''');
+select websearch_to_tsquery('ab \ abc');
+select websearch_to_tsquery('ab \\ abc');
+
+select websearch_to_tsquery('(');
+select websearch_to_tsquery('((');
+select websearch_to_tsquery('((( )) abc or def');
+select websearch_to_tsquery('))');
+select websearch_to_tsquery(')');
+
+select websearch_to_tsquery(')(');
+select websearch_to_tsquery('& )( |');
+select websearch_to_tsquery('abc )( def');
+select websearch_to_tsquery('abc | )( & def');
+
+select websearch_to_tsquery('& abc');
+select websearch_to_tsquery('abc &');
+select websearch_to_tsquery('| abc');
+select websearch_to_tsquery('abc |');
+select websearch_to_tsquery('< abc');
+select websearch_to_tsquery('abc <');
+select websearch_to_tsquery('a:::b');
+
+select websearch_to_tsquery('My brand new smartphone');
+select websearch_to_tsquery('My brand "new smartphone"');
+select websearch_to_tsquery('My brand "new -smartphone"');
+select websearch_to_tsquery('My brand:B "new -smartphone"');
+select websearch_to_tsquery('My brand:Z "new -smartphone"');
+select websearch_to_tsquery('My & (brand ("new -smartphone"');
+select websearch_to_tsquery('My & (brand ("new) -smartphone"');
+
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc or��������');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+select websearch_to_tsquery('simple', 'abc or');
+
+select websearch_to_tsquery('simple', 'or OR or');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -!-fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"�������������� ����������" ���������� ����������');
+
+select to_tsvector('russian', '���������� �������������� ���������� ����������') @@
+websearch_to_tsquery('russian', '"�������������� ����������" ���������� ����������');
+
+select to_tsvector('russian', '���������� �������������� ���������� ���������� ����������') @@
+websearch_to_tsquery('russian', '"�������������� ����������" ���������� ����������');
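
The "OR is an operator here ... but not here" tests above exercise the boundary rule in parse_or_operator(). A minimal sketch of that rule follows, assuming ASCII-only input. The patch itself uses the multibyte-aware t_iseq(), t_isalpha() and t_isdigit(), and the name looks_like_or_operator() is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Simplified, ASCII-only stand-in for the patch's parse_or_operator().
 * "or" / "OR" counts as an operator only when it is followed by a
 * character that cannot continue a word: not end of string, not '-',
 * not '_', not a letter and not a digit.
 */
static bool
looks_like_or_operator(const char *buf)
{
	unsigned char next;

	if (buf[0] != 'o' && buf[0] != 'O')
		return false;
	if (buf[1] != 'r' && buf[1] != 'R')
		return false;

	next = (unsigned char) buf[2];
	return next != '\0' &&
		next != '-' &&
		next != '_' &&
		!isalpha(next) &&
		!isdigit(next);
}
```

In the patch this check runs only in the WAITOPERATOR state, which is why a leading 'OR rat' still yields 'or' & 'rat' in the expected output: no operand has been seen yet, so the operator check is never reached.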
Hi Aleksandr,
I agree with Aleksander about silencing all errors in
websearch_to_tsquery(). In the attachment is a revised patch that attempts
to introduce the ability to ignore syntax errors in gettoken_tsvector().
Thanks for the further improvements! Yes, you're both right, the API has
to be consistent. Unfortunately, I had to make some adjustments
according to Oleg Bartunov's review. Here's a change log:
1. &, |, (), and <-> are no longer considered operators in web search
mode.
2. I've stumbled upon a bug: web search used to transform "pg_class"
into 'pg <-> class', which is no longer the case.
3. I changed the behavior of gettoken_tsvector() as soon as I had heard
from Aleksander Alekseev, so I decided to use my implementation in this
revision of the patch. This is a good subject for discussion, though.
Feel free to share your opinion.
4. As suggested by Theodor, I've replaced some bool args with bit flags.
The name of the enum ts_parsestate looks more like the name of a function
than the name of a type.
In my version, it is renamed to QueryParserState, but you can fix it if
I'm wrong.
True, but gettoken_query() returns ts_tokentype, so I decided to use
this naming scheme.
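
To illustrate item 4 of the change log, here is a minimal sketch of how a single flags argument can replace the separate bool arguments of parse_tsquery(). The names P_TSQ_PLAIN and P_TSQ_WEB come from the patch. Their numeric values and the helpers flag_is_set() and parser_mode() are assumptions made only for this example.

```c
#include <stdbool.h>

/*
 * Hypothetical values: the patch names these flags, but their
 * definitions are not shown in this part of the thread.
 */
#define P_TSQ_PLAIN	(1 << 0)
#define P_TSQ_WEB	(1 << 1)

/* Hypothetical helper: test whether one flag bit is set. */
static bool
flag_is_set(int flags, int flag)
{
	return (flags & flag) != 0;
}

/*
 * Hypothetical helper: name the mode a flags value selects.
 * 0 corresponds to to_tsquery(), which passes no flags.
 */
static const char *
parser_mode(int flags)
{
	if (flag_is_set(flags, P_TSQ_WEB))
		return "web";
	if (flag_is_set(flags, P_TSQ_PLAIN))
		return "plain";
	return "standard";
}
```

With flags, a later mode needs only a new bit rather than another positional bool in every caller of parse_tsquery().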
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
websearch_to_tsquery_v4.diff (text/x-diff)
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- false);
+ 0);
PG_RETURN_TSQUERY(query);
}
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_POINTER(query);
}
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +567,35 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ P_TSQ_WEB);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..695bdb89e9 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,14 +32,27 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3,
+ WAITSINGLEOPERAND = 4
+} ts_parserstate;
+
struct TSQueryParserStateData
{
/* State for gettoken_query */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
int count; /* nesting count, incremented by (,
* decremented by ) */
+ bool in_quotes; /* phrase in quotes "" */
+ bool is_web; /* is it a web search? */
+ ts_parserstate state;
/* polish (prefix) notation in list, filled in by push* functions */
List *polstr;
@@ -57,12 +70,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -197,6 +204,26 @@ err:
return buf;
}
+/*
+ * Parse OR operator used in websearch_to_tsquery().
+ */
+static bool
+parse_or_operator(TSQueryParserState state)
+{
+ char *buf = state->buf;
+
+ if (state->in_quotes)
+ return false;
+
+ return (t_iseq(&buf[0], 'o') || t_iseq(&buf[0], 'O')) &&
+ (t_iseq(&buf[1], 'r') || t_iseq(&buf[1], 'R')) &&
+ (buf[2] != '\0' &&
+ !t_iseq(&buf[2], '-') &&
+ !t_iseq(&buf[2], '_') &&
+ !t_isalpha(&buf[2]) &&
+ !t_isdigit(&buf[2]));
+}
+
/*
* token types for parsing
*/
@@ -219,10 +246,12 @@ typedef enum
*
*/
static ts_tokentype
-gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+gettoken_query(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
{
+ bool is_web = state->is_web;
+
*weight = 0;
*prefix = false;
@@ -232,28 +261,63 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if ((! is_web && t_iseq(state->buf, '!')) ||
+ (is_web && t_iseq(state->buf, '-')))
{
- (state->buf)++; /* can safely ++, t_iseq guarantee that
- * pg_mblen()==1 */
+ state->buf++;
+
+ if (state->in_quotes)
+ continue;
+
*operator = OP_NOT;
state->state = WAITOPERAND;
return PT_OPR;
}
else if (t_iseq(state->buf, '('))
{
+ state->buf++;
+
+ if (is_web)
+ continue;
+
state->count++;
- (state->buf)++;
state->state = WAITOPERAND;
return PT_OPEN;
}
else if (t_iseq(state->buf, ':'))
{
+ state->buf++;
+
+ if (is_web)
+ continue;
+
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (is_web && t_iseq(state->buf, '"'))
+ {
+ state->buf++;
+
+ /* web search tolerates missing quotes */
+ if (!state->in_quotes && strchr(state->buf, '"'))
+ {
+ /* quoted text should be ordered <-> */
+ state->in_quotes = true;
+ state->state = WAITOPERAND;
+ }
+ else
+ state->in_quotes = false;
+
+ continue;
+ }
+ else if (is_web && ISOPERATOR(state->buf))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ state->buf++;
+ continue;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -263,12 +327,22 @@ gettoken_query(TSQueryParserState state,
reset_tsvector_parser(state->valstate, state->buf);
if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
{
- state->buf = get_modifiers(state->buf, weight, prefix);
+ if (!is_web)
+ {
+ /* web search does not support weights */
+ state->buf = get_modifiers(state->buf, weight, prefix);
+ }
state->state = WAITOPERATOR;
return PT_VAL;
}
else if (state->state == WAITFIRSTOPERAND)
return PT_END;
+ else if (is_web)
+ {
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -277,40 +351,95 @@ gettoken_query(TSQueryParserState state,
}
break;
case WAITOPERATOR:
- if (t_iseq(state->buf, '&'))
+ if (! is_web && t_iseq(state->buf, '&'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_AND;
- (state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '|'))
+ else if (! is_web && t_iseq(state->buf, '|'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_OR;
- (state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '<'))
+ else if (! is_web && t_iseq(state->buf, '<'))
{
- state->state = WAITOPERAND;
- *operator = OP_PHRASE;
/* weight var is used as storage for distance */
state->buf = parse_phrase_operator(state->buf, weight);
+ state->state = WAITOPERAND;
+ *operator = OP_PHRASE;
if (*weight < 0)
return PT_ERR;
return PT_OPR;
}
- else if (t_iseq(state->buf, ')'))
+ else if (! is_web && t_iseq(state->buf, ')'))
{
- (state->buf)++;
+ state->buf++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
+ else if (is_web && t_iseq(state->buf, '"'))
+ {
+ state->buf++;
+
+ /* web search tolerates missing quotes */
+ if (!state->in_quotes && strchr(state->buf, '"'))
+ {
+ /* quoted text should be ordered <-> */
+ state->in_quotes = true;
+ state->state = WAITOPERAND;
+
+ /* put implicit AND after an operand */
+ *operator = OP_AND;
+ return PT_OPR;
+ }
+ else
+ state->in_quotes = false;
+
+ continue;
+ }
+ else if (is_web && parse_or_operator(state))
+ {
+ state->buf += 2; /* strlen("OR") */
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ return PT_OPR;
+ }
+ else if (is_web && ISOPERATOR(state->buf))
+ {
+ /* just skip disabled operators */
+ state->buf++;
+ continue;
+ }
else if (*(state->buf) == '\0')
- return (state->count) ? PT_ERR : PT_END;
+ {
+ /* web search tolerates unexpected end of line */
+ return (!is_web && state->count) ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
+ if (is_web)
+ {
+ if (state->in_quotes)
+ {
+ /* put implicit <-> after an operand */
+ *operator = OP_PHRASE;
+ *weight = 1;
+ }
+ else
+ {
+ /* put implicit AND after an operand */
+ *operator = OP_AND;
+ }
+
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
return PT_ERR;
+ }
break;
case WAITSINGLEOPERAND:
if (*(state->buf) == '\0')
@@ -320,9 +449,6 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
- default:
- return PT_ERR;
- break;
}
state->buf += pg_mblen(state->buf);
}
@@ -605,7 +731,7 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ int flags)
{
struct TSQueryParserStateData state;
int i;
@@ -613,17 +739,28 @@ parse_tsquery(char *buf,
int commonlen;
QueryItem *ptr;
ListCell *cell;
- bool needcleanup;
+ bool needcleanup,
+ is_plain,
+ is_web;
+ int tsv_flags = P_TSV_OPR_IS_DELIM | P_TSV_IS_TSQUERY;
+
+ is_plain = (flags & P_TSQ_PLAIN) != 0;
+ is_web = (flags & P_TSQ_WEB) != 0;
+
+ if (is_web)
+ tsv_flags |= P_TSV_IS_WEB;
/* init state */
state.buffer = buf;
state.buf = buf;
- state.state = (isplain) ? WAITSINGLEOPERAND : WAITFIRSTOPERAND;
state.count = 0;
+ state.in_quotes = false;
+ state.is_web = is_web;
+ state.state = is_plain ? WAITSINGLEOPERAND : WAITFIRSTOPERAND;
state.polstr = NIL;
/* init value parser's state */
- state.valstate = init_tsvector_parser(state.buffer, true, true);
+ state.valstate = init_tsvector_parser(state.buffer, tsv_flags);
/* init list of operand */
state.sumlen = 0;
@@ -716,7 +853,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), 0));
}
/*
diff --git a/src/backend/utils/adt/tsvector.c b/src/backend/utils/adt/tsvector.c
index 64e02ef434..7a27bd12a3 100644
--- a/src/backend/utils/adt/tsvector.c
+++ b/src/backend/utils/adt/tsvector.c
@@ -200,7 +200,7 @@ tsvectorin(PG_FUNCTION_ARGS)
char *cur;
int buflen = 256; /* allocated size of tmpbuf */
- state = init_tsvector_parser(buf, false, false);
+ state = init_tsvector_parser(buf, 0);
arrlen = 64;
arr = (WordEntryIN *) palloc(sizeof(WordEntryIN) * arrlen);
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 7367ba6a40..fed411a842 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -33,6 +33,7 @@ struct TSVectorParseStateData
int eml; /* max bytes per character */
bool oprisdelim; /* treat ! | * ( ) as delimiters? */
bool is_tsquery; /* say "tsquery" not "tsvector" in errors? */
+ bool is_web; /* we're in websearch_to_tsquery() */
};
@@ -42,7 +43,7 @@ struct TSVectorParseStateData
* ! | & ( )
*/
TSVectorParseState
-init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
+init_tsvector_parser(char *input, int flags)
{
TSVectorParseState state;
@@ -52,8 +53,9 @@ init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
state->len = 32;
state->word = (char *) palloc(state->len);
state->eml = pg_database_encoding_max_length();
- state->oprisdelim = oprisdelim;
- state->is_tsquery = is_tsquery;
+ state->oprisdelim = (flags & P_TSV_OPR_IS_DELIM) != 0;
+ state->is_tsquery = (flags & P_TSV_IS_TSQUERY) != 0;
+ state->is_web = (flags & P_TSV_IS_WEB) != 0;
return state;
}
@@ -89,16 +91,6 @@ do { \
} \
} while (0)
-/* phrase operator begins with '<' */
-#define ISOPERATOR(x) \
- ( pg_mblen(x) == 1 && ( *(x) == '!' || \
- *(x) == '&' || \
- *(x) == '|' || \
- *(x) == '(' || \
- *(x) == ')' || \
- *(x) == '<' \
- ) )
-
/* Fills gettoken_tsvector's output parameters, and returns true */
#define RETURN_TOKEN \
do { \
@@ -183,14 +175,15 @@ gettoken_tsvector(TSVectorParseState state,
{
if (*(state->prsbuf) == '\0')
return false;
- else if (t_iseq(state->prsbuf, '\''))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\''))
statecode = WAITENDCMPLX;
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
- else if (state->oprisdelim && ISOPERATOR(state->prsbuf))
+ else if ((state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
PRSSYNTAXERROR;
else if (!t_isspace(state->prsbuf))
{
@@ -217,13 +210,14 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDWORD)
{
- if (t_iseq(state->prsbuf, '\\'))
+ if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
else if (t_isspace(state->prsbuf) || *(state->prsbuf) == '\0' ||
- (state->oprisdelim && ISOPERATOR(state->prsbuf)))
+ (state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
{
RESIZEPRSBUF;
if (curpos == state->word)
@@ -250,11 +244,11 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
statecode = WAITCHARCMPLX;
}
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDCMPLX;
@@ -270,7 +264,7 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITCHARCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
RESIZEPRSBUF;
COPYCHAR(curpos, state->prsbuf);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index bfc90098f8..00f1a85ae7 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4950,6 +4950,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4958,6 +4960,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..73e969fe9c 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -25,9 +25,11 @@
struct TSVectorParseStateData; /* opaque struct in tsvector_parser.c */
typedef struct TSVectorParseStateData *TSVectorParseState;
-extern TSVectorParseState init_tsvector_parser(char *input,
- bool oprisdelim,
- bool is_tsquery);
+#define P_TSV_OPR_IS_DELIM (1 << 0)
+#define P_TSV_IS_TSQUERY (1 << 1)
+#define P_TSV_IS_WEB (1 << 2)
+
+extern TSVectorParseState init_tsvector_parser(char *input, int flags);
extern void reset_tsvector_parser(TSVectorParseState state, char *input);
extern bool gettoken_tsvector(TSVectorParseState state,
char **token, int *len,
@@ -35,6 +37,16 @@ extern bool gettoken_tsvector(TSVectorParseState state,
char **endptr);
extern void close_tsvector_parser(TSVectorParseState state);
+/* phrase operator begins with '<' */
+#define ISOPERATOR(x) \
+ ( pg_mblen(x) == 1 && ( *(x) == '!' || \
+ *(x) == '&' || \
+ *(x) == '|' || \
+ *(x) == '(' || \
+ *(x) == ')' || \
+ *(x) == '<' \
+ ) )
+
/* parse_tsquery */
struct TSQueryParserStateData; /* private in backend/utils/adt/tsquery.c */
@@ -46,9 +58,13 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
* QueryOperand struct */
bool prefix);
+#define P_TSQ_PLAIN (1 << 0)
+#define P_TSQ_WEB (1 << 1)
+
extern TSQuery parse_tsquery(char *buf,
- PushFunction pushval,
- Datum opaque, bool isplain);
+ PushFunction pushval,
+ Datum opaque,
+ int flags);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..2b1da308df 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,325 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+ websearch_to_tsquery
+---------------------------------------------
+ 'i' & 'have' & 'a' & 'fat' & 'abcd' & 'cat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+ websearch_to_tsquery
+-----------------------
+ 'orange' & 'aabbccdd'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+ websearch_to_tsquery
+-----------------------------------------
+ 'fat' & 'a' & 'cat' & 'b' & 'rat' & 'c'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & 'a' & 'cat' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc : def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'a:::b');
+ websearch_to_tsquery
+----------------------
+ 'a' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', ':');
+NOTICE: text-search query doesn't contain lexemes: ":"
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand:B "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------------
+ 'brand' & 'b' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+-----------------------
+ 'fat' & 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' | 'fat' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc orтест');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orтест'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' & 'eaten' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+ websearch_to_tsquery
+----------------------------------------
+ !'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+------------------------
+ 'old' & 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('''abc''''def''');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('\abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('\');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..da8d089100 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,75 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+
+select websearch_to_tsquery('simple', 'abc : def');
+select websearch_to_tsquery('simple', 'abc:def');
+select websearch_to_tsquery('simple', 'a:::b');
+select websearch_to_tsquery('simple', ':');
+
+select websearch_to_tsquery('english', 'My brand new smartphone');
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+select websearch_to_tsquery('english', 'My brand:B "new -smartphone"');
+
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc orтест');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+select websearch_to_tsquery('simple', 'abc or');
+
+select websearch_to_tsquery('simple', 'or OR or');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+select websearch_to_tsquery('''abc''''def''');
+select websearch_to_tsquery('\abc');
+select websearch_to_tsquery('\');
I've fixed a bug and added some tests and documentation.
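For readers skimming the tests, the rule that decides whether "or" acts as an operator in web search mode can be sketched roughly as follows. This is a simplified, ASCII-only stand-in for parse_or_operator() from the patch: the name looks_like_or_operator() is hypothetical, and it uses <ctype.h> instead of the multibyte-aware t_iseq(), t_isalpha() and t_isdigit(), so it will not treat non-ASCII letters (as in 'abc orтест') the way the real function does.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical ASCII-only sketch of the patch's parse_or_operator().
 *
 * "or" (in any letter case) counts as an operator only when it is outside
 * double quotes, is not the last thing in the input, and is not followed by
 * a character that could continue a word ('-', '_', a letter or a digit).
 */
static bool
looks_like_or_operator(const char *buf, bool in_quotes)
{
	if (in_quotes)
		return false;

	/* the early returns also keep us from reading past the terminator */
	if (tolower((unsigned char) buf[0]) != 'o')
		return false;
	if (tolower((unsigned char) buf[1]) != 'r')
		return false;

	/* a trailing "or" is an ordinary word */
	if (buf[2] == '\0')
		return false;

	return buf[2] != '-' &&
		buf[2] != '_' &&
		!isalnum((unsigned char) buf[2]);
}
```

Under these assumptions, "or rat" and "or(rat" are recognized as operators, while "orange", "OR1234", "or-abc", "OR_abc" and a bare trailing "or" are not, in line with the "OR is an operator here ... but not here" groups in the regression tests.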
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
websearch_to_tsquery_v5.diff (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5abb1c46fb..c3b7be6e4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry>
<entry><literal>'fat' <-> 'rat'</literal></entry>
</row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>websearch_to_tsquery</primary>
+ </indexterm>
+ <literal><function>websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type> , </optional> <replaceable class="parameter">query</replaceable> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</type> from a web search style query</entry>
+ <entry><literal>websearch_to_tsquery('english', '"fat rat" or rat')</literal></entry>
+ <entry><literal>'fat' <-> 'rat' | 'rat'</literal></entry>
+ </row>
<row>
<entry>
<indexterm>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
- <function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ <function>phraseto_tsquery</function>, but it is less forgiving about its
+ input. <function>websearch_to_tsquery</function> is a simplified version
+ of <function>to_tsquery</function> with an alternative syntax, similar
+ to the one used by web search engines.
</para>
<indexterm>
@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
</screen>
</para>
+<synopsis>
+websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
+</synopsis>
+
+ <para>
+ <function>websearch_to_tsquery</function> creates a <type>tsquery</type>
+ value from <replaceable>querytext</replaceable> using an alternative
+ syntax in which simple unformatted text is a valid query.
+ Unlike <function>plainto_tsquery</function>
+ and <function>phraseto_tsquery</function>, it also recognizes certain
+ operators. Moreover, this function should never raise syntax errors,
+ which makes it possible to use raw user-supplied input for search.
+ The following syntax is supported:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <literal>unquoted text</literal>: text not inside quote marks will be
+ converted to terms separated by <literal>&</literal> operators, as
+ if processed by
+ <function>plainto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>"quoted text"</literal>: text inside quote marks will be
+ converted to terms separated by <literal><-></literal>
+ operators, as if processed by <function>phraseto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>OR</literal>: logical or will be converted to
+ the <literal>|</literal> operator.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>-</literal>: the logical not operator, converted to
+ the <literal>!</literal> operator.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Examples:
+ <screen>
+ select websearch_to_tsquery('english', 'The fat rats');
+ websearch_to_tsquery
+ ----------------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+ -----------------------------------
+ 'sad' <-> 'cat' | 'fat' <-> 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', 'signal -"segmentation fault"');
+ websearch_to_tsquery
+ ---------------------------------------
+ 'signal' & !( 'segment' <-> 'fault' )
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '""" )( dummy \\ query <->');
+ websearch_to_tsquery
+ ----------------------
+ 'dummi' & 'queri'
+ (1 row)
+ </screen>
+ </para>
</sect2>
<sect2 id="textsearch-ranking">
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- false);
+ 0);
PG_RETURN_TSQUERY(query);
}
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_POINTER(query);
}
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +567,35 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ P_TSQ_WEB);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..5657aae7cc 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,14 +32,27 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3,
+ WAITSINGLEOPERAND = 4
+} ts_parserstate;
+
struct TSQueryParserStateData
{
/* State for gettoken_query */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
int count; /* nesting count, incremented by (,
* decremented by ) */
+ bool in_quotes; /* phrase in quotes "" */
+ bool is_web; /* is it a web search? */
+ ts_parserstate state;
/* polish (prefix) notation in list, filled in by push* functions */
List *polstr;
@@ -57,12 +70,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -197,6 +204,26 @@ err:
return buf;
}
+/*
+ * Parse OR operator used in websearch_to_tsquery().
+ */
+static bool
+parse_or_operator(TSQueryParserState state)
+{
+ char *buf = state->buf;
+
+ if (state->in_quotes)
+ return false;
+
+ return (t_iseq(&buf[0], 'o') || t_iseq(&buf[0], 'O')) &&
+ (t_iseq(&buf[1], 'r') || t_iseq(&buf[1], 'R')) &&
+ (buf[2] != '\0' &&
+ !t_iseq(&buf[2], '-') &&
+ !t_iseq(&buf[2], '_') &&
+ !t_isalpha(&buf[2]) &&
+ !t_isdigit(&buf[2]));
+}
+
/*
* token types for parsing
*/
@@ -219,10 +246,12 @@ typedef enum
*
*/
static ts_tokentype
-gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+gettoken_query(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
{
+ bool is_web = state->is_web;
+
*weight = 0;
*prefix = false;
@@ -232,28 +261,76 @@ gettoken_query(TSQueryParserState state,
{
case WAITFIRSTOPERAND:
case WAITOPERAND:
- if (t_iseq(state->buf, '!'))
+ if ((! is_web && t_iseq(state->buf, '!')) ||
+ (is_web && t_iseq(state->buf, '-')))
{
- (state->buf)++; /* can safely ++, t_iseq guarantee that
- * pg_mblen()==1 */
- *operator = OP_NOT;
+ state->buf++;
state->state = WAITOPERAND;
+
+ if (state->in_quotes)
+ continue;
+
+ *operator = OP_NOT;
return PT_OPR;
}
else if (t_iseq(state->buf, '('))
{
- state->count++;
- (state->buf)++;
+ state->buf++;
state->state = WAITOPERAND;
+
+ if (is_web)
+ continue;
+
+ state->count++;
return PT_OPEN;
}
else if (t_iseq(state->buf, ':'))
{
+ state->buf++;
+ state->state = WAITOPERAND;
+
+ if (is_web)
+ continue;
+
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("syntax error in tsquery: \"%s\"",
state->buffer)));
}
+ else if (is_web && t_iseq(state->buf, '"'))
+ {
+ state->buf++;
+
+ if (!state->in_quotes)
+ {
+ state->state = WAITOPERAND;
+
+ if (strchr(state->buf, '"'))
+ {
+ /* quoted text should be ordered <-> */
+ state->in_quotes = true;
+ return PT_OPEN;
+ }
+
+ /* web search tolerates missing quotes */
+ continue;
+ }
+ else
+ {
+ /* we have to provide an operand */
+ state->in_quotes = false;
+ state->state = WAITOPERATOR;
+ pushStop(state);
+ return PT_CLOSE;
+ }
+ }
+ else if (is_web && ISOPERATOR(state->buf))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ state->buf++;
+ state->state = WAITOPERAND;
+ continue;
+ }
else if (!t_isspace(state->buf))
{
/*
@@ -263,12 +340,24 @@ gettoken_query(TSQueryParserState state,
reset_tsvector_parser(state->valstate, state->buf);
if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
{
- state->buf = get_modifiers(state->buf, weight, prefix);
+ if (!is_web)
+ {
+ /* web search does not support weights */
+ state->buf = get_modifiers(state->buf, weight, prefix);
+ }
state->state = WAITOPERATOR;
return PT_VAL;
}
else if (state->state == WAITFIRSTOPERAND)
+ {
return PT_END;
+ }
+ else if (is_web)
+ {
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -277,40 +366,90 @@ gettoken_query(TSQueryParserState state,
}
break;
case WAITOPERATOR:
- if (t_iseq(state->buf, '&'))
+ if (! is_web && t_iseq(state->buf, '&'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_AND;
- (state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '|'))
+ else if (! is_web && t_iseq(state->buf, '|'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_OR;
- (state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '<'))
+ else if (! is_web && t_iseq(state->buf, '<'))
{
- state->state = WAITOPERAND;
- *operator = OP_PHRASE;
/* weight var is used as storage for distance */
state->buf = parse_phrase_operator(state->buf, weight);
+ state->state = WAITOPERAND;
+ *operator = OP_PHRASE;
if (*weight < 0)
return PT_ERR;
return PT_OPR;
}
- else if (t_iseq(state->buf, ')'))
+ else if (! is_web && t_iseq(state->buf, ')'))
{
- (state->buf)++;
+ state->buf++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
+ else if (is_web && t_iseq(state->buf, '"'))
+ {
+ if (!state->in_quotes)
+ {
+ /*
+ * put implicit AND after an operand
+ * and handle this quote in WAITOPERAND
+ */
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
+ else
+ {
+ state->buf++;
+
+ /* just close quotes */
+ state->in_quotes = false;
+ return PT_CLOSE;
+ }
+ }
+ else if (is_web && parse_or_operator(state))
+ {
+ state->buf += 2; /* strlen("OR") */
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ return PT_OPR;
+ }
else if (*(state->buf) == '\0')
- return (state->count) ? PT_ERR : PT_END;
+ {
+ /* web search tolerates unexpected end of line */
+ return (!is_web && state->count) ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
+ if (is_web)
+ {
+ if (state->in_quotes)
+ {
+ /* put implicit <-> after an operand */
+ *operator = OP_PHRASE;
+ *weight = 1;
+ }
+ else
+ {
+ /* put implicit AND after an operand */
+ *operator = OP_AND;
+ }
+
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
return PT_ERR;
+ }
break;
case WAITSINGLEOPERAND:
if (*(state->buf) == '\0')
@@ -320,9 +459,6 @@ gettoken_query(TSQueryParserState state,
state->buf += strlen(state->buf);
state->count++;
return PT_VAL;
- default:
- return PT_ERR;
- break;
}
state->buf += pg_mblen(state->buf);
}
@@ -605,7 +741,7 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ int flags)
{
struct TSQueryParserStateData state;
int i;
@@ -613,17 +749,28 @@ parse_tsquery(char *buf,
int commonlen;
QueryItem *ptr;
ListCell *cell;
- bool needcleanup;
+ bool needcleanup,
+ is_plain,
+ is_web;
+ int tsv_flags = P_TSV_OPR_IS_DELIM | P_TSV_IS_TSQUERY;
+
+ is_plain = (flags & P_TSQ_PLAIN) != 0;
+ is_web = (flags & P_TSQ_WEB) != 0;
+
+ if (is_web)
+ tsv_flags |= P_TSV_IS_WEB;
/* init state */
state.buffer = buf;
state.buf = buf;
- state.state = (isplain) ? WAITSINGLEOPERAND : WAITFIRSTOPERAND;
state.count = 0;
+ state.in_quotes = false;
+ state.is_web = is_web;
+ state.state = is_plain ? WAITSINGLEOPERAND : WAITFIRSTOPERAND;
state.polstr = NIL;
/* init value parser's state */
- state.valstate = init_tsvector_parser(state.buffer, true, true);
+ state.valstate = init_tsvector_parser(state.buffer, tsv_flags);
/* init list of operand */
state.sumlen = 0;
@@ -716,7 +863,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), 0));
}
/*
diff --git a/src/backend/utils/adt/tsvector.c b/src/backend/utils/adt/tsvector.c
index 64e02ef434..7a27bd12a3 100644
--- a/src/backend/utils/adt/tsvector.c
+++ b/src/backend/utils/adt/tsvector.c
@@ -200,7 +200,7 @@ tsvectorin(PG_FUNCTION_ARGS)
char *cur;
int buflen = 256; /* allocated size of tmpbuf */
- state = init_tsvector_parser(buf, false, false);
+ state = init_tsvector_parser(buf, 0);
arrlen = 64;
arr = (WordEntryIN *) palloc(sizeof(WordEntryIN) * arrlen);
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 7367ba6a40..fed411a842 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -33,6 +33,7 @@ struct TSVectorParseStateData
int eml; /* max bytes per character */
bool oprisdelim; /* treat ! | * ( ) as delimiters? */
bool is_tsquery; /* say "tsquery" not "tsvector" in errors? */
+ bool is_web; /* we're in websearch_to_tsquery() */
};
@@ -42,7 +43,7 @@ struct TSVectorParseStateData
* ! | & ( )
*/
TSVectorParseState
-init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
+init_tsvector_parser(char *input, int flags)
{
TSVectorParseState state;
@@ -52,8 +53,9 @@ init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
state->len = 32;
state->word = (char *) palloc(state->len);
state->eml = pg_database_encoding_max_length();
- state->oprisdelim = oprisdelim;
- state->is_tsquery = is_tsquery;
+ state->oprisdelim = (flags & P_TSV_OPR_IS_DELIM) != 0;
+ state->is_tsquery = (flags & P_TSV_IS_TSQUERY) != 0;
+ state->is_web = (flags & P_TSV_IS_WEB) != 0;
return state;
}
@@ -89,16 +91,6 @@ do { \
} \
} while (0)
-/* phrase operator begins with '<' */
-#define ISOPERATOR(x) \
- ( pg_mblen(x) == 1 && ( *(x) == '!' || \
- *(x) == '&' || \
- *(x) == '|' || \
- *(x) == '(' || \
- *(x) == ')' || \
- *(x) == '<' \
- ) )
-
/* Fills gettoken_tsvector's output parameters, and returns true */
#define RETURN_TOKEN \
do { \
@@ -183,14 +175,15 @@ gettoken_tsvector(TSVectorParseState state,
{
if (*(state->prsbuf) == '\0')
return false;
- else if (t_iseq(state->prsbuf, '\''))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\''))
statecode = WAITENDCMPLX;
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
- else if (state->oprisdelim && ISOPERATOR(state->prsbuf))
+ else if ((state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
PRSSYNTAXERROR;
else if (!t_isspace(state->prsbuf))
{
@@ -217,13 +210,14 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDWORD)
{
- if (t_iseq(state->prsbuf, '\\'))
+ if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
else if (t_isspace(state->prsbuf) || *(state->prsbuf) == '\0' ||
- (state->oprisdelim && ISOPERATOR(state->prsbuf)))
+ (state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
{
RESIZEPRSBUF;
if (curpos == state->word)
@@ -250,11 +244,11 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
statecode = WAITCHARCMPLX;
}
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDCMPLX;
@@ -270,7 +264,7 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITCHARCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
RESIZEPRSBUF;
COPYCHAR(curpos, state->prsbuf);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 90d994c71a..560416636b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4966,6 +4966,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4974,6 +4976,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..73e969fe9c 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -25,9 +25,11 @@
struct TSVectorParseStateData; /* opaque struct in tsvector_parser.c */
typedef struct TSVectorParseStateData *TSVectorParseState;
-extern TSVectorParseState init_tsvector_parser(char *input,
- bool oprisdelim,
- bool is_tsquery);
+#define P_TSV_OPR_IS_DELIM (1 << 0)
+#define P_TSV_IS_TSQUERY (1 << 1)
+#define P_TSV_IS_WEB (1 << 2)
+
+extern TSVectorParseState init_tsvector_parser(char *input, int flags);
extern void reset_tsvector_parser(TSVectorParseState state, char *input);
extern bool gettoken_tsvector(TSVectorParseState state,
char **token, int *len,
@@ -35,6 +37,16 @@ extern bool gettoken_tsvector(TSVectorParseState state,
char **endptr);
extern void close_tsvector_parser(TSVectorParseState state);
+/* phrase operator begins with '<' */
+#define ISOPERATOR(x) \
+ ( pg_mblen(x) == 1 && ( *(x) == '!' || \
+ *(x) == '&' || \
+ *(x) == '|' || \
+ *(x) == '(' || \
+ *(x) == ')' || \
+ *(x) == '<' \
+ ) )
+
/* parse_tsquery */
struct TSQueryParserStateData; /* private in backend/utils/adt/tsquery.c */
@@ -46,9 +58,13 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
* QueryOperand struct */
bool prefix);
+#define P_TSQ_PLAIN (1 << 0)
+#define P_TSQ_WEB (1 << 1)
+
extern TSQuery parse_tsquery(char *buf,
- PushFunction pushval,
- Datum opaque, bool isplain);
+ PushFunction pushval,
+ Datum opaque,
+ int flags);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..37825a1130 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,432 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+ websearch_to_tsquery
+---------------------------------------------
+ 'i' & 'have' & 'a' & 'fat' & 'abcd' & 'cat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+ websearch_to_tsquery
+-----------------------
+ 'orange' & 'aabbccdd'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+ websearch_to_tsquery
+-----------------------------------------
+ 'fat' & 'a' & 'cat' & 'b' & 'rat' & 'c'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & 'a' & 'cat' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'a:::b');
+ websearch_to_tsquery
+----------------------
+ 'a' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:d');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'd'
+(1 row)
+
+select websearch_to_tsquery('simple', ':');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc | def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc <-> def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+ websearch_to_tsquery
+------------------------
+ 'abc' & 'pg' | 'class'
+(1 row)
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+-----------------------
+ 'fat' & 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' | 'fat' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc orтест');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orтест'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or'
+(1 row)
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'pg_class pg"');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg"');
+ websearch_to_tsquery
+-----------------------------
+ ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+ websearch_to_tsquery
+-------------------------------------
+ 'abc' & ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg" def');
+ websearch_to_tsquery
+-------------------------------------
+ ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+ websearch_to_tsquery
+------------------------------------------------------
+ 'abc' & 'pg' <-> ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+ websearch_to_tsquery
+--------------------------------------
+ 'pg' <-> ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+ websearch_to_tsquery
+------------------------------
+ 'pg' & 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc """"" def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+ websearch_to_tsquery
+------------------------------
+ 'cat' & !( 'fat' <-> 'rat' )
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+ websearch_to_tsquery
+----------------------------------------
+ 'cat' & !( 'fat' <-> 'rat' ) & 'chees'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def -"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def :"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' & 'eaten' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+ websearch_to_tsquery
+----------------------------------------
+ !'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+------------------------
+ 'old' & 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('''abc''''def''');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('\abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('\');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..f299a5d32b 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,98 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+select websearch_to_tsquery('simple', 'abc:def');
+select websearch_to_tsquery('simple', 'a:::b');
+select websearch_to_tsquery('simple', 'abc:d');
+select websearch_to_tsquery('simple', ':');
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+select websearch_to_tsquery('simple', 'abc | def');
+select websearch_to_tsquery('simple', 'abc <-> def');
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'or OR or');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc orтест');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+select websearch_to_tsquery('simple', 'abc or');
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+select websearch_to_tsquery('english', 'pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg"');
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg" def');
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+select websearch_to_tsquery('english', 'abc """"" def');
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+select websearch_to_tsquery('english', 'abc "def -"');
+select websearch_to_tsquery('english', 'abc "def :"');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+select websearch_to_tsquery('''abc''''def''');
+select websearch_to_tsquery('\abc');
+select websearch_to_tsquery('\');
Hi everyone,
The code in its current state looks messy and far too complicated;
there are lots of interleaved code branches. I therefore decided to split
gettoken_query() into three independent tokenizers for the phrase, web and
original (to_tsquery()) syntaxes. Documentation is included. Any
feedback is very welcome.
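
To illustrate the split, here is a minimal standalone sketch of choosing a tokenizer once from a flags argument and then calling it through a function pointer. It is not the patch code: the SKETCH_* flags, the sketch_* types and the toy tokenizers are hypothetical stand-ins for P_TSQ_PLAIN / P_TSQ_WEB, ts_tokenizer and the gettoken_query_* functions, and they only split on separator characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's P_TSQ_* flags. */
#define SKETCH_TSQ_PLAIN (1 << 0)
#define SKETCH_TSQ_WEB   (1 << 1)

typedef enum { TOK_END, TOK_VAL } sketch_token;

/* Every tokenizer advances *cursor and reports what it found. */
typedef sketch_token (*sketch_tokenizer)(const char **cursor);

/* Shared helper: skip separators, then consume one word. */
static sketch_token
scan_word(const char **cursor, const char *separators)
{
    while (**cursor != '\0' && strchr(separators, **cursor) != NULL)
        (*cursor)++;
    if (**cursor == '\0')
        return TOK_END;
    while (**cursor != '\0' && strchr(separators, **cursor) == NULL)
        (*cursor)++;
    return TOK_VAL;
}

/* "standard" toy tokenizer: only spaces separate words */
static sketch_token
tokenize_standard(const char **cursor)
{
    return scan_word(cursor, " ");
}

/* "web" toy tokenizer: double quotes also end a word */
static sketch_token
tokenize_web(const char **cursor)
{
    return scan_word(cursor, " \"");
}

/* "plain" toy tokenizer: the whole remaining input is one value */
static sketch_token
tokenize_plain(const char **cursor)
{
    if (**cursor == '\0')
        return TOK_END;
    *cursor += strlen(*cursor);
    return TOK_VAL;
}

/* Pick the tokenizer once, up front. */
static sketch_tokenizer
select_tokenizer(int flags)
{
    int is_plain = (flags & SKETCH_TSQ_PLAIN) != 0;
    int is_web = (flags & SKETCH_TSQ_WEB) != 0;

    /* plain should not be combined with web */
    assert(!(is_plain && is_web));

    if (is_plain)
        return tokenize_plain;
    if (is_web)
        return tokenize_web;
    return tokenize_standard;
}

/* Drive the selected tokenizer until it reports the end of input. */
static int
count_tokens(const char *input, int flags)
{
    sketch_tokenizer gettoken = select_tokenizer(flags);
    int count = 0;

    while (gettoken(&input) != TOK_END)
        count++;
    return count;
}
```

With this layout the driving loop never needs to know which syntax is being parsed; only select_tokenizer() looks at the flags.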
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
websearch_to_tsquery_v6.diff (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5abb1c46fb..c3b7be6e4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry>
<entry><literal>'fat' <-> 'rat'</literal></entry>
</row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>websearch_to_tsquery</primary>
+ </indexterm>
+ <literal><function>websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type> , </optional> <replaceable class="parameter">query</replaceable> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</type> from a web search style query</entry>
+ <entry><literal>websearch_to_tsquery('english', '"fat rat" or rat')</literal></entry>
+ <entry><literal>'fat' <-> 'rat' | 'rat'</literal></entry>
+ </row>
<row>
<entry>
<indexterm>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
- <function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ <function>phraseto_tsquery</function>, but it is less forgiving about its
+ input. <function>websearch_to_tsquery</function> is a simplified version
+ of <function>to_tsquery</function> with an alternative syntax, similar
+ to the one used by web search engines.
</para>
<indexterm>
@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
</screen>
</para>
+<synopsis>
+websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
+</synopsis>
+
+ <para>
+ <function>websearch_to_tsquery</function> creates a <type>tsquery</type>
+ value from <replaceable>querytext</replaceable> using an alternative
+ syntax in which simple unformatted text is a valid query.
+ Unlike <function>plainto_tsquery</function>
+ and <function>phraseto_tsquery</function>, it also recognizes certain
+ operators. Moreover, this function should never raise syntax errors,
+ which makes it possible to use raw user-supplied input for search.
+ The following syntax is supported:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <literal>unquoted text</literal>: text not inside quote marks will be
+ converted to terms separated by <literal>&</literal> operators, as
+ if processed by
+ <function>plainto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>"quoted text"</literal>: text inside quote marks will be
+ converted to terms separated by <literal><-></literal>
+ operators, as if processed by <function>phraseto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>OR</literal>: logical or will be converted to
+ the <literal>|</literal> operator.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>-</literal>: the logical not operator, converted to
+ the <literal>!</literal> operator.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Examples:
+ <screen>
+ select websearch_to_tsquery('english', 'The fat rats');
+ websearch_to_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+ -----------------------------------
+ 'sad' <-> 'cat' | 'fat' <-> 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', 'signal -"segmentation fault"');
+ websearch_to_tsquery
+ ---------------------------------------
+ 'signal' & !( 'segment' <-> 'fault' )
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '""" )( dummy \\ query <->');
+ websearch_to_tsquery
+ ----------------------
+ 'dummi' & 'queri'
+ (1 row)
+ </screen>
+ </para>
</sect2>
<sect2 id="textsearch-ranking">
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- false);
+ 0);
PG_RETURN_TSQUERY(query);
}
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_POINTER(query);
}
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +567,35 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ P_TSQ_WEB);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..d5df8b5506 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,14 +32,53 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3
+} ts_parserstate;
+
+/*
+ * token types for parsing
+ */
+typedef enum
+{
+ PT_END = 0,
+ PT_ERR = 1,
+ PT_VAL = 2,
+ PT_OPR = 3,
+ PT_OPEN = 4,
+ PT_CLOSE = 5
+} ts_tokentype;
+
+/*
+ * get token from query string
+ *
+ * *operator is filled in with OP_* when return value is PT_OPR,
+ * but *weight could contain a distance value in case of phrase operator.
+ * *strval, *lenval and *weight are filled in when return value is PT_VAL
+ *
+ */
+typedef ts_tokentype ts_tokenizer(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix);
+
struct TSQueryParserStateData
{
- /* State for gettoken_query */
+ /* Tokenizer used for parsing tsquery */
+ ts_tokenizer *gettoken;
+
+ /* State of tokenizer function */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
int count; /* nesting count, incremented by (,
* decremented by ) */
+ bool in_quotes; /* phrase in quotes "" */
+ ts_parserstate state;
/* polish (prefix) notation in list, filled in by push* functions */
List *polstr;
@@ -57,12 +96,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -198,35 +231,34 @@ err:
}
/*
- * token types for parsing
+ * Parse OR operator used in websearch_to_tsquery().
*/
-typedef enum
+static bool
+parse_or_operator(TSQueryParserState state)
{
- PT_END = 0,
- PT_ERR = 1,
- PT_VAL = 2,
- PT_OPR = 3,
- PT_OPEN = 4,
- PT_CLOSE = 5
-} ts_tokentype;
+ char *buf = state->buf;
+
+ if (state->in_quotes)
+ return false;
+
+ return (t_iseq(&buf[0], 'o') || t_iseq(&buf[0], 'O')) &&
+ (t_iseq(&buf[1], 'r') || t_iseq(&buf[1], 'R')) &&
+ (buf[2] != '\0' &&
+ !t_iseq(&buf[2], '-') &&
+ !t_iseq(&buf[2], '_') &&
+ !t_isalpha(&buf[2]) &&
+ !t_isdigit(&buf[2]));
+}
-/*
- * get token from query string
- *
- * *operator is filled in with OP_* when return values is PT_OPR,
- * but *weight could contain a distance value in case of phrase operator.
- * *strval, *lenval and *weight are filled in when return value is PT_VAL
- *
- */
static ts_tokentype
-gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+gettoken_query_standard(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
{
*weight = 0;
*prefix = false;
- while (1)
+ while (true)
{
switch (state->state)
{
@@ -234,17 +266,20 @@ gettoken_query(TSQueryParserState state,
case WAITOPERAND:
if (t_iseq(state->buf, '!'))
{
- (state->buf)++; /* can safely ++, t_iseq guarantee that
- * pg_mblen()==1 */
- *operator = OP_NOT;
+ state->buf++;
state->state = WAITOPERAND;
+
+ if (state->in_quotes)
+ continue;
+
+ *operator = OP_NOT;
return PT_OPR;
}
else if (t_iseq(state->buf, '('))
{
- state->count++;
- (state->buf)++;
+ state->buf++;
state->state = WAITOPERAND;
+ state->count++;
return PT_OPEN;
}
else if (t_iseq(state->buf, ':'))
@@ -256,10 +291,7 @@ gettoken_query(TSQueryParserState state,
}
else if (!t_isspace(state->buf))
{
- /*
- * We rely on the tsvector parser to parse the value for
- * us
- */
+ /* We rely on the tsvector parser to parse the value for us */
reset_tsvector_parser(state->valstate, state->buf);
if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
{
@@ -268,7 +300,9 @@ gettoken_query(TSQueryParserState state,
return PT_VAL;
}
else if (state->state == WAITFIRSTOPERAND)
+ {
return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -276,58 +310,209 @@ gettoken_query(TSQueryParserState state,
state->buffer)));
}
break;
+
case WAITOPERATOR:
if (t_iseq(state->buf, '&'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_AND;
- (state->buf)++;
return PT_OPR;
}
else if (t_iseq(state->buf, '|'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_OR;
- (state->buf)++;
return PT_OPR;
}
else if (t_iseq(state->buf, '<'))
{
- state->state = WAITOPERAND;
- *operator = OP_PHRASE;
/* weight var is used as storage for distance */
state->buf = parse_phrase_operator(state->buf, weight);
+ state->state = WAITOPERAND;
+ *operator = OP_PHRASE;
if (*weight < 0)
return PT_ERR;
return PT_OPR;
}
else if (t_iseq(state->buf, ')'))
{
- (state->buf)++;
+ state->buf++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
- else if (*(state->buf) == '\0')
- return (state->count) ? PT_ERR : PT_END;
+ else if (*state->buf == '\0')
+ {
+ return state->count ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
return PT_ERR;
+ }
break;
- case WAITSINGLEOPERAND:
- if (*(state->buf) == '\0')
+ }
+
+ state->buf += pg_mblen(state->buf);
+ }
+}
+
+static ts_tokentype
+gettoken_query_websearch(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
+{
+ *weight = 0;
+ *prefix = false;
+
+ while (true)
+ {
+ switch (state->state)
+ {
+ case WAITFIRSTOPERAND:
+ case WAITOPERAND:
+ if (t_iseq(state->buf, '-'))
+ {
+ state->buf++;
+ state->state = WAITOPERAND;
+
+ if (state->in_quotes)
+ continue;
+
+ *operator = OP_NOT;
+ return PT_OPR;
+ }
+ else if (t_iseq(state->buf, '"'))
+ {
+ state->buf++;
+
+ if (!state->in_quotes)
+ {
+ state->state = WAITOPERAND;
+
+ if (strchr(state->buf, '"'))
+ {
+ /* quoted text should be ordered <-> */
+ state->in_quotes = true;
+ return PT_OPEN;
+ }
+
+ /* web search tolerates missing quotes */
+ continue;
+ }
+ else
+ {
+ /* we have to provide an operand */
+ state->in_quotes = false;
+ state->state = WAITOPERATOR;
+ pushStop(state);
+ return PT_CLOSE;
+ }
+ }
+ else if (ISOPERATOR(state->buf))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ state->buf++;
+ state->state = WAITOPERAND;
+ continue;
+ }
+ else if (!t_isspace(state->buf))
+ {
+ /* We rely on the tsvector parser to parse the value for us */
+ reset_tsvector_parser(state->valstate, state->buf);
+ if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
+ {
+ state->state = WAITOPERATOR;
+ return PT_VAL;
+ }
+ else if (state->state == WAITFIRSTOPERAND)
+ {
+ return PT_END;
+ }
+ else
+ {
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
+ }
+ break;
+
+ case WAITOPERATOR:
+ if (t_iseq(state->buf, '"'))
+ {
+ if (!state->in_quotes)
+ {
+ /*
+ * put implicit AND after an operand
+ * and handle this quote in WAITOPERAND
+ */
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
+ else
+ {
+ state->buf++;
+
+ /* just close quotes */
+ state->in_quotes = false;
+ return PT_CLOSE;
+ }
+ }
+ else if (parse_or_operator(state))
+ {
+ state->buf += 2; /* strlen("OR") */
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ return PT_OPR;
+ }
+ else if (*state->buf == '\0')
+ {
return PT_END;
- *strval = state->buf;
- *lenval = strlen(state->buf);
- state->buf += strlen(state->buf);
- state->count++;
- return PT_VAL;
- default:
- return PT_ERR;
+ }
+ else if (!t_isspace(state->buf))
+ {
+ if (state->in_quotes)
+ {
+ /* put implicit <-> after an operand */
+ *operator = OP_PHRASE;
+ *weight = 1;
+ }
+ else
+ {
+ /* put implicit AND after an operand */
+ *operator = OP_AND;
+ }
+
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
break;
}
+
state->buf += pg_mblen(state->buf);
}
}
+static ts_tokentype
+gettoken_query_plain(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
+{
+ *weight = 0;
+ *prefix = false;
+
+ if (*state->buf == '\0')
+ return PT_END;
+
+ *strval = state->buf;
+ *lenval = strlen(state->buf);
+ state->buf += strlen(state->buf);
+ state->count++;
+ return PT_VAL;
+}
+
/*
* Push an operator to state->polstr
*/
@@ -489,7 +674,9 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = state->gettoken(state, &operator,
+ &lenval, &strval,
+ &weight, &prefix)) != PT_END)
{
switch (type)
{
@@ -605,7 +792,7 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ int flags)
{
struct TSQueryParserStateData state;
int i;
@@ -613,17 +800,38 @@ parse_tsquery(char *buf,
int commonlen;
QueryItem *ptr;
ListCell *cell;
- bool needcleanup;
+ bool needcleanup,
+ is_plain,
+ is_web;
+ int tsv_flags = P_TSV_OPR_IS_DELIM | P_TSV_IS_TSQUERY;
+
+ is_plain = (flags & P_TSQ_PLAIN) != 0;
+ is_web = (flags & P_TSQ_WEB) != 0;
+
+ /* plain should not be used with web */
+ Assert(!(is_plain && is_web));
+
+ if (is_web)
+ tsv_flags |= P_TSV_IS_WEB;
+
+ /* select suitable tokenizer */
+ if (is_plain)
+ state.gettoken = gettoken_query_plain;
+ else if (is_web)
+ state.gettoken = gettoken_query_websearch;
+ else
+ state.gettoken = gettoken_query_standard;
/* init state */
state.buffer = buf;
state.buf = buf;
- state.state = (isplain) ? WAITSINGLEOPERAND : WAITFIRSTOPERAND;
state.count = 0;
+ state.in_quotes = false;
+ state.state = WAITFIRSTOPERAND;
state.polstr = NIL;
/* init value parser's state */
- state.valstate = init_tsvector_parser(state.buffer, true, true);
+ state.valstate = init_tsvector_parser(state.buffer, tsv_flags);
/* init list of operand */
state.sumlen = 0;
@@ -716,7 +924,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), 0));
}
/*
diff --git a/src/backend/utils/adt/tsvector.c b/src/backend/utils/adt/tsvector.c
index 64e02ef434..7a27bd12a3 100644
--- a/src/backend/utils/adt/tsvector.c
+++ b/src/backend/utils/adt/tsvector.c
@@ -200,7 +200,7 @@ tsvectorin(PG_FUNCTION_ARGS)
char *cur;
int buflen = 256; /* allocated size of tmpbuf */
- state = init_tsvector_parser(buf, false, false);
+ state = init_tsvector_parser(buf, 0);
arrlen = 64;
arr = (WordEntryIN *) palloc(sizeof(WordEntryIN) * arrlen);
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 7367ba6a40..fed411a842 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -33,6 +33,7 @@ struct TSVectorParseStateData
int eml; /* max bytes per character */
bool oprisdelim; /* treat ! | * ( ) as delimiters? */
bool is_tsquery; /* say "tsquery" not "tsvector" in errors? */
+ bool is_web; /* we're in websearch_to_tsquery() */
};
@@ -42,7 +43,7 @@ struct TSVectorParseStateData
* ! | & ( )
*/
TSVectorParseState
-init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
+init_tsvector_parser(char *input, int flags)
{
TSVectorParseState state;
@@ -52,8 +53,9 @@ init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
state->len = 32;
state->word = (char *) palloc(state->len);
state->eml = pg_database_encoding_max_length();
- state->oprisdelim = oprisdelim;
- state->is_tsquery = is_tsquery;
+ state->oprisdelim = (flags & P_TSV_OPR_IS_DELIM) != 0;
+ state->is_tsquery = (flags & P_TSV_IS_TSQUERY) != 0;
+ state->is_web = (flags & P_TSV_IS_WEB) != 0;
return state;
}
@@ -89,16 +91,6 @@ do { \
} \
} while (0)
-/* phrase operator begins with '<' */
-#define ISOPERATOR(x) \
- ( pg_mblen(x) == 1 && ( *(x) == '!' || \
- *(x) == '&' || \
- *(x) == '|' || \
- *(x) == '(' || \
- *(x) == ')' || \
- *(x) == '<' \
- ) )
-
/* Fills gettoken_tsvector's output parameters, and returns true */
#define RETURN_TOKEN \
do { \
@@ -183,14 +175,15 @@ gettoken_tsvector(TSVectorParseState state,
{
if (*(state->prsbuf) == '\0')
return false;
- else if (t_iseq(state->prsbuf, '\''))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\''))
statecode = WAITENDCMPLX;
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
- else if (state->oprisdelim && ISOPERATOR(state->prsbuf))
+ else if ((state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
PRSSYNTAXERROR;
else if (!t_isspace(state->prsbuf))
{
@@ -217,13 +210,14 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDWORD)
{
- if (t_iseq(state->prsbuf, '\\'))
+ if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
else if (t_isspace(state->prsbuf) || *(state->prsbuf) == '\0' ||
- (state->oprisdelim && ISOPERATOR(state->prsbuf)))
+ (state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
{
RESIZEPRSBUF;
if (curpos == state->word)
@@ -250,11 +244,11 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
statecode = WAITCHARCMPLX;
}
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDCMPLX;
@@ -270,7 +264,7 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITCHARCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
RESIZEPRSBUF;
COPYCHAR(curpos, state->prsbuf);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 90d994c71a..560416636b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4966,6 +4966,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4974,6 +4976,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..73e969fe9c 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -25,9 +25,11 @@
struct TSVectorParseStateData; /* opaque struct in tsvector_parser.c */
typedef struct TSVectorParseStateData *TSVectorParseState;
-extern TSVectorParseState init_tsvector_parser(char *input,
- bool oprisdelim,
- bool is_tsquery);
+#define P_TSV_OPR_IS_DELIM (1 << 0)
+#define P_TSV_IS_TSQUERY (1 << 1)
+#define P_TSV_IS_WEB (1 << 2)
+
+extern TSVectorParseState init_tsvector_parser(char *input, int flags);
extern void reset_tsvector_parser(TSVectorParseState state, char *input);
extern bool gettoken_tsvector(TSVectorParseState state,
char **token, int *len,
@@ -35,6 +37,16 @@ extern bool gettoken_tsvector(TSVectorParseState state,
char **endptr);
extern void close_tsvector_parser(TSVectorParseState state);
+/* phrase operator begins with '<' */
+#define ISOPERATOR(x) \
+ ( pg_mblen(x) == 1 && ( *(x) == '!' || \
+ *(x) == '&' || \
+ *(x) == '|' || \
+ *(x) == '(' || \
+ *(x) == ')' || \
+ *(x) == '<' \
+ ) )
+
/* parse_tsquery */
struct TSQueryParserStateData; /* private in backend/utils/adt/tsquery.c */
@@ -46,9 +58,13 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
* QueryOperand struct */
bool prefix);
+#define P_TSQ_PLAIN (1 << 0)
+#define P_TSQ_WEB (1 << 1)
+
extern TSQuery parse_tsquery(char *buf,
- PushFunction pushval,
- Datum opaque, bool isplain);
+ PushFunction pushval,
+ Datum opaque,
+ int flags);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..37825a1130 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,432 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+ websearch_to_tsquery
+---------------------------------------------
+ 'i' & 'have' & 'a' & 'fat' & 'abcd' & 'cat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+ websearch_to_tsquery
+-----------------------
+ 'orange' & 'aabbccdd'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+ websearch_to_tsquery
+-----------------------------------------
+ 'fat' & 'a' & 'cat' & 'b' & 'rat' & 'c'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & 'a' & 'cat' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'a:::b');
+ websearch_to_tsquery
+----------------------
+ 'a' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:d');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'd'
+(1 row)
+
+select websearch_to_tsquery('simple', ':');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc | def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc <-> def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+ websearch_to_tsquery
+------------------------
+ 'abc' & 'pg' | 'class'
+(1 row)
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+-----------------------
+ 'fat' & 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' | 'fat' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc orтест');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orтест'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or'
+(1 row)
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'pg_class pg"');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg"');
+ websearch_to_tsquery
+-----------------------------
+ ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+ websearch_to_tsquery
+-------------------------------------
+ 'abc' & ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg" def');
+ websearch_to_tsquery
+-------------------------------------
+ ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+ websearch_to_tsquery
+------------------------------------------------------
+ 'abc' & 'pg' <-> ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+ websearch_to_tsquery
+--------------------------------------
+ 'pg' <-> ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+ websearch_to_tsquery
+------------------------------
+ 'pg' & 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc """"" def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+ websearch_to_tsquery
+------------------------------
+ 'cat' & !( 'fat' <-> 'rat' )
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+ websearch_to_tsquery
+----------------------------------------
+ 'cat' & !( 'fat' <-> 'rat' ) & 'chees'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def -"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def :"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' & 'eaten' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+ websearch_to_tsquery
+----------------------------------------
+ !'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+------------------------
+ 'old' & 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('''abc''''def''');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('\abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('\');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..f299a5d32b 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,98 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+select websearch_to_tsquery('simple', 'abc:def');
+select websearch_to_tsquery('simple', 'a:::b');
+select websearch_to_tsquery('simple', 'abc:d');
+select websearch_to_tsquery('simple', ':');
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+select websearch_to_tsquery('simple', 'abc | def');
+select websearch_to_tsquery('simple', 'abc <-> def');
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'or OR or');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc orтест');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+select websearch_to_tsquery('simple', 'abc or');
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+select websearch_to_tsquery('english', 'pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg"');
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg" def');
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+select websearch_to_tsquery('english', 'abc """"" def');
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+select websearch_to_tsquery('english', 'abc "def -"');
+select websearch_to_tsquery('english', 'abc "def :"');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+select websearch_to_tsquery('''abc''''def''');
+select websearch_to_tsquery('\abc');
+select websearch_to_tsquery('\');
In its current state the code looks messy and far too complicated, with
many interleaved code branches. I therefore decided to split
gettoken_query() into three independent tokenizers for the phrase, web
and original (to_tsquery()) syntaxes. Documentation is included. Any
feedback is very welcome.
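To illustrate the idea of the split, here is a minimal standalone sketch of
selecting a tokenizer once through a function pointer, driven by a flags
argument. It is not the code from the patch: every name in it (SKETCH_PLAIN,
SKETCH_WEB, sketch_state, tok_plain, tok_standard, tok_web, sketch_init,
sketch_count_tokens) is hypothetical, and the tokenizers only split on a few
delimiter characters instead of implementing the real tsquery grammar.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits, loosely modeled on P_TSQ_PLAIN / P_TSQ_WEB. */
#define SKETCH_PLAIN (1 << 0)
#define SKETCH_WEB   (1 << 1)

typedef struct sketch_state sketch_state;

/* A tokenizer returns the length of the next token, or 0 at end of input. */
typedef size_t (*sketch_tokenizer) (sketch_state *st, const char **tok);

struct sketch_state
{
	sketch_tokenizer gettoken;	/* chosen once, then called by the parse loop */
	const char *buf;			/* current scan point */
};

/* "plain" mode: the whole remaining input is handed over as one token */
static size_t
tok_plain(sketch_state *st, const char **tok)
{
	size_t		len = strlen(st->buf);

	*tok = st->buf;
	st->buf += len;
	return len;
}

/* shared helper: skip leading delimiters, then return the next run of text */
static size_t
split_on(sketch_state *st, const char **tok, const char *delims)
{
	size_t		len;

	st->buf += strspn(st->buf, delims);
	len = strcspn(st->buf, delims);
	*tok = st->buf;
	st->buf += len;
	return len;
}

/* "standard" mode: operator characters act as delimiters (simplified) */
static size_t
tok_standard(sketch_state *st, const char **tok)
{
	return split_on(st, tok, " &|!()");
}

/* "web" mode: only spaces and double quotes delimit words (simplified) */
static size_t
tok_web(sketch_state *st, const char **tok)
{
	return split_on(st, tok, " \"");
}

static void
sketch_init(sketch_state *st, const char *input, int flags)
{
	/* plain should not be combined with web */
	assert(!((flags & SKETCH_PLAIN) && (flags & SKETCH_WEB)));

	st->buf = input;
	if (flags & SKETCH_PLAIN)
		st->gettoken = tok_plain;
	else if (flags & SKETCH_WEB)
		st->gettoken = tok_web;
	else
		st->gettoken = tok_standard;
}

/* The parse loop never needs to know which syntax is in use. */
static int
sketch_count_tokens(const char *input, int flags)
{
	sketch_state st;
	const char *tok;
	int			n = 0;

	sketch_init(&st, input, flags);
	while (st.gettoken(&st, &tok) > 0)
		n++;
	return n;
}
```

The point of this layout is that the loop in sketch_count_tokens() stays the
same for every syntax, so each tokenizer can be read and changed on its own.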
I'm sorry, I completely forgot to fix a few more things; the patch is
attached below.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
websearch_to_tsquery_v7.diff (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5abb1c46fb..c3b7be6e4e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9609,6 +9609,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry>
<entry><literal>'fat' <-> 'rat'</literal></entry>
</row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>websearch_to_tsquery</primary>
+ </indexterm>
+ <literal><function>websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type> , </optional> <replaceable class="parameter">query</replaceable> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</type> from a web search style query</entry>
+ <entry><literal>websearch_to_tsquery('english', '"fat rat" or rat')</literal></entry>
+ <entry><literal>'fat' <-> 'rat' | 'rat'</literal></entry>
+ </row>
<row>
<entry>
<indexterm>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
- <function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ <function>phraseto_tsquery</function>, but it is less forgiving about its
+ input. <function>websearch_to_tsquery</function> is a simplified version
+ of <function>to_tsquery</function> with an alternative syntax, similar
+ to the one used by web search engines.
</para>
<indexterm>
@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
</screen>
</para>
+<synopsis>
+websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
+</synopsis>
+
+ <para>
+ <function>websearch_to_tsquery</function> creates a <type>tsquery</type>
+ value from <replaceable>querytext</replaceable> using an alternative
+ syntax in which simple unformatted text is a valid query.
+ Unlike <function>plainto_tsquery</function>
+ and <function>phraseto_tsquery</function>, it also recognizes certain
+ operators. Moreover, this function should never raise syntax errors,
+ which makes it possible to use raw user-supplied input for search.
+ The following syntax is supported:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <literal>unquoted text</literal>: text not inside quote marks will be
+ converted to terms separated by <literal>&</literal> operators, as
+ if processed by
+ <function>plainto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>"quoted text"</literal>: text inside quote marks will be
+ converted to terms separated by <literal><-></literal>
+ operators, as if processed by <function>phraseto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>OR</literal>: logical or will be converted to
+ the <literal>|</literal> operator.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>-</literal>: the logical not operator, converted to the
+ <literal>!</literal> operator.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Examples:
+ <screen>
+ select websearch_to_tsquery('english', 'The fat rats');
+ websearch_to_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+ -----------------------------------
+ 'sad' <-> 'cat' | 'fat' <-> 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', 'signal -"segmentation fault"');
+ websearch_to_tsquery
+ ---------------------------------------
+ 'signal' & !( 'segment' <-> 'fault' )
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '""" )( dummy \\ query <->');
+ websearch_to_tsquery
+ ----------------------
+ 'dummi' & 'queri'
+ (1 row)
+ </screen>
+ </para>
</sect2>
<sect2 id="textsearch-ranking">
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- false);
+ 0);
PG_RETURN_TSQUERY(query);
}
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_POINTER(query);
}
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +567,35 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ P_TSQ_WEB);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..202db8875f 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,14 +32,53 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3
+} ts_parserstate;
+
+/*
+ * token types for parsing
+ */
+typedef enum
+{
+ PT_END = 0,
+ PT_ERR = 1,
+ PT_VAL = 2,
+ PT_OPR = 3,
+ PT_OPEN = 4,
+ PT_CLOSE = 5
+} ts_tokentype;
+
+/*
+ * get token from query string
+ *
+ * *operator is filled in with OP_* when return value is PT_OPR,
+ * but *weight could contain a distance value in case of phrase operator.
+ * *strval, *lenval and *weight are filled in when return value is PT_VAL
+ *
+ */
+typedef ts_tokentype (*ts_tokenizer)(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix);
+
struct TSQueryParserStateData
{
- /* State for gettoken_query */
+ /* Tokenizer used for parsing tsquery */
+ ts_tokenizer gettoken;
+
+ /* State of tokenizer function */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
int count; /* nesting count, incremented by (,
* decremented by ) */
+ bool in_quotes; /* phrase in quotes "" */
+ ts_parserstate state;
/* polish (prefix) notation in list, filled in by push* functions */
List *polstr;
@@ -57,12 +96,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -198,35 +231,34 @@ err:
}
/*
- * token types for parsing
+ * Parse OR operator used in websearch_to_tsquery().
*/
-typedef enum
+static bool
+parse_or_operator(TSQueryParserState state)
{
- PT_END = 0,
- PT_ERR = 1,
- PT_VAL = 2,
- PT_OPR = 3,
- PT_OPEN = 4,
- PT_CLOSE = 5
-} ts_tokentype;
+ char *buf = state->buf;
+
+ if (state->in_quotes)
+ return false;
+
+ return (t_iseq(&buf[0], 'o') || t_iseq(&buf[0], 'O')) &&
+ (t_iseq(&buf[1], 'r') || t_iseq(&buf[1], 'R')) &&
+ (buf[2] != '\0' &&
+ !t_iseq(&buf[2], '-') &&
+ !t_iseq(&buf[2], '_') &&
+ !t_isalpha(&buf[2]) &&
+ !t_isdigit(&buf[2]));
+}
-/*
- * get token from query string
- *
- * *operator is filled in with OP_* when return values is PT_OPR,
- * but *weight could contain a distance value in case of phrase operator.
- * *strval, *lenval and *weight are filled in when return value is PT_VAL
- *
- */
static ts_tokentype
-gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+gettoken_query_standard(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
{
*weight = 0;
*prefix = false;
- while (1)
+ while (true)
{
switch (state->state)
{
@@ -234,17 +266,16 @@ gettoken_query(TSQueryParserState state,
case WAITOPERAND:
if (t_iseq(state->buf, '!'))
{
- (state->buf)++; /* can safely ++, t_iseq guarantee that
- * pg_mblen()==1 */
- *operator = OP_NOT;
+ state->buf++;
state->state = WAITOPERAND;
+ *operator = OP_NOT;
return PT_OPR;
}
else if (t_iseq(state->buf, '('))
{
- state->count++;
- (state->buf)++;
+ state->buf++;
state->state = WAITOPERAND;
+ state->count++;
return PT_OPEN;
}
else if (t_iseq(state->buf, ':'))
@@ -256,10 +287,7 @@ gettoken_query(TSQueryParserState state,
}
else if (!t_isspace(state->buf))
{
- /*
- * We rely on the tsvector parser to parse the value for
- * us
- */
+ /* We rely on the tsvector parser to parse the value for us */
reset_tsvector_parser(state->valstate, state->buf);
if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
{
@@ -268,7 +296,9 @@ gettoken_query(TSQueryParserState state,
return PT_VAL;
}
else if (state->state == WAITFIRSTOPERAND)
+ {
return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -276,58 +306,209 @@ gettoken_query(TSQueryParserState state,
state->buffer)));
}
break;
+
case WAITOPERATOR:
if (t_iseq(state->buf, '&'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_AND;
- (state->buf)++;
return PT_OPR;
}
else if (t_iseq(state->buf, '|'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_OR;
- (state->buf)++;
return PT_OPR;
}
else if (t_iseq(state->buf, '<'))
{
- state->state = WAITOPERAND;
- *operator = OP_PHRASE;
/* weight var is used as storage for distance */
state->buf = parse_phrase_operator(state->buf, weight);
+ state->state = WAITOPERAND;
+ *operator = OP_PHRASE;
if (*weight < 0)
return PT_ERR;
return PT_OPR;
}
else if (t_iseq(state->buf, ')'))
{
- (state->buf)++;
+ state->buf++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
- else if (*(state->buf) == '\0')
+ else if (*state->buf == '\0')
+ {
return (state->count) ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
return PT_ERR;
+ }
break;
- case WAITSINGLEOPERAND:
- if (*(state->buf) == '\0')
+ }
+
+ state->buf += pg_mblen(state->buf);
+ }
+}
+
+static ts_tokentype
+gettoken_query_websearch(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
+{
+ *weight = 0;
+ *prefix = false;
+
+ while (true)
+ {
+ switch (state->state)
+ {
+ case WAITFIRSTOPERAND:
+ case WAITOPERAND:
+ if (t_iseq(state->buf, '-'))
+ {
+ state->buf++;
+ state->state = WAITOPERAND;
+
+ if (state->in_quotes)
+ continue;
+
+ *operator = OP_NOT;
+ return PT_OPR;
+ }
+ else if (t_iseq(state->buf, '"'))
+ {
+ state->buf++;
+
+ if (!state->in_quotes)
+ {
+ state->state = WAITOPERAND;
+
+ if (strchr(state->buf, '"'))
+ {
+ /* quoted text should be ordered <-> */
+ state->in_quotes = true;
+ return PT_OPEN;
+ }
+
+ /* web search tolerates missing quotes */
+ continue;
+ }
+ else
+ {
+ /* we have to provide an operand */
+ state->in_quotes = false;
+ state->state = WAITOPERATOR;
+ pushStop(state);
+ return PT_CLOSE;
+ }
+ }
+ else if (ISOPERATOR(state->buf))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ state->buf++;
+ state->state = WAITOPERAND;
+ continue;
+ }
+ else if (!t_isspace(state->buf))
+ {
+ /* We rely on the tsvector parser to parse the value for us */
+ reset_tsvector_parser(state->valstate, state->buf);
+ if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
+ {
+ state->state = WAITOPERATOR;
+ return PT_VAL;
+ }
+ else if (state->state == WAITFIRSTOPERAND)
+ {
+ return PT_END;
+ }
+ else
+ {
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
+ }
+ break;
+
+ case WAITOPERATOR:
+ if (t_iseq(state->buf, '"'))
+ {
+ if (!state->in_quotes)
+ {
+ /*
+ * put implicit AND after an operand
+ * and handle this quote in WAITOPERAND
+ */
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
+ else
+ {
+ state->buf++;
+
+ /* just close quotes */
+ state->in_quotes = false;
+ return PT_CLOSE;
+ }
+ }
+ else if (parse_or_operator(state))
+ {
+ state->buf += 2; /* strlen("OR") */
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ return PT_OPR;
+ }
+ else if (*state->buf == '\0')
+ {
return PT_END;
- *strval = state->buf;
- *lenval = strlen(state->buf);
- state->buf += strlen(state->buf);
- state->count++;
- return PT_VAL;
- default:
- return PT_ERR;
+ }
+ else if (!t_isspace(state->buf))
+ {
+ if (state->in_quotes)
+ {
+ /* put implicit <-> after an operand */
+ *operator = OP_PHRASE;
+ *weight = 1;
+ }
+ else
+ {
+ /* put implicit AND after an operand */
+ *operator = OP_AND;
+ }
+
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
break;
}
+
state->buf += pg_mblen(state->buf);
}
}
+static ts_tokentype
+gettoken_query_plain(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
+{
+ *weight = 0;
+ *prefix = false;
+
+ if (*state->buf == '\0')
+ return PT_END;
+
+ *strval = state->buf;
+ *lenval = strlen(state->buf);
+ state->buf += *lenval;
+ state->count++;
+ return PT_VAL;
+}
+
/*
* Push an operator to state->polstr
*/
@@ -489,7 +670,9 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = state->gettoken(state, &operator,
+ &lenval, &strval,
+ &weight, &prefix)) != PT_END)
{
switch (type)
{
@@ -605,7 +788,7 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ int flags)
{
struct TSQueryParserStateData state;
int i;
@@ -613,17 +796,38 @@ parse_tsquery(char *buf,
int commonlen;
QueryItem *ptr;
ListCell *cell;
- bool needcleanup;
+ bool needcleanup,
+ is_plain,
+ is_web;
+ int tsv_flags = P_TSV_OPR_IS_DELIM | P_TSV_IS_TSQUERY;
+
+ is_plain = (flags & P_TSQ_PLAIN) != 0;
+ is_web = (flags & P_TSQ_WEB) != 0;
+
+ /* plain should not be used with web */
+ Assert(!(is_plain && is_web));
+
+ if (is_web)
+ tsv_flags |= P_TSV_IS_WEB;
+
+ /* select suitable tokenizer */
+ if (is_plain)
+ state.gettoken = gettoken_query_plain;
+ else if (is_web)
+ state.gettoken = gettoken_query_websearch;
+ else
+ state.gettoken = gettoken_query_standard;
/* init state */
state.buffer = buf;
state.buf = buf;
- state.state = (isplain) ? WAITSINGLEOPERAND : WAITFIRSTOPERAND;
state.count = 0;
+ state.in_quotes = false;
+ state.state = WAITFIRSTOPERAND;
state.polstr = NIL;
/* init value parser's state */
- state.valstate = init_tsvector_parser(state.buffer, true, true);
+ state.valstate = init_tsvector_parser(state.buffer, tsv_flags);
/* init list of operand */
state.sumlen = 0;
@@ -716,7 +920,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), 0));
}
/*
diff --git a/src/backend/utils/adt/tsvector.c b/src/backend/utils/adt/tsvector.c
index 64e02ef434..7a27bd12a3 100644
--- a/src/backend/utils/adt/tsvector.c
+++ b/src/backend/utils/adt/tsvector.c
@@ -200,7 +200,7 @@ tsvectorin(PG_FUNCTION_ARGS)
char *cur;
int buflen = 256; /* allocated size of tmpbuf */
- state = init_tsvector_parser(buf, false, false);
+ state = init_tsvector_parser(buf, 0);
arrlen = 64;
arr = (WordEntryIN *) palloc(sizeof(WordEntryIN) * arrlen);
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 7367ba6a40..fed411a842 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -33,6 +33,7 @@ struct TSVectorParseStateData
int eml; /* max bytes per character */
bool oprisdelim; /* treat ! | * ( ) as delimiters? */
bool is_tsquery; /* say "tsquery" not "tsvector" in errors? */
+ bool is_web; /* we're in websearch_to_tsquery() */
};
@@ -42,7 +43,7 @@ struct TSVectorParseStateData
* ! | & ( )
*/
TSVectorParseState
-init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
+init_tsvector_parser(char *input, int flags)
{
TSVectorParseState state;
@@ -52,8 +53,9 @@ init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
state->len = 32;
state->word = (char *) palloc(state->len);
state->eml = pg_database_encoding_max_length();
- state->oprisdelim = oprisdelim;
- state->is_tsquery = is_tsquery;
+ state->oprisdelim = (flags & P_TSV_OPR_IS_DELIM) != 0;
+ state->is_tsquery = (flags & P_TSV_IS_TSQUERY) != 0;
+ state->is_web = (flags & P_TSV_IS_WEB) != 0;
return state;
}
@@ -89,16 +91,6 @@ do { \
} \
} while (0)
-/* phrase operator begins with '<' */
-#define ISOPERATOR(x) \
- ( pg_mblen(x) == 1 && ( *(x) == '!' || \
- *(x) == '&' || \
- *(x) == '|' || \
- *(x) == '(' || \
- *(x) == ')' || \
- *(x) == '<' \
- ) )
-
/* Fills gettoken_tsvector's output parameters, and returns true */
#define RETURN_TOKEN \
do { \
@@ -183,14 +175,15 @@ gettoken_tsvector(TSVectorParseState state,
{
if (*(state->prsbuf) == '\0')
return false;
- else if (t_iseq(state->prsbuf, '\''))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\''))
statecode = WAITENDCMPLX;
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
- else if (state->oprisdelim && ISOPERATOR(state->prsbuf))
+ else if ((state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
PRSSYNTAXERROR;
else if (!t_isspace(state->prsbuf))
{
@@ -217,13 +210,14 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDWORD)
{
- if (t_iseq(state->prsbuf, '\\'))
+ if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
else if (t_isspace(state->prsbuf) || *(state->prsbuf) == '\0' ||
- (state->oprisdelim && ISOPERATOR(state->prsbuf)))
+ (state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
{
RESIZEPRSBUF;
if (curpos == state->word)
@@ -250,11 +244,11 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
statecode = WAITCHARCMPLX;
}
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDCMPLX;
@@ -270,7 +264,7 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITCHARCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
RESIZEPRSBUF;
COPYCHAR(curpos, state->prsbuf);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 90d994c71a..560416636b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4966,6 +4966,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4974,6 +4976,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..73e969fe9c 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -25,9 +25,11 @@
struct TSVectorParseStateData; /* opaque struct in tsvector_parser.c */
typedef struct TSVectorParseStateData *TSVectorParseState;
-extern TSVectorParseState init_tsvector_parser(char *input,
- bool oprisdelim,
- bool is_tsquery);
+#define P_TSV_OPR_IS_DELIM (1 << 0)
+#define P_TSV_IS_TSQUERY (1 << 1)
+#define P_TSV_IS_WEB (1 << 2)
+
+extern TSVectorParseState init_tsvector_parser(char *input, int flags);
extern void reset_tsvector_parser(TSVectorParseState state, char *input);
extern bool gettoken_tsvector(TSVectorParseState state,
char **token, int *len,
@@ -35,6 +37,16 @@ extern bool gettoken_tsvector(TSVectorParseState state,
char **endptr);
extern void close_tsvector_parser(TSVectorParseState state);
+/* phrase operator begins with '<' */
+#define ISOPERATOR(x) \
+ ( pg_mblen(x) == 1 && ( *(x) == '!' || \
+ *(x) == '&' || \
+ *(x) == '|' || \
+ *(x) == '(' || \
+ *(x) == ')' || \
+ *(x) == '<' \
+ ) )
+
/* parse_tsquery */
struct TSQueryParserStateData; /* private in backend/utils/adt/tsquery.c */
@@ -46,9 +58,13 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
* QueryOperand struct */
bool prefix);
+#define P_TSQ_PLAIN (1 << 0)
+#define P_TSQ_WEB (1 << 1)
+
extern TSQuery parse_tsquery(char *buf,
- PushFunction pushval,
- Datum opaque, bool isplain);
+ PushFunction pushval,
+ Datum opaque,
+ int flags);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..37825a1130 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,432 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+ websearch_to_tsquery
+---------------------------------------------
+ 'i' & 'have' & 'a' & 'fat' & 'abcd' & 'cat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+ websearch_to_tsquery
+-----------------------
+ 'orange' & 'aabbccdd'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+ websearch_to_tsquery
+-----------------------------------------
+ 'fat' & 'a' & 'cat' & 'b' & 'rat' & 'c'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & 'a' & 'cat' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'a:::b');
+ websearch_to_tsquery
+----------------------
+ 'a' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:d');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'd'
+(1 row)
+
+select websearch_to_tsquery('simple', ':');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc | def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc <-> def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+ websearch_to_tsquery
+------------------------
+ 'abc' & 'pg' | 'class'
+(1 row)
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+-----------------------
+ 'fat' & 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' | 'fat' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc orтест');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orтест'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or'
+(1 row)
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'pg_class pg"');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg"');
+ websearch_to_tsquery
+-----------------------------
+ ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+ websearch_to_tsquery
+-------------------------------------
+ 'abc' & ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg" def');
+ websearch_to_tsquery
+-------------------------------------
+ ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+ websearch_to_tsquery
+------------------------------------------------------
+ 'abc' & 'pg' <-> ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+ websearch_to_tsquery
+--------------------------------------
+ 'pg' <-> ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+ websearch_to_tsquery
+------------------------------
+ 'pg' & 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc """"" def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+ websearch_to_tsquery
+------------------------------
+ 'cat' & !( 'fat' <-> 'rat' )
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+ websearch_to_tsquery
+----------------------------------------
+ 'cat' & !( 'fat' <-> 'rat' ) & 'chees'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def -"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def :"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' & 'eaten' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+ websearch_to_tsquery
+----------------------------------------
+ !'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+------------------------
+ 'old' & 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('''abc''''def''');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('\abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('\');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..f299a5d32b 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,98 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+select websearch_to_tsquery('simple', 'abc:def');
+select websearch_to_tsquery('simple', 'a:::b');
+select websearch_to_tsquery('simple', 'abc:d');
+select websearch_to_tsquery('simple', ':');
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+select websearch_to_tsquery('simple', 'abc | def');
+select websearch_to_tsquery('simple', 'abc <-> def');
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'or OR or');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc orтест');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+select websearch_to_tsquery('simple', 'abc or');
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+select websearch_to_tsquery('english', 'pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg"');
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg" def');
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+select websearch_to_tsquery('english', 'abc """"" def');
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+select websearch_to_tsquery('english', 'abc "def -"');
+select websearch_to_tsquery('english', 'abc "def :"');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+select websearch_to_tsquery('''abc''''def''');
+select websearch_to_tsquery('\abc');
+select websearch_to_tsquery('\');
On Tue, 03 Apr 2018 14:28:37 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
I'm sorry, I totally forgot to fix a few more things, the patch is
attached below.
The patch looks good to me except for two things.
I'm not sure about the different result for these queries:
SELECT websearch_to_tsquery('simple', 'cat or ');
websearch_to_tsquery
----------------------
'cat'
(1 row)
SELECT websearch_to_tsquery('simple', 'cat or');
websearch_to_tsquery
----------------------
'cat' & 'or'
(1 row)
But I don't have a strong opinion about these queries, since the input in
both of them looks broken in terms of operator usage.
I found odd behavior in the query creation function in this case:
SELECT websearch_to_tsquery('english', '"pg_class pg"');
websearch_to_tsquery
-----------------------------
( 'pg' & 'class' ) <-> 'pg'
(1 row)
This query means that the lexemes 'pg' and 'class' should be at the same
distance from the last 'pg'. In other words, they should have the same
position. But the default parser interprets pg_class as two separate
words during text parsing/vector creation.
The bug wasn't introduced by the patch and can be found in current
master. While discussing the patch with me, Dmitry noticed that the
to_tsquery() function shares the same odd behavior:
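To make the position argument concrete, here is a toy model in C of the two readings. It is only a sketch: the helper names are made up for illustration, this is not PostgreSQL's phrase executor, and it assumes that 'pg_class pg' is indexed as pg:1, class:2, pg:3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if pos occurs in a lexeme's position list. */
static bool
has_pos(const int *list, size_t n, int pos)
{
	size_t		i;

	assert(list != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (list[i] == pos)
			return true;
	return false;
}

/*
 * ( a & b ) <-> c, read as above: a and b must share some position p,
 * and c must occur at p + 1.
 */
static bool
match_and_then_next(const int *a, size_t na,
					const int *b, size_t nb,
					const int *c, size_t nc)
{
	size_t		i;

	for (i = 0; i < na; i++)
		if (has_pos(b, nb, a[i]) && has_pos(c, nc, a[i] + 1))
			return true;
	return false;
}

/* a <-> b <-> c: three consecutive positions p, p + 1, p + 2. */
static bool
match_chain(const int *a, size_t na,
			const int *b, size_t nb,
			const int *c, size_t nc)
{
	size_t		i;

	for (i = 0; i < na; i++)
		if (has_pos(b, nb, a[i] + 1) && has_pos(c, nc, a[i] + 2))
			return true;
	return false;
}
```

With the position lists pg = {1, 3} and class = {2}, match_and_then_next() finds no shared position and fails, while match_chain() succeeds at p = 1.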
select to_tsquery('english', ' pg_class <-> pg');
to_tsquery
-----------------------------
( 'pg' & 'class' ) <-> 'pg'
(1 row)
This oddity is caused by the implementation of makepol. In makepol, each
token (parsed by the query parser) is sent to the FTS parser, and if the
token is split further, its parts are joined with the operator selected in
to_tsquery and friends. So the operator doesn't change at runtime.
I see two different ways to solve it:
1) Use the same operator inside the parentheses. This would mean
parsing it as several parts of one word.
2) Remove the parentheses. This would mean parsing it as several separate
words.
I prefer the second way, since in some languages words can be separated
by some special symbol or not separated by any symbols at all, and
should be extracted by a special FTS parser. It also allows us to parse
such words as one by using the special parser (as is done for hyphenated
words).
But in the example with websearch_to_tsquery, I think it should use
the same operator for the quoted part of the query. For example, we can
update the operator in makepol before sending it to pushval
(pushval_morph) to do so.
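The effect of a fixed join operator can be imitated with a small C sketch. It is a toy model only: render_token() and render_phrase() are hypothetical names, splitting on '_' stands in for the FTS parser, and none of this is the actual makepol or pushval_morph code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split one query token on '_' and join the parts with split_op,
 * adding parentheses when the token produced more than one part.
 */
static void
render_token(const char *token, const char *split_op,
			 char *out, size_t outsz)
{
	char		copy[64];
	char		body[128] = "";
	char	   *part;
	int			nparts = 0;

	assert(strlen(token) < sizeof(copy));
	snprintf(copy, sizeof(copy), "%s", token);

	for (part = strtok(copy, "_"); part != NULL; part = strtok(NULL, "_"))
	{
		size_t		used = strlen(body);

		snprintf(body + used, sizeof(body) - used, "%s'%s'",
				 nparts > 0 ? split_op : "", part);
		nparts++;
	}

	if (nparts > 1)
		snprintf(out, outsz, "( %s )", body);
	else
		snprintf(out, outsz, "%s", body);
}

/* Render "first second" as a quoted phrase: tokens are joined by <->. */
static void
render_phrase(const char *first, const char *second, const char *split_op,
			  char *out, size_t outsz)
{
	char		lhs[160];
	char		rhs[160];

	render_token(first, split_op, lhs, sizeof(lhs));
	render_token(second, split_op, rhs, sizeof(rhs));
	snprintf(out, outsz, "%s <-> %s", lhs, rhs);
}
```

Calling render_phrase("pg_class", "pg", " & ", ...) reproduces the mixed form shown above, while passing " <-> " as the split operator keeps a single operator throughout the quoted part.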
It looks like there should be two separate patches, one for
websearch_to_tsquery and another one for fixing the odd behavior of
query construction. But since the first one may depend on the
bugfix to solve the case with quotes, I will mark it as Waiting on
Author.
--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
I'm not sure about the different result for these queries:
SELECT websearch_to_tsquery('simple', 'cat or ');
websearch_to_tsquery
----------------------
'cat'
(1 row)
SELECT websearch_to_tsquery('simple', 'cat or');
websearch_to_tsquery
----------------------
'cat' & 'or'
(1 row)
I guess both queries should produce just 'cat'. I've changed the
definition of parse_or_operator().
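The rule for recognizing OR can be sketched in standalone C as follows. This is an ASCII-only approximation under stated assumptions: match_or_keyword() is a hypothetical helper, and the patch's parse_or_operator() uses the multibyte-aware t_isalpha()/t_isdigit() checks, so words such as 'orтест' are handled there but not here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the number of bytes consumed (2) when buf starts with a
 * standalone OR keyword, or 0 when it does not.
 */
static size_t
match_or_keyword(const char *buf, bool in_quotes)
{
	unsigned char next;

	assert(buf != NULL);

	/* inside a quoted phrase, "or" is always an ordinary word */
	if (in_quotes)
		return 0;

	/* case-insensitive match of the two-letter keyword */
	if (tolower((unsigned char) buf[0]) != 'o' ||
		tolower((unsigned char) buf[1]) != 'r')
		return 0;

	/* the keyword must not be the beginning of a longer word */
	next = (unsigned char) buf[2];
	if (next == '-' || next == '_' || isalnum(next))
		return 0;

	return 2;
}
```

Under this rule a trailing "or" is consumed as an operator whether or not a space follows it, so 'cat or' and 'cat or ' are tokenized the same way.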
I found an odd behavior of the query creation function in case:
SELECT websearch_to_tsquery('english', '"pg_class pg"');
websearch_to_tsquery
-----------------------------
( 'pg' & 'class' ) <-> 'pg'
(1 row)
This query means that lexemes 'pg' and 'class' should be at the same
distance from the last 'pg'. In other words, they should have the same
position. But default parser will interpret pg_class as two separate
words during text parsing/vector creation.
The bug wasn't introduced in the patch and can be found in current
master. During the discussion of the patch with Dmitry, he noticed that
to_tsquery() function shares same odd behavior:
select to_tsquery('english', ' pg_class <-> pg');
to_tsquery
-----------------------------
( 'pg' & 'class' ) <-> 'pg'
(1 row)
I've been thinking about this for a while, and it seems that this should
be fixed somewhere near parsetext(). Perhaps 'pg' and 'class' should
share the same position. After all, somebody could implement a parser
which would split some words using an arbitrary set of rules, for
instance "split all words containing digits". I propose merging this
patch provided that there are no objections.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
websearch_to_tsquery_v8.patch (text/x-diff)
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9a1efc14cf..122f034f17 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9630,6 +9630,18 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
<entry><literal>phraseto_tsquery('english', 'The Fat Rats')</literal></entry>
<entry><literal>'fat' <-> 'rat'</literal></entry>
</row>
+ <row>
+ <entry>
+ <indexterm>
+ <primary>websearch_to_tsquery</primary>
+ </indexterm>
+ <literal><function>websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type> , </optional> <replaceable class="parameter">query</replaceable> <type>text</type>)</function></literal>
+ </entry>
+ <entry><type>tsquery</type></entry>
+ <entry>produce <type>tsquery</type> from a web search style query</entry>
+ <entry><literal>websearch_to_tsquery('english', '"fat rat" or rat')</literal></entry>
+ <entry><literal>'fat' <-> 'rat' | 'rat'</literal></entry>
+ </row>
<row>
<entry>
<indexterm>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 610b7bf033..19f58511c8 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -797,13 +797,16 @@ UPDATE tt SET ti =
<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
- <function>plainto_tsquery</function>, and
- <function>phraseto_tsquery</function>
+ <function>plainto_tsquery</function>,
+ <function>phraseto_tsquery</function> and
+ <function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
- <function>phraseto_tsquery</function>, but it is less forgiving
- about its input.
+ <function>phraseto_tsquery</function>, but it is less forgiving about its
+ input. <function>websearch_to_tsquery</function> is a simplified version
+ of <function>to_tsquery</function> with an alternative syntax, similar
+ to the one used by web search engines.
</para>
<indexterm>
@@ -962,6 +965,87 @@ SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
</screen>
</para>
+<synopsis>
+websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
+</synopsis>
+
+ <para>
+ <function>websearch_to_tsquery</function> creates a <type>tsquery</type>
+ value from <replaceable>querytext</replaceable> using an alternative
+ syntax in which simple unformatted text is a valid query.
+ Unlike <function>plainto_tsquery</function>
+ and <function>phraseto_tsquery</function>, it also recognizes certain
+ operators. Moreover, this function should never raise syntax errors,
+ which makes it possible to use raw user-supplied input for search.
+ The following syntax is supported:
+ <itemizedlist spacing="compact" mark="bullet">
+ <listitem>
+ <para>
+ <literal>unquoted text</literal>: text not inside quote marks will be
+ converted to terms separated by <literal>&</literal> operators, as
+ if processed by
+ <function>plainto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>"quoted text"</literal>: text inside quote marks will be
+ converted to terms separated by <literal><-></literal>
+ operators, as if processed by <function>phraseto_tsquery</function>.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>OR</literal>: logical or will be converted to
+ the <literal>|</literal> operator.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>-</literal>: the logical not operator, converted to the
+ <literal>!</literal> operator.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Examples:
+ <screen>
+ select websearch_to_tsquery('english', 'The fat rats');
+ websearch_to_tsquery
+ -----------------
+ 'fat' & 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"supernovae stars" -crab');
+ websearch_to_tsquery
+ ----------------------------------
+ 'supernova' <-> 'star' & !'crab'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '"sad cat" or "fat rat"');
+ websearch_to_tsquery
+ -----------------------------------
+ 'sad' <-> 'cat' | 'fat' <-> 'rat'
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', 'signal -"segmentation fault"');
+ websearch_to_tsquery
+ ---------------------------------------
+ 'signal' & !( 'segment' <-> 'fault' )
+ (1 row)
+ </screen>
+ <screen>
+ select websearch_to_tsquery('english', '""" )( dummy \\ query <->');
+ websearch_to_tsquery
+ ----------------------
+ 'dummi' & 'queri'
+ (1 row)
+ </screen>
+ </para>
</sect2>
<sect2 id="textsearch-ranking">
diff --git a/src/backend/tsearch/to_tsany.c b/src/backend/tsearch/to_tsany.c
index ea5947a3a8..6055fb6b4e 100644
--- a/src/backend/tsearch/to_tsany.c
+++ b/src/backend/tsearch/to_tsany.c
@@ -490,7 +490,7 @@ to_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- false);
+ 0);
PG_RETURN_TSQUERY(query);
}
@@ -520,7 +520,7 @@ plainto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_POINTER(query);
}
@@ -551,7 +551,7 @@ phraseto_tsquery_byid(PG_FUNCTION_ARGS)
query = parse_tsquery(text_to_cstring(in),
pushval_morph,
PointerGetDatum(&data),
- true);
+ P_TSQ_PLAIN);
PG_RETURN_TSQUERY(query);
}
@@ -567,3 +567,35 @@ phraseto_tsquery(PG_FUNCTION_ARGS)
ObjectIdGetDatum(cfgId),
PointerGetDatum(in)));
}
+
+Datum
+websearch_to_tsquery_byid(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(1);
+ MorphOpaque data;
+ TSQuery query = NULL;
+
+ data.cfg_id = PG_GETARG_OID(0);
+
+ data.qoperator = OP_AND;
+
+ query = parse_tsquery(text_to_cstring(in),
+ pushval_morph,
+ PointerGetDatum(&data),
+ P_TSQ_WEB);
+
+ PG_RETURN_TSQUERY(query);
+}
+
+Datum
+websearch_to_tsquery(PG_FUNCTION_ARGS)
+{
+ text *in = PG_GETARG_TEXT_PP(0);
+ Oid cfgId;
+
+ cfgId = getTSCurrentConfig(true);
+ PG_RETURN_DATUM(DirectFunctionCall2(websearch_to_tsquery_byid,
+ ObjectIdGetDatum(cfgId),
+ PointerGetDatum(in)));
+
+}
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 1ccbf79030..e9139bef99 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -32,14 +32,53 @@ const int tsearch_op_priority[OP_COUNT] =
3 /* OP_PHRASE */
};
+/*
+ * parser's states
+ */
+typedef enum
+{
+ WAITOPERAND = 1,
+ WAITOPERATOR = 2,
+ WAITFIRSTOPERAND = 3
+} ts_parserstate;
+
+/*
+ * token types for parsing
+ */
+typedef enum
+{
+ PT_END = 0,
+ PT_ERR = 1,
+ PT_VAL = 2,
+ PT_OPR = 3,
+ PT_OPEN = 4,
+ PT_CLOSE = 5
+} ts_tokentype;
+
+/*
+ * get token from query string
+ *
+ * *operator is filled in with OP_* when return values is PT_OPR,
+ * but *weight could contain a distance value in case of phrase operator.
+ * *strval, *lenval and *weight are filled in when return value is PT_VAL
+ *
+ */
+typedef ts_tokentype (*ts_tokenizer)(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix);
+
struct TSQueryParserStateData
{
- /* State for gettoken_query */
+ /* Tokenizer used for parsing tsquery */
+ ts_tokenizer gettoken;
+
+ /* State of tokenizer function */
char *buffer; /* entire string we are scanning */
char *buf; /* current scan point */
- int state;
int count; /* nesting count, incremented by (,
* decremented by ) */
+ bool in_quotes; /* phrase in quotes "" */
+ ts_parserstate state;
/* polish (prefix) notation in list, filled in by push* functions */
List *polstr;
@@ -57,12 +96,6 @@ struct TSQueryParserStateData
TSVectorParseState valstate;
};
-/* parser's states */
-#define WAITOPERAND 1
-#define WAITOPERATOR 2
-#define WAITFIRSTOPERAND 3
-#define WAITSINGLEOPERAND 4
-
/*
* subroutine to parse the modifiers (weight and prefix flag currently)
* part, like ':AB*' of a query.
@@ -118,18 +151,17 @@ get_modifiers(char *buf, int16 *weight, bool *prefix)
*
* The buffer should begin with '<' char
*/
-static char *
-parse_phrase_operator(char *buf, int16 *distance)
+static bool
+parse_phrase_operator(TSQueryParserState pstate, int16 *distance)
{
enum
{
PHRASE_OPEN = 0,
PHRASE_DIST,
PHRASE_CLOSE,
- PHRASE_ERR,
PHRASE_FINISH
} state = PHRASE_OPEN;
- char *ptr = buf;
+ char *ptr = pstate->buf;
char *endptr;
long l = 1; /* default distance */
@@ -138,9 +170,13 @@ parse_phrase_operator(char *buf, int16 *distance)
switch (state)
{
case PHRASE_OPEN:
- Assert(t_iseq(ptr, '<'));
- state = PHRASE_DIST;
- ptr++;
+ if (t_iseq(ptr, '<'))
+ {
+ state = PHRASE_DIST;
+ ptr++;
+ }
+ else
+ return false;
break;
case PHRASE_DIST:
@@ -148,18 +184,16 @@ parse_phrase_operator(char *buf, int16 *distance)
{
state = PHRASE_CLOSE;
ptr++;
- break;
+ continue;
}
+
if (!t_isdigit(ptr))
- {
- state = PHRASE_ERR;
- break;
- }
+ return false;
errno = 0;
l = strtol(ptr, &endptr, 10);
if (ptr == endptr)
- state = PHRASE_ERR;
+ return false;
else if (errno == ERANGE || l < 0 || l > MAXENTRYPOS)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -179,54 +213,56 @@ parse_phrase_operator(char *buf, int16 *distance)
ptr++;
}
else
- state = PHRASE_ERR;
+ return false;
break;
case PHRASE_FINISH:
*distance = (int16) l;
- return ptr;
-
- case PHRASE_ERR:
- default:
- goto err;
+ pstate->buf = ptr;
+ return true;
}
}
-err:
- *distance = -1;
- return buf;
+ return false;
}
/*
- * token types for parsing
+ * Parse OR operator used in websearch_to_tsquery().
*/
-typedef enum
+static bool
+parse_or_operator(TSQueryParserState pstate)
{
- PT_END = 0,
- PT_ERR = 1,
- PT_VAL = 2,
- PT_OPR = 3,
- PT_OPEN = 4,
- PT_CLOSE = 5
-} ts_tokentype;
+ char *ptr = pstate->buf;
+
+ if (pstate->in_quotes)
+ return false;
+
+ /* it should begin with "OR" literal */
+ if (pg_strncasecmp(ptr, "or", 2) != 0)
+ return false;
+
+ ptr += 2;
+
+ /* it shouldn't be a part of any word */
+ if (t_iseq(ptr, '-') ||
+ t_iseq(ptr, '_') ||
+ t_isalpha(ptr) ||
+ t_isdigit(ptr))
+ return false;
+
+ pstate->buf = ptr;
+ return true;
+}
-/*
- * get token from query string
- *
- * *operator is filled in with OP_* when return values is PT_OPR,
- * but *weight could contain a distance value in case of phrase operator.
- * *strval, *lenval and *weight are filled in when return value is PT_VAL
- *
- */
static ts_tokentype
-gettoken_query(TSQueryParserState state,
- int8 *operator,
- int *lenval, char **strval, int16 *weight, bool *prefix)
+gettoken_query_standard(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
{
*weight = 0;
*prefix = false;
- while (1)
+ while (true)
{
switch (state->state)
{
@@ -234,17 +270,16 @@ gettoken_query(TSQueryParserState state,
case WAITOPERAND:
if (t_iseq(state->buf, '!'))
{
- (state->buf)++; /* can safely ++, t_iseq guarantee that
- * pg_mblen()==1 */
- *operator = OP_NOT;
+ state->buf++;
state->state = WAITOPERAND;
+ *operator = OP_NOT;
return PT_OPR;
}
else if (t_iseq(state->buf, '('))
{
- state->count++;
- (state->buf)++;
+ state->buf++;
state->state = WAITOPERAND;
+ state->count++;
return PT_OPEN;
}
else if (t_iseq(state->buf, ':'))
@@ -256,10 +291,7 @@ gettoken_query(TSQueryParserState state,
}
else if (!t_isspace(state->buf))
{
- /*
- * We rely on the tsvector parser to parse the value for
- * us
- */
+ /* We rely on the tsvector parser to parse the value for us */
reset_tsvector_parser(state->valstate, state->buf);
if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
{
@@ -268,7 +300,9 @@ gettoken_query(TSQueryParserState state,
return PT_VAL;
}
else if (state->state == WAITFIRSTOPERAND)
+ {
return PT_END;
+ }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -276,58 +310,205 @@ gettoken_query(TSQueryParserState state,
state->buffer)));
}
break;
+
case WAITOPERATOR:
if (t_iseq(state->buf, '&'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_AND;
- (state->buf)++;
return PT_OPR;
}
else if (t_iseq(state->buf, '|'))
{
+ state->buf++;
state->state = WAITOPERAND;
*operator = OP_OR;
- (state->buf)++;
return PT_OPR;
}
- else if (t_iseq(state->buf, '<'))
+ else if (parse_phrase_operator(state, weight))
{
+ /* weight var is used as storage for distance */
state->state = WAITOPERAND;
*operator = OP_PHRASE;
- /* weight var is used as storage for distance */
- state->buf = parse_phrase_operator(state->buf, weight);
- if (*weight < 0)
- return PT_ERR;
return PT_OPR;
}
else if (t_iseq(state->buf, ')'))
{
- (state->buf)++;
+ state->buf++;
state->count--;
return (state->count < 0) ? PT_ERR : PT_CLOSE;
}
- else if (*(state->buf) == '\0')
+ else if (*state->buf == '\0')
+ {
return (state->count) ? PT_ERR : PT_END;
+ }
else if (!t_isspace(state->buf))
+ {
return PT_ERR;
+ }
break;
- case WAITSINGLEOPERAND:
- if (*(state->buf) == '\0')
+ }
+
+ state->buf += pg_mblen(state->buf);
+ }
+}
+
+static ts_tokentype
+gettoken_query_websearch(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
+{
+ *weight = 0;
+ *prefix = false;
+
+ while (true)
+ {
+ switch (state->state)
+ {
+ case WAITFIRSTOPERAND:
+ case WAITOPERAND:
+ if (t_iseq(state->buf, '-'))
+ {
+ state->buf++;
+ state->state = WAITOPERAND;
+
+ if (state->in_quotes)
+ continue;
+
+ *operator = OP_NOT;
+ return PT_OPR;
+ }
+ else if (t_iseq(state->buf, '"'))
+ {
+ state->buf++;
+
+ if (!state->in_quotes)
+ {
+ state->state = WAITOPERAND;
+
+ if (strchr(state->buf, '"'))
+ {
+ /* quoted text should be ordered <-> */
+ state->in_quotes = true;
+ return PT_OPEN;
+ }
+
+ /* web search tolerates missing quotes */
+ continue;
+ }
+ else
+ {
+ /* we have to provide an operand */
+ state->in_quotes = false;
+ state->state = WAITOPERATOR;
+ pushStop(state);
+ return PT_CLOSE;
+ }
+ }
+ else if (ISOPERATOR(state->buf))
+ {
+ /* or else gettoken_tsvector() will raise an error */
+ state->buf++;
+ state->state = WAITOPERAND;
+ continue;
+ }
+ else if (!t_isspace(state->buf))
+ {
+ /* We rely on the tsvector parser to parse the value for us */
+ reset_tsvector_parser(state->valstate, state->buf);
+ if (gettoken_tsvector(state->valstate, strval, lenval, NULL, NULL, &state->buf))
+ {
+ state->state = WAITOPERATOR;
+ return PT_VAL;
+ }
+ else if (state->state == WAITFIRSTOPERAND)
+ {
+ return PT_END;
+ }
+ else
+ {
+ /* finally, we have to provide an operand */
+ pushStop(state);
+ return PT_END;
+ }
+ }
+ break;
+
+ case WAITOPERATOR:
+ if (t_iseq(state->buf, '"'))
+ {
+ if (!state->in_quotes)
+ {
+ /*
+ * put implicit AND after an operand
+ * and handle this quote in WAITOPERAND
+ */
+ state->state = WAITOPERAND;
+ *operator = OP_AND;
+ return PT_OPR;
+ }
+ else
+ {
+ state->buf++;
+
+ /* just close quotes */
+ state->in_quotes = false;
+ return PT_CLOSE;
+ }
+ }
+ else if (parse_or_operator(state))
+ {
+ state->state = WAITOPERAND;
+ *operator = OP_OR;
+ return PT_OPR;
+ }
+ else if (*state->buf == '\0')
+ {
return PT_END;
- *strval = state->buf;
- *lenval = strlen(state->buf);
- state->buf += strlen(state->buf);
- state->count++;
- return PT_VAL;
- default:
- return PT_ERR;
+ }
+ else if (!t_isspace(state->buf))
+ {
+ if (state->in_quotes)
+ {
+ /* put implicit <-> after an operand */
+ *operator = OP_PHRASE;
+ *weight = 1;
+ }
+ else
+ {
+ /* put implicit AND after an operand */
+ *operator = OP_AND;
+ }
+
+ state->state = WAITOPERAND;
+ return PT_OPR;
+ }
break;
}
+
state->buf += pg_mblen(state->buf);
}
}
+static ts_tokentype
+gettoken_query_plain(TSQueryParserState state, int8 *operator,
+ int *lenval, char **strval,
+ int16 *weight, bool *prefix)
+{
+ *weight = 0;
+ *prefix = false;
+
+ if (*state->buf == '\0')
+ return PT_END;
+
+ *strval = state->buf;
+ *lenval = strlen(state->buf);
+ state->buf += *lenval;
+ state->count++;
+ return PT_VAL;
+}
+
/*
* Push an operator to state->polstr
*/
@@ -489,7 +670,9 @@ makepol(TSQueryParserState state,
/* since this function recurses, it could be driven to stack overflow */
check_stack_depth();
- while ((type = gettoken_query(state, &operator, &lenval, &strval, &weight, &prefix)) != PT_END)
+ while ((type = state->gettoken(state, &operator,
+ &lenval, &strval,
+ &weight, &prefix)) != PT_END)
{
switch (type)
{
@@ -605,7 +788,7 @@ TSQuery
parse_tsquery(char *buf,
PushFunction pushval,
Datum opaque,
- bool isplain)
+ int flags)
{
struct TSQueryParserStateData state;
int i;
@@ -613,17 +796,38 @@ parse_tsquery(char *buf,
int commonlen;
QueryItem *ptr;
ListCell *cell;
- bool needcleanup;
+ bool needcleanup,
+ is_plain,
+ is_web;
+ int tsv_flags = P_TSV_OPR_IS_DELIM | P_TSV_IS_TSQUERY;
+
+ is_plain = (flags & P_TSQ_PLAIN) != 0;
+ is_web = (flags & P_TSQ_WEB) != 0;
+
+ /* plain should not be used with web */
+ Assert(!(is_plain && is_web));
+
+ if (is_web)
+ tsv_flags |= P_TSV_IS_WEB;
+
+ /* select suitable tokenizer */
+ if (is_plain)
+ state.gettoken = gettoken_query_plain;
+ else if (is_web)
+ state.gettoken = gettoken_query_websearch;
+ else
+ state.gettoken = gettoken_query_standard;
/* init state */
state.buffer = buf;
state.buf = buf;
- state.state = (isplain) ? WAITSINGLEOPERAND : WAITFIRSTOPERAND;
state.count = 0;
+ state.in_quotes = false;
+ state.state = WAITFIRSTOPERAND;
state.polstr = NIL;
/* init value parser's state */
- state.valstate = init_tsvector_parser(state.buffer, true, true);
+ state.valstate = init_tsvector_parser(state.buffer, tsv_flags);
/* init list of operand */
state.sumlen = 0;
@@ -716,7 +920,7 @@ tsqueryin(PG_FUNCTION_ARGS)
{
char *in = PG_GETARG_CSTRING(0);
- PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), false));
+ PG_RETURN_TSQUERY(parse_tsquery(in, pushval_asis, PointerGetDatum(NULL), 0));
}
/*
diff --git a/src/backend/utils/adt/tsvector.c b/src/backend/utils/adt/tsvector.c
index 64e02ef434..7a27bd12a3 100644
--- a/src/backend/utils/adt/tsvector.c
+++ b/src/backend/utils/adt/tsvector.c
@@ -200,7 +200,7 @@ tsvectorin(PG_FUNCTION_ARGS)
char *cur;
int buflen = 256; /* allocated size of tmpbuf */
- state = init_tsvector_parser(buf, false, false);
+ state = init_tsvector_parser(buf, 0);
arrlen = 64;
arr = (WordEntryIN *) palloc(sizeof(WordEntryIN) * arrlen);
diff --git a/src/backend/utils/adt/tsvector_parser.c b/src/backend/utils/adt/tsvector_parser.c
index 7367ba6a40..fed411a842 100644
--- a/src/backend/utils/adt/tsvector_parser.c
+++ b/src/backend/utils/adt/tsvector_parser.c
@@ -33,6 +33,7 @@ struct TSVectorParseStateData
int eml; /* max bytes per character */
bool oprisdelim; /* treat ! | * ( ) as delimiters? */
bool is_tsquery; /* say "tsquery" not "tsvector" in errors? */
+ bool is_web; /* we're in websearch_to_tsquery() */
};
@@ -42,7 +43,7 @@ struct TSVectorParseStateData
* ! | & ( )
*/
TSVectorParseState
-init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
+init_tsvector_parser(char *input, int flags)
{
TSVectorParseState state;
@@ -52,8 +53,9 @@ init_tsvector_parser(char *input, bool oprisdelim, bool is_tsquery)
state->len = 32;
state->word = (char *) palloc(state->len);
state->eml = pg_database_encoding_max_length();
- state->oprisdelim = oprisdelim;
- state->is_tsquery = is_tsquery;
+ state->oprisdelim = (flags & P_TSV_OPR_IS_DELIM) != 0;
+ state->is_tsquery = (flags & P_TSV_IS_TSQUERY) != 0;
+ state->is_web = (flags & P_TSV_IS_WEB) != 0;
return state;
}
@@ -89,16 +91,6 @@ do { \
} \
} while (0)
-/* phrase operator begins with '<' */
-#define ISOPERATOR(x) \
- ( pg_mblen(x) == 1 && ( *(x) == '!' || \
- *(x) == '&' || \
- *(x) == '|' || \
- *(x) == '(' || \
- *(x) == ')' || \
- *(x) == '<' \
- ) )
-
/* Fills gettoken_tsvector's output parameters, and returns true */
#define RETURN_TOKEN \
do { \
@@ -183,14 +175,15 @@ gettoken_tsvector(TSVectorParseState state,
{
if (*(state->prsbuf) == '\0')
return false;
- else if (t_iseq(state->prsbuf, '\''))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\''))
statecode = WAITENDCMPLX;
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
- else if (state->oprisdelim && ISOPERATOR(state->prsbuf))
+ else if ((state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
PRSSYNTAXERROR;
else if (!t_isspace(state->prsbuf))
{
@@ -217,13 +210,14 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDWORD)
{
- if (t_iseq(state->prsbuf, '\\'))
+ if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDWORD;
}
else if (t_isspace(state->prsbuf) || *(state->prsbuf) == '\0' ||
- (state->oprisdelim && ISOPERATOR(state->prsbuf)))
+ (state->oprisdelim && ISOPERATOR(state->prsbuf)) ||
+ (state->is_web && t_iseq(state->prsbuf, '"')))
{
RESIZEPRSBUF;
if (curpos == state->word)
@@ -250,11 +244,11 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITENDCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
statecode = WAITCHARCMPLX;
}
- else if (t_iseq(state->prsbuf, '\\'))
+ else if (!state->is_web && t_iseq(state->prsbuf, '\\'))
{
statecode = WAITNEXTCHAR;
oldstate = WAITENDCMPLX;
@@ -270,7 +264,7 @@ gettoken_tsvector(TSVectorParseState state,
}
else if (statecode == WAITCHARCMPLX)
{
- if (t_iseq(state->prsbuf, '\''))
+ if (!state->is_web && t_iseq(state->prsbuf, '\''))
{
RESIZEPRSBUF;
COPYCHAR(curpos, state->prsbuf);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 9bf20c059b..edf212fcf0 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4971,6 +4971,8 @@ DATA(insert OID = 3747 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s
DESCR("transform to tsquery");
DATA(insert OID = 5006 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery_byid _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8889 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f i s 2 0 3615 "3734 25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery_byid _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 3749 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "25" _null_ _null_ _null_ _null_ _null_ to_tsvector _null_ _null_ _null_ ));
DESCR("transform to tsvector");
DATA(insert OID = 3750 ( to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ to_tsquery _null_ _null_ _null_ ));
@@ -4979,6 +4981,8 @@ DATA(insert OID = 3751 ( plainto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s
DESCR("transform to tsquery");
DATA(insert OID = 5001 ( phraseto_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ phraseto_tsquery _null_ _null_ _null_ ));
DESCR("transform to tsquery");
+DATA(insert OID = 8890 ( websearch_to_tsquery PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3615 "25" _null_ _null_ _null_ _null_ _null_ websearch_to_tsquery _null_ _null_ _null_ ));
+DESCR("transform to tsquery");
DATA(insert OID = 4209 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "3802" _null_ _null_ _null_ _null_ _null_ jsonb_to_tsvector _null_ _null_ _null_ ));
DESCR("transform jsonb to tsvector");
DATA(insert OID = 4210 ( to_tsvector PGNSP PGUID 12 100 0 0 0 f f f t f s s 1 0 3614 "114" _null_ _null_ _null_ _null_ _null_ json_to_tsvector _null_ _null_ _null_ ));
diff --git a/src/include/tsearch/ts_utils.h b/src/include/tsearch/ts_utils.h
index f8ddce5ecb..73e969fe9c 100644
--- a/src/include/tsearch/ts_utils.h
+++ b/src/include/tsearch/ts_utils.h
@@ -25,9 +25,11 @@
struct TSVectorParseStateData; /* opaque struct in tsvector_parser.c */
typedef struct TSVectorParseStateData *TSVectorParseState;
-extern TSVectorParseState init_tsvector_parser(char *input,
- bool oprisdelim,
- bool is_tsquery);
+#define P_TSV_OPR_IS_DELIM (1 << 0)
+#define P_TSV_IS_TSQUERY (1 << 1)
+#define P_TSV_IS_WEB (1 << 2)
+
+extern TSVectorParseState init_tsvector_parser(char *input, int flags);
extern void reset_tsvector_parser(TSVectorParseState state, char *input);
extern bool gettoken_tsvector(TSVectorParseState state,
char **token, int *len,
@@ -35,6 +37,16 @@ extern bool gettoken_tsvector(TSVectorParseState state,
char **endptr);
extern void close_tsvector_parser(TSVectorParseState state);
+/* phrase operator begins with '<' */
+#define ISOPERATOR(x) \
+ ( pg_mblen(x) == 1 && ( *(x) == '!' || \
+ *(x) == '&' || \
+ *(x) == '|' || \
+ *(x) == '(' || \
+ *(x) == ')' || \
+ *(x) == '<' \
+ ) )
+
/* parse_tsquery */
struct TSQueryParserStateData; /* private in backend/utils/adt/tsquery.c */
@@ -46,9 +58,13 @@ typedef void (*PushFunction) (Datum opaque, TSQueryParserState state,
* QueryOperand struct */
bool prefix);
+#define P_TSQ_PLAIN (1 << 0)
+#define P_TSQ_WEB (1 << 1)
+
extern TSQuery parse_tsquery(char *buf,
- PushFunction pushval,
- Datum opaque, bool isplain);
+ PushFunction pushval,
+ Datum opaque,
+ int flags);
/* Functions for use by PushFunction implementations */
extern void pushValue(TSQueryParserState state,
diff --git a/src/test/regress/expected/tsearch.out b/src/test/regress/expected/tsearch.out
index d63fb12f1d..3f67691270 100644
--- a/src/test/regress/expected/tsearch.out
+++ b/src/test/regress/expected/tsearch.out
@@ -1672,3 +1672,426 @@ select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat ca
(1 row)
set enable_seqscan = on;
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+ websearch_to_tsquery
+---------------------------------------------
+ 'i' & 'have' & 'a' & 'fat' & 'abcd' & 'cat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+ websearch_to_tsquery
+-----------------------
+ 'orange' & 'aabbccdd'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+ websearch_to_tsquery
+-----------------------------------------
+ 'fat' & 'a' & 'cat' & 'b' & 'rat' & 'c'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+ websearch_to_tsquery
+---------------------------
+ 'fat' & 'a' & 'cat' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat*rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat-rat');
+ websearch_to_tsquery
+---------------------------
+ 'fat-rat' & 'fat' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat_rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' & 'rat'
+(1 row)
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'a:::b');
+ websearch_to_tsquery
+----------------------
+ 'a' & 'b'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc:d');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'd'
+(1 row)
+
+select websearch_to_tsquery('simple', ':');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc | def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc <-> def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+ websearch_to_tsquery
+------------------------
+ 'abc' & 'pg' | 'class'
+(1 row)
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+ websearch_to_tsquery
+-------------------------------
+ 'brand' & 'new' & 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+ websearch_to_tsquery
+---------------------------------
+ 'brand' & 'new' <-> 'smartphon'
+(1 row)
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+ websearch_to_tsquery
+----------------------
+ 'cat' & 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'cat OR');
+ websearch_to_tsquery
+----------------------
+ 'cat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'OR rat');
+ websearch_to_tsquery
+----------------------
+ 'or' & 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' <-> 'or' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+ websearch_to_tsquery
+-----------------------
+ 'fat' & 'cat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'or OR or');
+ websearch_to_tsquery
+----------------------
+ 'or' | 'or'
+(1 row)
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' | 'fat' <-> 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or(rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or)rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or&rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or|rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or!rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or<rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or>rat');
+ websearch_to_tsquery
+----------------------
+ 'fat' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('simple', 'fat or ');
+ websearch_to_tsquery
+----------------------
+ 'fat'
+(1 row)
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orange'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc orтест');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'orтест'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR1234');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or1234'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc or-abc');
+ websearch_to_tsquery
+---------------------------------
+ 'abc' & 'or-abc' & 'or' & 'abc'
+(1 row)
+
+select websearch_to_tsquery('simple', 'abc OR_abc');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'or' & 'abc'
+(1 row)
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'pg_class pg"');
+ websearch_to_tsquery
+-----------------------
+ 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg"');
+ websearch_to_tsquery
+-----------------------------
+ ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+ websearch_to_tsquery
+-------------------------------------
+ 'abc' & ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '"pg_class pg" def');
+ websearch_to_tsquery
+-------------------------------------
+ ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+ websearch_to_tsquery
+------------------------------------------------------
+ 'abc' & 'pg' <-> ( 'pg' & 'class' ) <-> 'pg' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+ websearch_to_tsquery
+--------------------------------------
+ 'pg' <-> ( 'pg' & 'class' ) <-> 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+ websearch_to_tsquery
+------------------------------
+ 'pg' & 'pg' & 'class' & 'pg'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc """"" def');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+ websearch_to_tsquery
+------------------------------
+ 'cat' & !( 'fat' <-> 'rat' )
+(1 row)
+
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+ websearch_to_tsquery
+----------------------------------------
+ 'cat' & !( 'fat' <-> 'rat' ) & 'chees'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def -"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', 'abc "def :"');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' & !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+ websearch_to_tsquery
+-----------------------------------
+ 'fat' <-> 'cat' & 'eaten' | 'rat'
+(1 row)
+
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+ websearch_to_tsquery
+------------------------------------
+ 'fat' <-> 'cat' & 'eaten' | !'rat'
+(1 row)
+
+select websearch_to_tsquery('english', 'this is ----fine');
+ websearch_to_tsquery
+----------------------
+ !!!!'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+ websearch_to_tsquery
+----------------------------------------
+ !'fine' & 'dear' <-> 'friend' | 'good'
+(1 row)
+
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+ websearch_to_tsquery
+------------------------
+ 'old' & 'cat' & 'fine'
+(1 row)
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ websearch_to_tsquery
+--------------------------------------
+ 'толст' <-> 'кошк' & 'съел' & 'крыс'
+(1 row)
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ t
+(1 row)
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+ ?column?
+----------
+ f
+(1 row)
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
+select websearch_to_tsquery('''abc''''def''');
+ websearch_to_tsquery
+----------------------
+ 'abc' & 'def'
+(1 row)
+
+select websearch_to_tsquery('\abc');
+ websearch_to_tsquery
+----------------------
+ 'abc'
+(1 row)
+
+select websearch_to_tsquery('\');
+NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
+ websearch_to_tsquery
+----------------------
+
+(1 row)
+
diff --git a/src/test/regress/sql/tsearch.sql b/src/test/regress/sql/tsearch.sql
index 1c8520b3e9..826bd58e45 100644
--- a/src/test/regress/sql/tsearch.sql
+++ b/src/test/regress/sql/tsearch.sql
@@ -539,3 +539,97 @@ create index phrase_index_test_idx on phrase_index_test using gin(fts);
set enable_seqscan = off;
select * from phrase_index_test where fts @@ phraseto_tsquery('english', 'fat cat');
set enable_seqscan = on;
+
+-- test websearch_to_tsquery function
+select websearch_to_tsquery('simple', 'I have a fat:*ABCD cat');
+select websearch_to_tsquery('simple', 'orange:**AABBCCDD');
+select websearch_to_tsquery('simple', 'fat:A!cat:B|rat:C<');
+select websearch_to_tsquery('simple', 'fat:A : cat:B');
+
+select websearch_to_tsquery('simple', 'fat*rat');
+select websearch_to_tsquery('simple', 'fat-rat');
+select websearch_to_tsquery('simple', 'fat_rat');
+
+-- weights are completely ignored
+select websearch_to_tsquery('simple', 'abc : def');
+select websearch_to_tsquery('simple', 'abc:def');
+select websearch_to_tsquery('simple', 'a:::b');
+select websearch_to_tsquery('simple', 'abc:d');
+select websearch_to_tsquery('simple', ':');
+
+-- these operators are ignored
+select websearch_to_tsquery('simple', 'abc & def');
+select websearch_to_tsquery('simple', 'abc | def');
+select websearch_to_tsquery('simple', 'abc <-> def');
+select websearch_to_tsquery('simple', 'abc (pg or class)');
+
+-- NOT is ignored in quotes
+select websearch_to_tsquery('english', 'My brand new smartphone');
+select websearch_to_tsquery('english', 'My brand "new smartphone"');
+select websearch_to_tsquery('english', 'My brand "new -smartphone"');
+
+-- test OR operator
+select websearch_to_tsquery('simple', 'cat or rat');
+select websearch_to_tsquery('simple', 'cat OR rat');
+select websearch_to_tsquery('simple', 'cat "OR" rat');
+select websearch_to_tsquery('simple', 'cat OR');
+select websearch_to_tsquery('simple', 'OR rat');
+select websearch_to_tsquery('simple', '"fat cat OR rat"');
+select websearch_to_tsquery('simple', 'fat (cat OR rat');
+select websearch_to_tsquery('simple', 'or OR or');
+
+-- OR is an operator here ...
+select websearch_to_tsquery('simple', '"fat cat"or"fat rat"');
+select websearch_to_tsquery('simple', 'fat or(rat');
+select websearch_to_tsquery('simple', 'fat or)rat');
+select websearch_to_tsquery('simple', 'fat or&rat');
+select websearch_to_tsquery('simple', 'fat or|rat');
+select websearch_to_tsquery('simple', 'fat or!rat');
+select websearch_to_tsquery('simple', 'fat or<rat');
+select websearch_to_tsquery('simple', 'fat or>rat');
+select websearch_to_tsquery('simple', 'fat or ');
+
+-- ... but not here
+select websearch_to_tsquery('simple', 'abc orange');
+select websearch_to_tsquery('simple', 'abc orтест');
+select websearch_to_tsquery('simple', 'abc OR1234');
+select websearch_to_tsquery('simple', 'abc or-abc');
+select websearch_to_tsquery('simple', 'abc OR_abc');
+
+-- test quotes
+select websearch_to_tsquery('english', '"pg_class pg');
+select websearch_to_tsquery('english', 'pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg"');
+select websearch_to_tsquery('english', 'abc "pg_class pg"');
+select websearch_to_tsquery('english', '"pg_class pg" def');
+select websearch_to_tsquery('english', 'abc "pg pg_class pg" def');
+select websearch_to_tsquery('english', ' or "pg pg_class pg" or ');
+select websearch_to_tsquery('english', '""pg pg_class pg""');
+select websearch_to_tsquery('english', 'abc """"" def');
+select websearch_to_tsquery('english', 'cat -"fat rat"');
+select websearch_to_tsquery('english', 'cat -"fat rat" cheese');
+select websearch_to_tsquery('english', 'abc "def -"');
+select websearch_to_tsquery('english', 'abc "def :"');
+
+select websearch_to_tsquery('english', '"A fat cat" has just eaten a -rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just eaten OR !rat.');
+select websearch_to_tsquery('english', '"A fat cat" has just (+eaten OR -rat)');
+
+select websearch_to_tsquery('english', 'this is ----fine');
+select websearch_to_tsquery('english', '(()) )))) this ||| is && -fine, "dear friend" OR good');
+select websearch_to_tsquery('english', 'an old <-> cat " is fine &&& too');
+
+select websearch_to_tsquery('english', '"A the" OR just on');
+select websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+select to_tsvector('russian', 'съела толстая серая кошка крысу') @@
+websearch_to_tsquery('russian', '"толстая кошка" съела крысу');
+
+-- cases handled by gettoken_tsvector()
+select websearch_to_tsquery('''');
+select websearch_to_tsquery('''abc''''def''');
+select websearch_to_tsquery('\abc');
+select websearch_to_tsquery('\');
On 2018-04-04 17:33, Dmitry Ivanov wrote:
I've been thinking about this for a while, and it seems that this
should be fixed somewhere near parsetext(). Perhaps 'pg' and 'class'
should share the same position. After all, somebody could implement a
parser which would split some words using an arbitrary set of rules,
for instance "split all words containing digits". I propose merging
this patch provided that there are no objections.
I agree that this problem should be solved in a separate patch and
that this feature can be merged without waiting for the fix.
--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Thanks to everyone, pushed with some edits:
1) translated the Russian test to prevent potential problems with encoding
2) fixed the inconsistency between 'or cat' and 'cat or': the second example
doesn't treat OR as a lexeme, but the first one does.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Hello everyone, I am wondering if
AROUND(N) or <N, M> is still possible? I found this thread below and the
original post
/messages/by-id/fe931111ff7e9ad79196486ada79e268@postgrespro.ru
mentioned the proposed feature: 'New operator AROUND(N). It matches if the
distance between words(or maybe phrases) is less than or equal to N.'
Currently, tsquery_phrase(query1 tsquery, query2 tsquery, distance
integer) matches only at a fixed distance. Is there a way to
search with a maximum distance, so that the search returns query1 followed
by query2 up to a certain distance, like the AROUND(N) or <N, M> mentioned
in the thread?
Thank you!
On Mon, Jul 22, 2019 at 9:13 AM Dmitry Ivanov <d.ivanov@postgrespro.ru>
wrote:
Hi everyone,
I'd like to share some intermediate results. Here's what has changed:
1. OR operator is now case-insensitive. Moreover, trailing whitespace is
no longer used to identify it:
select websearch_to_tsquery('simple', 'abc or');
websearch_to_tsquery
----------------------
'abc' & 'or'
(1 row)
select websearch_to_tsquery('simple', 'abc or(def)');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
select websearch_to_tsquery('simple', 'abc or!def');
websearch_to_tsquery
----------------------
'abc' | 'def'
(1 row)
2. AROUND(N) has been dropped. I hope that <N, M> operator will allow us
to implement it with a few lines of code.
3. websearch_to_tsquery() now tolerates various syntax errors, for
instance:
Misused operators:
'abc &'
'| abc'
'<- def'
Missing parentheses:
'abc & (def <-> (cat or rat'
Other sorts of nonsense:
'abc &--|| def' => 'abc' & !!'def'
'abc:def' => 'abc':D & 'ef'
This, however, doesn't mean that the result will always be adequate (who
would have thought?). Overall, current implementation follows the GIGO
principle. In theory, this would allow us to use user-supplied websearch
strings (but see gotchas), even if they don't make much sense. Better
than nothing, right?
4. A small refactoring: I've replaced all WAIT* macros with an enum for
better debugging (names look much nicer in GDB). Hope this is
acceptable.
5. Finally, I've added a few more comments and tests. I haven't checked
the code coverage, though.
A few gotchas:
I haven't touched gettoken_tsvector() yet. As a result, the following
queries produce errors:
select websearch_to_tsquery('simple', '''');
ERROR: syntax error in tsquery: "'"
select websearch_to_tsquery('simple', '\');
ERROR: there is no escaped character: "\"
Maybe there's more. The question is: should we fix those, or is it fine
as it is? I don't have a strong opinion about this.
--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Is this possible with the current websearch_to_tsquery function?
Thanks.
Hello,
On 23.07.2019 09:55, Wh isere wrote:
Is this possible with the current websearch_to_tsquery function?
Thanks.
Hello everyone, I am wondering if
AROUND(N) or <N, M> is still possible? I found this thread below and
the original post
/messages/by-id/fe931111ff7e9ad79196486ada79e268@postgrespro.ru
mentioned the proposed feature: 'New operator AROUND(N). It matches
if the distance between words(or maybe phrases) is less than or
equal to N.'
currently in tsquery_phrase(query1 tsquery, query2 tsquery, distance
integer) the distance is searching a fixed distance, is there way to
search maximum distance so the search returns query1 followed by
query2 up to a certain distance? like the AROUND(N) or <N, M> mentioned
in the thread?
As far as I know AROUND(N) and <N, M> weren't committed, unfortunately.
And so you can search only using a fixed distance currently.
websearch_to_tsquery() can't help here. It just transforms search
pattern with OR, AND statements into tsquery syntax.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Thanks Arthur! I guess there is no other solution? I tried to create a
function that loops through all the distances, but it's very slow.
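A possible workaround for the bounded-distance case, sketched under assumptions: since only fixed-distance phrase operators are available, the alternatives for distances 1..N can be OR-ed into a single tsquery, so each document is matched once instead of once per distance. The function name tsquery_within and its parameter names are hypothetical and not part of PostgreSQL; tsquery_phrase(), string_agg() and generate_series() are stock functions. This has not been benchmarked here, and the generated query grows linearly with max_dist.

```sql
-- Hypothetical helper (not part of PostgreSQL): builds one tsquery that matches
-- q1 followed by q2 at any distance from 1 to max_dist positions.
-- Returns NULL when max_dist < 1, because generate_series() yields no rows.
CREATE FUNCTION tsquery_within(q1 tsquery, q2 tsquery, max_dist integer)
RETURNS tsquery
LANGUAGE sql
IMMUTABLE
AS $$
    -- tsquery_phrase() yields a fixed-distance query such as 'a' <2> 'b';
    -- the per-distance queries are joined with ' | ' and parsed back as one tsquery.
    SELECT string_agg(tsquery_phrase(q1, q2, d)::text, ' | ')::tsquery
    FROM generate_series(1, max_dist) AS d;
$$;

-- Usage: 'cat' is two positions after 'fat', so the <2> alternative is the
-- one expected to match.
SELECT to_tsvector('simple', 'fat gray cat')
       @@ tsquery_within('fat'::tsquery, 'cat'::tsquery, 3);
```

Because the result is an ordinary tsquery, it can be used on the right-hand side of @@ like any other query; whether it is fast enough for a large max_dist would need to be measured.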
On Tuesday, July 30, 2019, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
As far as I know AROUND(N) and <N, M> weren't committed, unfortunately.
And so you can search only using a fixed distance currently.
websearch_to_tsquery() can't help here. It just transforms search pattern
with OR, AND statements into tsquery syntax.