Add absolute value to dict_int
I've seen a few requests on how to make FTS search on the absolute value of
integers. This question is usually driven by the fact that the text search
parser interprets a separating hyphen ("partnumber-987") as a minus sign.
There is currently no good answer for this that doesn't involve C coding.
I think this feature has a natural and trivial home in the dict_int
extension, so attached is a patch that does that.
There are no changes to the extension creation scripts, so there is no need
to bump the version. And I think the convention is that we don't bump the
version just to signify a change which implements a new feature when that
doesn't change the creation scripts.
Cheers,
Jeff
Attachments:
dict_int_absval_v001.patchapplication/octet-stream; name=dict_int_absval_v001.patchDownload
diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c
index 56cb283..4111d6a 100644
--- a/contrib/dict_int/dict_int.c
+++ b/contrib/dict_int/dict_int.c
@@ -21,6 +21,7 @@ typedef struct
{
int maxlen;
bool rejectlong;
+ bool absval;
} DictInt;
@@ -37,6 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS)
d = (DictInt *) palloc0(sizeof(DictInt));
d->maxlen = 6;
d->rejectlong = false;
+ d->absval = false;
foreach(l, dictoptions)
{
@@ -55,6 +57,10 @@ dintdict_init(PG_FUNCTION_ARGS)
{
d->rejectlong = defGetBoolean(defel);
}
+ else if (strcmp(defel->defname, "absval") == 0)
+ {
+ d->absval = defGetBoolean(defel);
+ }
else
{
ereport(ERROR,
@@ -72,10 +78,20 @@ dintdict_lexize(PG_FUNCTION_ARGS)
{
DictInt *d = (DictInt *) PG_GETARG_POINTER(0);
char *in = (char *) PG_GETARG_POINTER(1);
- char *txt = pnstrdup(in, PG_GETARG_INT32(2));
+ int len = PG_GETARG_INT32(2);
+ char *txt;
TSLexeme *res = palloc0(sizeof(TSLexeme) * 2);
res[1].lexeme = NULL;
+
+ if (d->absval && (in[0]=='+' || in[0]=='-'))
+ {
+ len--;
+ txt = pnstrdup(in+1, len);
+ }
+ else
+ txt = pnstrdup(in, len);
+
if (PG_GETARG_INT32(2) > d->maxlen)
{
if (d->rejectlong)
diff --git a/contrib/dict_int/expected/dict_int.out b/contrib/dict_int/expected/dict_int.out
index 483e700..702f7af 100644
--- a/contrib/dict_int/expected/dict_int.out
+++ b/contrib/dict_int/expected/dict_int.out
@@ -302,3 +302,28 @@ select ts_lexize('intdict', '314532610153');
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = -214783648);
ERROR: maxlen value has to be >= 1
+select ts_lexize('intdict', '-40865854');
+ ts_lexize
+-----------
+ {-40865}
+(1 row)
+
+select ts_lexize('intdict', '+40865854');
+ ts_lexize
+-----------
+ {+40865}
+(1 row)
+
+ALTER TEXT SEARCH DICTIONARY intdict (ABSVAL = true);
+select ts_lexize('intdict', '-40865854');
+ ts_lexize
+-----------
+ {408658}
+(1 row)
+
+select ts_lexize('intdict', '+40865854');
+ ts_lexize
+-----------
+ {408658}
+(1 row)
+
diff --git a/contrib/dict_int/sql/dict_int.sql b/contrib/dict_int/sql/dict_int.sql
index 5c27acc..8f702aa 100644
--- a/contrib/dict_int/sql/dict_int.sql
+++ b/contrib/dict_int/sql/dict_int.sql
@@ -53,3 +53,9 @@ select ts_lexize('intdict', '641439323669');
select ts_lexize('intdict', '314532610153');
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = -214783648);
+
+select ts_lexize('intdict', '-40865854');
+select ts_lexize('intdict', '+40865854');
+ALTER TEXT SEARCH DICTIONARY intdict (ABSVAL = true);
+select ts_lexize('intdict', '-40865854');
+select ts_lexize('intdict', '+40865854');
diff --git a/doc/src/sgml/dict-int.sgml b/doc/src/sgml/dict-int.sgml
index b556f1b..a659594 100644
--- a/doc/src/sgml/dict-int.sgml
+++ b/doc/src/sgml/dict-int.sgml
@@ -25,7 +25,7 @@
<title>Configuration</title>
<para>
- The dictionary accepts two options:
+ The dictionary accepts three options:
</para>
<itemizedlist>
@@ -45,7 +45,14 @@
word, so that it will not be indexed. Note that this also means that
such an integer cannot be searched for.
</para>
- </listitem>
+ </listitem>
+ <listitem>
+ <para>
+ The <literal>absval</literal> parameter specifies whether a leading
+ '+' or '-' sign should be removed from the integer. The default is <literal>false</literal>.
+ If <literal>true</literal>, the sign is removed before <literal>maxlen</literal> is applied.
+ </para>
+ </listitem>
</itemizedlist>
</sect2>
Jeff Janes <jeff.janes@gmail.com> writes:
I've seen a few requests on how to make FTS search on the absolute value of
integers. This question is usually driven by the fact that the text search
parser interprets a separating hyphen ("partnumber-987") as a minus sign.
There is currently no good answer for this that doesn't involve C coding.
I think this feature has a natural and trivial home in the dict_int
extension, so attached is a patch that does that.
Seems reasonable, so pushed with minor cleanup (I noticed it was
off-by-one for the maxlen check, which was harmless unless you had
rejectlong enabled as well). I debated whether I liked the "absval"
parameter name, which seemed a bit too abbreviative; but it's more
or less in line with the existing parameter names, so I left it alone.
There are no changes to the extension creation scripts, so there is no need
to bump the version. And I think the convention is that we don't bump the
version just to signify a change which implements a new feature when that
doesn't change the creation scripts.
Right, there's no need to update the creation script.
I noticed one odd thing which is not the fault of this patch, but
seems to need cleaned up:
regression=# ALTER TEXT SEARCH DICTIONARY intdict (absval = true);
ALTER TEXT SEARCH DICTIONARY
regression=# select ts_lexize('intdict', '-123456');
ts_lexize
-----------
{123456}
(1 row)
regression=# ALTER TEXT SEARCH DICTIONARY intdict (absval = 1);
ALTER TEXT SEARCH DICTIONARY
regression=# select ts_lexize('intdict', '-123456');
ERROR: absval requires a Boolean value
Why did ALTER accept that, if it wasn't valid? It's not like
there's no error checking at all:
regression=# ALTER TEXT SEARCH DICTIONARY intdict (absval = foo);
ERROR: absval requires a Boolean value
Upon digging into that, the reason is that defGetBoolean accepts
a T_Integer Value with value 1, but it doesn't accept a T_String
with value "1". And apparently we're satisfied to smash dictionary
parameters to strings before storing them.
There are at least three things we could do here:
1. Insist that defGetBoolean and its siblings should accept the
string equivalent of any non-string value they accept. This
would change the behavior of a whole lot of utility commands,
not only the text search commands, and I'm not exactly convinced
it's a good idea. Seems like it's losing error detection
capability.
2. Improve the catalog representation of text search parameters
so that the original Value node can be faithfully reconstructed.
I'd be for this, except it seems like a lot of work for a rather
minor benefit.
3. Rearrange text search parameter validation so we smash parameters
to strings *before* we validate them, ensuring we won't take any
settings that will be rejected later.
I'm leaning to #3 as being the most practical fix. Thoughts?
regards, tom lane
I wrote:
There are at least three things we could do here:
1. Insist that defGetBoolean and its siblings should accept the
string equivalent of any non-string value they accept. This
would change the behavior of a whole lot of utility commands,
not only the text search commands, and I'm not exactly convinced
it's a good idea. Seems like it's losing error detection
capability.
2. Improve the catalog representation of text search parameters
so that the original Value node can be faithfully reconstructed.
I'd be for this, except it seems like a lot of work for a rather
minor benefit.
3. Rearrange text search parameter validation so we smash parameters
to strings *before* we validate them, ensuring we won't take any
settings that will be rejected later.
I still don't much like #1, but after looking closer, #2 is not as
impractical as I thought. The catalog representation doesn't need
any redefinition really, because per the existing comments,
* For the convenience of pg_dump, the output is formatted exactly as it
* would need to appear in CREATE TEXT SEARCH DICTIONARY to reproduce the
* same options.
So all we really need to do is upgrade [de]serialize_deflist to be smarter
about int and float nodes. This is still a bit invasive because somebody
decided to make deserialize_deflist serve two masters, and I don't feel
like working through whether the prsheadline code would cope nicely with
non-string output nodes. So the first patch attached adds a flag argument
to deserialize_deflist to tell it whether to force all the values to
strings.
Alternatively, we could do #3, as in the second patch below. This
seems clearly Less Good, but it's a smaller/safer patch, and it's
at least potentially back-patchable, whereas changing the signature
of deserialize_deflist in stable branches would risk trouble.
(I didn't bother with regression test additions yet, but some would
be appropriate for either patch.)
Given the lack of field complaints, I'm not that excited about
back-patching anything for this. So my inclination is to go with #2
(first patch) and fix it in HEAD only.
Thoughts?
regards, tom lane
Attachments:
fix-serialization.patchtext/x-diff; charset=us-ascii; name=fix-serialization.patchDownload
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index 9dca682..0731ca5 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -36,6 +36,7 @@
#include "commands/alter.h"
#include "commands/defrem.h"
#include "commands/event_trigger.h"
+#include "common/string.h"
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "parser/parse_func.h"
@@ -52,6 +53,8 @@ static void MakeConfigurationMapping(AlterTSConfigurationStmt *stmt,
HeapTuple tup, Relation relMap);
static void DropConfigurationMapping(AlterTSConfigurationStmt *stmt,
HeapTuple tup, Relation relMap);
+static DefElem *buildDefItem(const char *name, const char *val,
+ bool force_strings);
/* --------------------- TS Parser commands ------------------------ */
@@ -566,7 +569,7 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
if (isnull)
dictoptions = NIL;
else
- dictoptions = deserialize_deflist(opt);
+ dictoptions = deserialize_deflist(opt, false);
/*
* Modify the options list as per specified changes
@@ -1519,9 +1522,6 @@ DropConfigurationMapping(AlterTSConfigurationStmt *stmt,
* For the convenience of pg_dump, the output is formatted exactly as it
* would need to appear in CREATE TEXT SEARCH DICTIONARY to reproduce the
* same options.
- *
- * Note that we assume that only the textual representation of an option's
- * value is interesting --- hence, non-string DefElems get forced to strings.
*/
text *
serialize_deflist(List *deflist)
@@ -1539,19 +1539,30 @@ serialize_deflist(List *deflist)
appendStringInfo(&buf, "%s = ",
quote_identifier(defel->defname));
- /* If backslashes appear, force E syntax to determine their handling */
- if (strchr(val, '\\'))
- appendStringInfoChar(&buf, ESCAPE_STRING_SYNTAX);
- appendStringInfoChar(&buf, '\'');
- while (*val)
+
+ /*
+ * If the value is a T_Integer or T_Float, emit it without quotes,
+ * otherwise with quotes. This is essential to allow correct
+ * reconstruction of the node type as well as the value.
+ */
+ if (IsA(defel->arg, Integer) || IsA(defel->arg, Float))
+ appendStringInfoString(&buf, val);
+ else
{
- char ch = *val++;
+ /* If backslashes appear, force E syntax to quote them safely */
+ if (strchr(val, '\\'))
+ appendStringInfoChar(&buf, ESCAPE_STRING_SYNTAX);
+ appendStringInfoChar(&buf, '\'');
+ while (*val)
+ {
+ char ch = *val++;
- if (SQL_STR_DOUBLE(ch, true))
+ if (SQL_STR_DOUBLE(ch, true))
+ appendStringInfoChar(&buf, ch);
appendStringInfoChar(&buf, ch);
- appendStringInfoChar(&buf, ch);
+ }
+ appendStringInfoChar(&buf, '\'');
}
- appendStringInfoChar(&buf, '\'');
if (lnext(deflist, l) != NULL)
appendStringInfoString(&buf, ", ");
}
@@ -1566,10 +1577,12 @@ serialize_deflist(List *deflist)
*
* This is also used for prsheadline options, so for backward compatibility
* we need to accept a few things serialize_deflist() will never emit:
- * in particular, unquoted and double-quoted values.
+ * in particular, unquoted and double-quoted strings. Also, for prsheadline
+ * we want to force all the output nodes to be plain strings, whereas for
+ * true deserialization we want to reconstruct the node type.
*/
List *
-deserialize_deflist(Datum txt)
+deserialize_deflist(Datum txt, bool force_strings)
{
text *in = DatumGetTextPP(txt); /* in case it's toasted */
List *result = NIL;
@@ -1694,8 +1707,9 @@ deserialize_deflist(Datum txt)
{
*wsptr++ = '\0';
result = lappend(result,
- makeDefElem(pstrdup(workspace),
- (Node *) makeString(pstrdup(startvalue)), -1));
+ buildDefItem(workspace,
+ startvalue,
+ true));
state = CS_WAITKEY;
}
}
@@ -1726,8 +1740,9 @@ deserialize_deflist(Datum txt)
{
*wsptr++ = '\0';
result = lappend(result,
- makeDefElem(pstrdup(workspace),
- (Node *) makeString(pstrdup(startvalue)), -1));
+ buildDefItem(workspace,
+ startvalue,
+ true));
state = CS_WAITKEY;
}
}
@@ -1741,8 +1756,9 @@ deserialize_deflist(Datum txt)
{
*wsptr++ = '\0';
result = lappend(result,
- makeDefElem(pstrdup(workspace),
- (Node *) makeString(pstrdup(startvalue)), -1));
+ buildDefItem(workspace,
+ startvalue,
+ force_strings));
state = CS_WAITKEY;
}
else
@@ -1760,8 +1776,9 @@ deserialize_deflist(Datum txt)
{
*wsptr++ = '\0';
result = lappend(result,
- makeDefElem(pstrdup(workspace),
- (Node *) makeString(pstrdup(startvalue)), -1));
+ buildDefItem(workspace,
+ startvalue,
+ force_strings));
}
else if (state != CS_WAITKEY)
ereport(ERROR,
@@ -1773,3 +1790,35 @@ deserialize_deflist(Datum txt)
return result;
}
+
+/*
+ * Build one DefElem for deserialize_deflist
+ */
+static DefElem *
+buildDefItem(const char *name, const char *val, bool force_strings)
+{
+ if (!force_strings && val[0] != '\0')
+ {
+ int v;
+ char *endptr;
+
+ /* Try to parse as an integer */
+ errno = 0;
+ v = strtoint(val, &endptr, 10);
+ if (errno == 0 && *endptr == '\0')
+ return makeDefElem(pstrdup(name),
+ (Node *) makeInteger(v),
+ -1);
+ /* Nope, how about as a float? */
+ errno = 0;
+ (void) strtod(val, &endptr);
+ if (errno == 0 && *endptr == '\0')
+ return makeDefElem(pstrdup(name),
+ (Node *) makeFloat(pstrdup(val)),
+ -1);
+ }
+ /* Just make it a string */
+ return makeDefElem(pstrdup(name),
+ (Node *) makeString(pstrdup(val)),
+ -1);
+}
diff --git a/src/backend/tsearch/wparser.c b/src/backend/tsearch/wparser.c
index 88005c0..85851da 100644
--- a/src/backend/tsearch/wparser.c
+++ b/src/backend/tsearch/wparser.c
@@ -329,7 +329,7 @@ ts_headline_byid_opt(PG_FUNCTION_ARGS)
VARDATA_ANY(in), VARSIZE_ANY_EXHDR(in));
if (opt)
- prsoptions = deserialize_deflist(PointerGetDatum(opt));
+ prsoptions = deserialize_deflist(PointerGetDatum(opt), true);
else
prsoptions = NIL;
@@ -400,7 +400,7 @@ ts_headline_jsonb_byid_opt(PG_FUNCTION_ARGS)
state->prsobj = lookup_ts_parser_cache(state->cfg->prsId);
state->query = query;
if (opt)
- state->prsoptions = deserialize_deflist(PointerGetDatum(opt));
+ state->prsoptions = deserialize_deflist(PointerGetDatum(opt), true);
else
state->prsoptions = NIL;
@@ -477,7 +477,7 @@ ts_headline_json_byid_opt(PG_FUNCTION_ARGS)
state->prsobj = lookup_ts_parser_cache(state->cfg->prsId);
state->query = query;
if (opt)
- state->prsoptions = deserialize_deflist(PointerGetDatum(opt));
+ state->prsoptions = deserialize_deflist(PointerGetDatum(opt), true);
else
state->prsoptions = NIL;
diff --git a/src/backend/utils/cache/ts_cache.c b/src/backend/utils/cache/ts_cache.c
index 1641271..8900ac2 100644
--- a/src/backend/utils/cache/ts_cache.c
+++ b/src/backend/utils/cache/ts_cache.c
@@ -334,7 +334,7 @@ lookup_ts_dictionary_cache(Oid dictId)
if (isnull)
dictoptions = NIL;
else
- dictoptions = deserialize_deflist(opt);
+ dictoptions = deserialize_deflist(opt, false);
entry->dictData =
DatumGetPointer(OidFunctionCall1(template->tmplinit,
diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h
index 5cd6975..1bd5de7 100644
--- a/src/include/commands/defrem.h
+++ b/src/include/commands/defrem.h
@@ -127,7 +127,7 @@ extern void RemoveTSConfigurationById(Oid cfgId);
extern ObjectAddress AlterTSConfiguration(AlterTSConfigurationStmt *stmt);
extern text *serialize_deflist(List *deflist);
-extern List *deserialize_deflist(Datum txt);
+extern List *deserialize_deflist(Datum txt, bool force_strings);
/* commands/foreigncmds.c */
extern ObjectAddress AlterForeignServerOwner(const char *name, Oid newOwnerId);
deserialize-reserialize.patchtext/x-diff; charset=us-ascii; name=deserialize-reserialize.patchDownload
diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c
index 9dca682..182837b 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -419,6 +419,7 @@ DefineTSDictionary(List *names, List *parameters)
NameData dname;
Oid templId = InvalidOid;
List *dictoptions = NIL;
+ Datum dictoptions_s;
Oid dictOid;
Oid namespaceoid;
AclResult aclresult;
@@ -453,6 +454,20 @@ DefineTSDictionary(List *names, List *parameters)
}
/*
+ * Immediately serialize the parameters, and then deserialize so that what
+ * we pass to validation is the same representation that will be seen at
+ * runtime. This is a kluge to cope with the fact that the serialization
+ * logic fails to reproduce the node structure exactly.
+ */
+ if (dictoptions)
+ {
+ dictoptions_s = PointerGetDatum(serialize_deflist(dictoptions));
+ dictoptions = deserialize_deflist(dictoptions_s);
+ }
+ else
+ dictoptions_s = (Datum) 0;
+
+ /*
* Validation
*/
if (!OidIsValid(templId))
@@ -480,8 +495,7 @@ DefineTSDictionary(List *names, List *parameters)
values[Anum_pg_ts_dict_dictowner - 1] = ObjectIdGetDatum(GetUserId());
values[Anum_pg_ts_dict_dicttemplate - 1] = ObjectIdGetDatum(templId);
if (dictoptions)
- values[Anum_pg_ts_dict_dictinitoption - 1] =
- PointerGetDatum(serialize_deflist(dictoptions));
+ values[Anum_pg_ts_dict_dictinitoption - 1] = dictoptions_s;
else
nulls[Anum_pg_ts_dict_dictinitoption - 1] = true;
@@ -537,6 +551,7 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
Oid dictId;
ListCell *pl;
List *dictoptions;
+ Datum dictoptions_s;
Datum opt;
bool isnull;
Datum repl_val[Natts_pg_ts_dict];
@@ -595,6 +610,20 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
}
/*
+ * Immediately serialize the parameters, and then deserialize so that what
+ * we pass to validation is the same representation that will be seen at
+ * runtime. This is a kluge to cope with the fact that the serialization
+ * logic fails to reproduce the node structure exactly.
+ */
+ if (dictoptions)
+ {
+ dictoptions_s = PointerGetDatum(serialize_deflist(dictoptions));
+ dictoptions = deserialize_deflist(dictoptions_s);
+ }
+ else
+ dictoptions_s = (Datum) 0;
+
+ /*
* Validate
*/
verify_dictoptions(((Form_pg_ts_dict) GETSTRUCT(tup))->dicttemplate,
@@ -608,8 +637,7 @@ AlterTSDictionary(AlterTSDictionaryStmt *stmt)
memset(repl_repl, false, sizeof(repl_repl));
if (dictoptions)
- repl_val[Anum_pg_ts_dict_dictinitoption - 1] =
- PointerGetDatum(serialize_deflist(dictoptions));
+ repl_val[Anum_pg_ts_dict_dictinitoption - 1] = dictoptions_s;
else
repl_null[Anum_pg_ts_dict_dictinitoption - 1] = true;
repl_repl[Anum_pg_ts_dict_dictinitoption - 1] = true;
@@ -1522,6 +1550,9 @@ DropConfigurationMapping(AlterTSConfigurationStmt *stmt,
*
* Note that we assume that only the textual representation of an option's
* value is interesting --- hence, non-string DefElems get forced to strings.
+ * This is, in fact, wrong. For example, defGetBoolean accepts integers 0
+ * and 1 but not the string equivalents of those. Pending making this smarter,
+ * callers must avoid assuming that serialize/deserialize is an identity.
*/
text *
serialize_deflist(List *deflist)
I wrote:
So all we really need to do is upgrade [de]serialize_deflist to be smarter
about int and float nodes. This is still a bit invasive because somebody
decided to make deserialize_deflist serve two masters, and I don't feel
like working through whether the prsheadline code would cope nicely with
non-string output nodes. So the first patch attached adds a flag argument
to deserialize_deflist to tell it whether to force all the values to
strings.
On closer inspection, it doesn't seem that changing the behavior for
prsheadline will make any difference. The only extant code that
reads that result is prsd_headline which always uses defGetString,
and probably any third-party text search parsers would too.
So I've pushed this without the extra flag argument.
regards, tom lane