Patch: Add tsmatch JSONPath operator for granular Full Text Search

Started by Florents Tselaiabout 1 month ago5 messages
Jump to latest
#1Florents Tselai
florents.tselai@gmail.com

Hi,

in real-life I work a lot with json & fts search, here's a feature I've
always wished I had,
but never tackle it. Until yesterday that is.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body
tsmatch "performance")');

This patch introduces a tsmatch boolean operator to the JSONPath engine.
By integrating FTS natively into path expressions,
this operator allows for high-precision filtering of nested JSONB
structures—
solving issues with structural ambiguity and query complexity.

Currently, users must choose between two suboptimal paths for FTS-ing
nested JSON:
- Imprecise Global Indexing
jsonb_to_tsvector aggregates text into a flat vector.
This ignores JSON boundaries, leading to false positives when the same key
(e.g., "body")
appears in different contexts (e.g., a "Product Description" vs. a
"Customer Review").

- Complex SQL Workarounds
Achieving 100% precision requires unnesting the document via
jsonb_array_elements and LATERAL joins.
This leads to verbose SQL and high memory overhead from generating
intermediate heap tuples.

One of the most significant advantages of tsmatch is its ability to
participate in multi-condition predicates
within the same JSON object - something jsonb_to_tsvector cannot do.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body
tsmatch "performance")');

In a flat vector, the association between "Alice" and "performance" is lost.
tsmatch preserves this link by evaluating the FTS predicate in-place during
path traversal.

While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly define an
FTS operator,
tsmatch is architecturally modeled after the standard-defined like_regex.

The implementation follows the like_regex precedent:
it is a non-indexable predicate that relies on GIN path-matching for
pruning and heap re-checks for precision.
Caching is scoped to the JsonPathExecContext,
ensuring 'compile-once' efficiency per execution without violating the
stability requirements of prepared statements.

This initial implementation uses plainto_tsquery.
However, the grammar is designed to support a "mode" flag (similar to
like_regex flags)
in future iterations to toggle between to_tsquery, websearch_to_tsquery,
and phraseto_tsquery.

Attachments:

v1-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patchapplication/octet-stream; name=v1-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patchDownload+369-5
tsmatch_bench.outapplication/octet-stream; name=tsmatch_bench.outDownload
tsmatch_bench.sqlapplication/octet-stream; name=tsmatch_bench.sqlDownload+0-11
#2Florents Tselai
florents.tselai@gmail.com
In reply to: Florents Tselai (#1)
Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search

On Mon, Jan 26, 2026 at 7:22 PM Florents Tselai <florents.tselai@gmail.com>
wrote:

Hi,

in real-life I work a lot with json & fts search, here's a feature I've
always wished I had,
but never tackle it. Until yesterday that is.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body
tsmatch "performance")');

This patch introduces a tsmatch boolean operator to the JSONPath engine.
By integrating FTS natively into path expressions,
this operator allows for high-precision filtering of nested JSONB
structures—
solving issues with structural ambiguity and query complexity.

Currently, users must choose between two suboptimal paths for FTS-ing
nested JSON:
- Imprecise Global Indexing
jsonb_to_tsvector aggregates text into a flat vector.
This ignores JSON boundaries, leading to false positives when the same key
(e.g., "body")
appears in different contexts (e.g., a "Product Description" vs. a
"Customer Review").

- Complex SQL Workarounds
Achieving 100% precision requires unnesting the document via
jsonb_array_elements and LATERAL joins.
This leads to verbose SQL and high memory overhead from generating
intermediate heap tuples.

One of the most significant advantages of tsmatch is its ability to
participate in multi-condition predicates
within the same JSON object - something jsonb_to_tsvector cannot do.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body
tsmatch "performance")');

In a flat vector, the association between "Alice" and "performance" is
lost.
tsmatch preserves this link by evaluating the FTS predicate in-place
during path traversal.

While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly define an
FTS operator,
tsmatch is architecturally modeled after the standard-defined like_regex.

The implementation follows the like_regex precedent:
it is a non-indexable predicate that relies on GIN path-matching for
pruning and heap re-checks for precision.
Caching is scoped to the JsonPathExecContext,
ensuring 'compile-once' efficiency per execution without violating the
stability requirements of prepared statements.

This initial implementation uses plainto_tsquery.
However, the grammar is designed to support a "mode" flag (similar to
like_regex flags)
in future iterations to toggle between to_tsquery, websearch_to_tsquery,
and phraseto_tsquery.

Here's a v2, that implements the tsqparser clause

So this should now work too

select jsonb_path_query_array('["fast car", "slow car", "fast and
furious"]', '$[*] ? (@ tsmatch "fast car" tsqparser "w")

Attachments:

v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patchapplication/octet-stream; name=v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patchDownload+567-5
#3Chao Li
li.evan.chao@gmail.com
In reply to: Florents Tselai (#2)
Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search

On Feb 1, 2026, at 19:02, Florents Tselai <florents.tselai@gmail.com> wrote:

On Mon, Jan 26, 2026 at 7:22 PM Florents Tselai <florents.tselai@gmail.com> wrote:
Hi,

in real-life I work a lot with json & fts search, here's a feature I've always wished I had,
but never tackle it. Until yesterday that is.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body tsmatch "performance")');

This patch introduces a tsmatch boolean operator to the JSONPath engine.
By integrating FTS natively into path expressions,
this operator allows for high-precision filtering of nested JSONB structures—
solving issues with structural ambiguity and query complexity.

Currently, users must choose between two suboptimal paths for FTS-ing nested JSON:
- Imprecise Global Indexing
jsonb_to_tsvector aggregates text into a flat vector.
This ignores JSON boundaries, leading to false positives when the same key (e.g., "body")
appears in different contexts (e.g., a "Product Description" vs. a "Customer Review").

- Complex SQL Workarounds
Achieving 100% precision requires unnesting the document via jsonb_array_elements and LATERAL joins.
This leads to verbose SQL and high memory overhead from generating intermediate heap tuples.

One of the most significant advantages of tsmatch is its ability to participate in multi-condition predicates
within the same JSON object - something jsonb_to_tsvector cannot do.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body tsmatch "performance")');

In a flat vector, the association between "Alice" and "performance" is lost.
tsmatch preserves this link by evaluating the FTS predicate in-place during path traversal.

While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly define an FTS operator,
tsmatch is architecturally modeled after the standard-defined like_regex.

The implementation follows the like_regex precedent:
it is a non-indexable predicate that relies on GIN path-matching for pruning and heap re-checks for precision.
Caching is scoped to the JsonPathExecContext,
ensuring 'compile-once' efficiency per execution without violating the stability requirements of prepared statements.

This initial implementation uses plainto_tsquery.
However, the grammar is designed to support a "mode" flag (similar to like_regex flags)
in future iterations to toggle between to_tsquery, websearch_to_tsquery, and phraseto_tsquery.

Here's a v2, that implements the tsqparser clause

So this should now work too

select jsonb_path_query_array('["fast car", "slow car", "fast and furious"]', '$[*] ? (@ tsmatch "fast car" tsqparser "w") <v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>

Hi Florents,

Grant pinged me about this. I can review it in coming days. Can you please rebase it? I failed to apply to current master. Also, the CF reported a failure test case, please take a look.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#4Florents Tselai
florents.tselai@gmail.com
In reply to: Chao Li (#3)
Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search

On Thu, Feb 26, 2026 at 8:48 AM Chao Li <li.evan.chao@gmail.com> wrote:

On Feb 1, 2026, at 19:02, Florents Tselai <florents.tselai@gmail.com>

wrote:

On Mon, Jan 26, 2026 at 7:22 PM Florents Tselai <

florents.tselai@gmail.com> wrote:

Hi,

in real-life I work a lot with json & fts search, here's a feature I've

always wished I had,

but never tackle it. Until yesterday that is.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&

@.body tsmatch "performance")');

This patch introduces a tsmatch boolean operator to the JSONPath engine.
By integrating FTS natively into path expressions,
this operator allows for high-precision filtering of nested JSONB

structures—

solving issues with structural ambiguity and query complexity.

Currently, users must choose between two suboptimal paths for FTS-ing

nested JSON:

- Imprecise Global Indexing
jsonb_to_tsvector aggregates text into a flat vector.
This ignores JSON boundaries, leading to false positives when the same

key (e.g., "body")

appears in different contexts (e.g., a "Product Description" vs. a

"Customer Review").

- Complex SQL Workarounds
Achieving 100% precision requires unnesting the document via

jsonb_array_elements and LATERAL joins.

This leads to verbose SQL and high memory overhead from generating

intermediate heap tuples.

One of the most significant advantages of tsmatch is its ability to

participate in multi-condition predicates

within the same JSON object - something jsonb_to_tsvector cannot do.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" &&

@.body tsmatch "performance")');

In a flat vector, the association between "Alice" and "performance" is

lost.

tsmatch preserves this link by evaluating the FTS predicate in-place

during path traversal.

While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly define

an FTS operator,

tsmatch is architecturally modeled after the standard-defined like_regex.

The implementation follows the like_regex precedent:
it is a non-indexable predicate that relies on GIN path-matching for

pruning and heap re-checks for precision.

Caching is scoped to the JsonPathExecContext,
ensuring 'compile-once' efficiency per execution without violating the

stability requirements of prepared statements.

This initial implementation uses plainto_tsquery.
However, the grammar is designed to support a "mode" flag (similar to

like_regex flags)

in future iterations to toggle between to_tsquery, websearch_to_tsquery,

and phraseto_tsquery.

Here's a v2, that implements the tsqparser clause

So this should now work too

select jsonb_path_query_array('["fast car", "slow car", "fast and

furious"]', '$[*] ? (@ tsmatch "fast car" tsqparser "w")
<v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>

Hi Florents,

Grant pinged me about this. I can review it in coming days. Can you please
rebase it? I failed to apply to current master. Also, the CF reported a
failure test case, please take a look.

Hi Evan,
thanks for having a look. The conflict was due to the intro of
pg_fallthrough. Not related to this patch .

I noticed the failure too, but I'm having a hard time reproducing it tbh.
This fails for Debian Trixie with Meson. The same with Autoconf passes...

https://github.com/Florents-Tselai/postgres/runs/65098077968

Attachments:

v3-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patchapplication/octet-stream; name=v3-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patchDownload+567-5
#5Chao Li
li.evan.chao@gmail.com
In reply to: Florents Tselai (#4)
Re: Patch: Add tsmatch JSONPath operator for granular Full Text Search

On Feb 27, 2026, at 13:59, Florents Tselai <florents.tselai@gmail.com> wrote:

On Thu, Feb 26, 2026 at 8:48 AM Chao Li <li.evan.chao@gmail.com> wrote:

On Feb 1, 2026, at 19:02, Florents Tselai <florents.tselai@gmail.com> wrote:

On Mon, Jan 26, 2026 at 7:22 PM Florents Tselai <florents.tselai@gmail.com> wrote:
Hi,

in real-life I work a lot with json & fts search, here's a feature I've always wished I had,
but never tackle it. Until yesterday that is.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body tsmatch "performance")');

This patch introduces a tsmatch boolean operator to the JSONPath engine.
By integrating FTS natively into path expressions,
this operator allows for high-precision filtering of nested JSONB structures—
solving issues with structural ambiguity and query complexity.

Currently, users must choose between two suboptimal paths for FTS-ing nested JSON:
- Imprecise Global Indexing
jsonb_to_tsvector aggregates text into a flat vector.
This ignores JSON boundaries, leading to false positives when the same key (e.g., "body")
appears in different contexts (e.g., a "Product Description" vs. a "Customer Review").

- Complex SQL Workarounds
Achieving 100% precision requires unnesting the document via jsonb_array_elements and LATERAL joins.
This leads to verbose SQL and high memory overhead from generating intermediate heap tuples.

One of the most significant advantages of tsmatch is its ability to participate in multi-condition predicates
within the same JSON object - something jsonb_to_tsvector cannot do.

SELECT jsonb_path_query(doc, '$.comments[*] ? (@.user == "Alice" && @.body tsmatch "performance")');

In a flat vector, the association between "Alice" and "performance" is lost.
tsmatch preserves this link by evaluating the FTS predicate in-place during path traversal.

While the SQL/JSON standard (ISO/IEC 9075-2) does not explicitly define an FTS operator,
tsmatch is architecturally modeled after the standard-defined like_regex.

The implementation follows the like_regex precedent:
it is a non-indexable predicate that relies on GIN path-matching for pruning and heap re-checks for precision.
Caching is scoped to the JsonPathExecContext,
ensuring 'compile-once' efficiency per execution without violating the stability requirements of prepared statements.

This initial implementation uses plainto_tsquery.
However, the grammar is designed to support a "mode" flag (similar to like_regex flags)
in future iterations to toggle between to_tsquery, websearch_to_tsquery, and phraseto_tsquery.

Here's a v2, that implements the tsqparser clause

So this should now work too

select jsonb_path_query_array('["fast car", "slow car", "fast and furious"]', '$[*] ? (@ tsmatch "fast car" tsqparser "w") <v2-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>

Hi Florents,

Grant pinged me about this. I can review it in coming days. Can you please rebase it? I failed to apply to current master. Also, the CF reported a failure test case, please take a look.

Hi Evan,
thanks for having a look. The conflict was due to the intro of pg_fallthrough. Not related to this patch .

I noticed the failure too, but I'm having a hard time reproducing it tbh.
This fails for Debian Trixie with Meson. The same with Autoconf passes...

https://github.com/Florents-Tselai/postgres/runs/65098077968

<v3-0001-Add-tsmatch-JSONPath-operator-for-granular-Full-T.patch>

I have reviewed v3 and traced a few test cases. Here comes my review comments:

1 
```
+        <replaceable>string</replaceable> <literal>tsmatch</literal> <replaceable>string</replaceable>
+        <optional> <literal>tsconfig</literal> <replaceable>string</replaceable> </optional>
+        <optional> <literal>tsqparser</literal> <replaceable>string</replaceable> </optional>
```

For all “replaceable”, instead of “string”, would it be better to use something more descriptive? For example:
```
<replaceable>json_string</replaceable> <literal>tsmatch</literal> <replaceable>query</replaceable>
<optional> <literal>tsconfig</literal> <replaceable>config_name</replaceable> </optional>
<optional> <literal>tsqparser</literal> <replaceable>parser_mode</replaceable> </optional>
```

2 - jsonpath_gram.y
```
+static bool makeItemTsMatch(JsonPathParseItem *doc,
+							  JsonPathString *tsquery,
+							  JsonPathString *tsconfig,
+							  JsonPathString *tsquery_parser,
+							  JsonPathParseItem ** result,
+							  struct Node *escontext);
```

Format Nit: Looking at the existing code, the J in the second and following lines, should be placed in the same column as the J in the first line.

3 - jsonpath_gram.y
```
+	| expr TSMATCH_P STRING_P
+	{
+		JsonPathParseItem *jppitem;
+		/* Pass NULL for tsconfig (3rd) and NULL for tsquery_parser (4th) */
+		if (! makeItemTsMatch($1, &$3, NULL, NULL, &jppitem, escontext))
+		   YYABORT;
+		$$ = jppitem;
+	}
+	| expr TSMATCH_P STRING_P TSCONFIG_P STRING_P
+	{
+		JsonPathParseItem *jppitem;
+		/* Pass NULL for tsquery_parser (4th) */
+		if (! makeItemTsMatch($1, &$3, &$5, NULL, &jppitem, escontext))
+		   YYABORT;
+		$$ = jppitem;
+	}
+	| expr TSMATCH_P STRING_P TSQUERYPARSER_P STRING_P
+	{
+		JsonPathParseItem *jppitem;
+		/* Pass NULL for tsconfig (3rd) */
+		if (! makeItemTsMatch($1, &$3, NULL, &$5, &jppitem, escontext))
+		   YYABORT;
+		$$ = jppitem;
+	}
+	| expr TSMATCH_P STRING_P TSCONFIG_P STRING_P TSQUERYPARSER_P STRING_P
+	{
+		JsonPathParseItem *jppitem;
+		if (! makeItemTsMatch($1, &$3, &$5, &$7, &jppitem, escontext))
+		   YYABORT;
+		$$ = jppitem;
+	}
```

Feels a little redundant, repeatedly calls makeItemTsMatch. See the attached diff for a simplification. But my version is a bit longer in terms of number of lines. So, up to you.

4 - jsonpath_gram.y
```
+static bool
+makeItemTsMatch(JsonPathParseItem *doc,
+			 JsonPathString *tsquery,
+			 JsonPathString *tsconfig,
+			 JsonPathString *tsquery_parser,
+			 JsonPathParseItem **result,
+			 struct Node *escontext)
```

makeItemTsMatch doesn’t need to return a bool. Actually, now it never returns false, instead, it just ereport(ERROR).

5 - jsonpath.h
```
+		struct
+		{
+			int32		doc;
+			char	   *tsquery;
+			uint32		tsquerylen;
+			int32		tsconfig;
+			char	   *tsqparser;
+			uint32		tsqparser_len;
+		}			tsmatch;
+		struct
+		{
+			JsonPathParseItem *doc;
+			char	   *tsquery;
+			uint32		tsquerylen;
+			JsonPathParseItem *tsconfig;
+			char	   *tsqparser;
+			uint32		tsqparser_len;
+		}			tsmatch;
 	}			value;
```

tsquerylen doesn’t have _ before len, and tsqparser_len, would it be better to make naming conventions consistent in the same structure?

6 - jsonpath_exec.c
```
#include "tsearch/ts_utils.h"
#include "tsearch/ts_cache.h"
#include "utils/regproc.h"
#include "catalog/namespace.h"

static JsonPathBool
executeTsMatch(JsonPathItem *jsp, JsonbValue *str, JsonbValue *rarg,
void *param)
```

Why don’t put these includes to the header section together with other includes?

7 - jsonpath_exec.c
```
+			else
+			{
+				/*
+				 * Fallback or Error for unknown flags (should be caught by
+				 * parser)
+				 */
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("unrecognized tsqparser flag")));
+			}
```

This “else” should never be entered as the same check has been done by makeItemTsMatch. So, maybe just use an Assert here, or pg_unreachable().

8 -  jsonpath_exec.c
```
+	/* Setup Context (Run ONLY once per predicate) */
+	if (!cxt->initialized)
```

While tracing this SQL:
```
evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb
evantest-# @@ '$.tags[*] ? (@ tsmatch "run" tsconfig "english")';
?column?
----------

(1 row)
```

I noticed that, when process “jogging”, cxt->initialized is still false, meaning that, the cxt is not reused across array items. Given the same tsconfig should apply to all array items, I think cxt should be reused.

9 -  jsonpath_exec.c
```
+		/* Select Parser and Compile Query */
+		parser_mode = jsp->content.tsmatch.tsqparser;
+		parser_len = jsp->content.tsmatch.tsqparser_len;
+
+		if (parser_len > 0)
+		{
+			/* Dispatch based on flag */
+			if (pg_strncasecmp(parser_mode, "pl", parser_len) == 0)
```

Nit: parser_mode is only used inside if (parser_len > 0), it can be defined inside the “if”.

10 - jsonpath_gram.y
```
+			 ereport(ERROR,
+					 (errcode(ERRCODE_SYNTAX_ERROR),
+					  errmsg("invalid tsquery_parser value: \"%s\"", tsquery_parser->val),
+					  errhint("Valid values are \"pl\", \"ph\", and \"w\".")));
```

When tested a case with an invalid parser, I got:
```
evantest=# SELECT '{"tags": ["running", "jogging"]}'::jsonb @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqparser "pss")';
ERROR: invalid tsquery_parser value: "pss@"
LINE 2: @? '$.tags[*] ? (@ tsmatch "run" tsconfig "english" tsqpar...
^
HINT: Valid values are "pl", "ph", and "w".
```

You can see the it shows a bad looking invalid value. I think that’s because tsquery_parser->val is not NULL terminated. I fixed this problem with:
```
errmsg("invalid tsquery_parser value: \"%.*s\"", (int) tsquery_parser->len, tsquery_parser->val),
```

This change is also included in the attached diff file.

11 - jsonpath.c
```
+ if (printBracketes)
+ appendStringInfoChar(buf, ')');
+ break;
+
if (printBracketes)
appendStringInfoChar(buf, ')');
```

Duplicate code. Looks like a copy-pasto.

12 - jsonpath.c
```
+				/* Write the Main Query String */
+				appendBinaryStringInfo(buf,
+									   &item->value.tsmatch.tsquerylen,
+									   sizeof(item->value.tsmatch.tsquerylen));
+				appendBinaryStringInfo(buf,
+									   item->value.tsmatch.tsquery,
+									   item->value.tsmatch.tsquerylen);
+				appendStringInfoChar(buf, '\0');
```

I don’t think we need to manually append ‘\0’ after appendBinaryStringInfo. Looking at the header comment of appendBinaryStringInfo, it says that a trailing null will be added.
```
/*
* appendBinaryStringInfo
*
* Append arbitrary binary data to a StringInfo, allocating more space
* if necessary. Ensures that a trailing null byte is present.
*/
void
appendBinaryStringInfo(StringInfo str, const void *data, int datalen)
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachments:

nocfbot_jsonpath_gram_y.diffapplication/octet-stream; name=nocfbot_jsonpath_gram_y.diff; x-unix-mode=0644Download+39-22