Empty string in lexeme for tsvector

Started by Jean-Christophe Arnuover 4 years ago10 messages
#1Jean-Christophe Arnu
jcarnu@gmail.com
1 attachment(s)

Hello Hackers,

This is my second proposal for a patch, so I hope not to make "rookie"
mistakes.

This proposal patch is based on a simple use case :

If one creates a table this way
CREATE TABLE tst_table AS (SELECT array_to_tsvector(ARRAY['','abc','def']));

the table content is :
array_to_tsvector
-------------------
'' 'abc' 'def'
(1 row)

First it can be strange to have an empty string for tsvector lexeme but
anyway, keep going to the real point.

Once dumped, this table dump contains that empty string that can't be
restored.
tsvector_parse (./utils/adt/tsvector_parser.c) raises an error.

Thus it is not possible for data to be restored this way.

There are two ways to consider this : is it alright to have empty strings
in lexemes ?
* If so, empty strings should be correctly parsed by tsvector_parser.
* If not, one should prevent empty strings from being stored into
tsvectors.

Since "empty strings" seems not to be a valid lexeme, I undertook to change
some functions dealing with tsvector to check whether string arguments are
empty. This might be the wrong path as I'm not familiar with tsvector
usage... (OTOH, I can provide a fix patch for tsvector_parser() if I'm
wrong).

This involved changing the way functions like array_to_tsvector(),
ts_delete() and setweight() behave. As for NULL values, empty string values
are checked and an error is raised for such a value. It appears to me that
ERRCODE_ZERO_LENGTH_CHARACTER_STRING (2200F) matched this behaviour but I
may be wrong.

Since this patch changes the way functions behave, consider it as a simple
try to overcome a strange situation we've noticed on a specific use case.

This included patch manages that checks for empty strings on the pointed
out functions. It comes with modified regression tests. Patch applies along
head/master and 14_RC1.

Comments are more than welcome!
Thank you,

--
Jean-Christophe Arnu

Attachments:

empty_string_in_tsvector_v0.patchtext/x-patch; charset=US-ASCII; name=empty_string_in_tsvector_v0.patchDownload
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..5a33e1bb10 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -331,6 +331,11 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
+		if (lex_len == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
 		{
 			WordEntryPos *p = POSDATAPTR(tsout, entry + lex_pos);
@@ -607,6 +612,11 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
 
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
@@ -761,13 +771,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, &dlexemes, &nulls, &nitems);
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..a258179ef0 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -1260,6 +1260,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
                    unnest                    
 ---------------------------------------------
@@ -1330,6 +1332,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1379,8 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
                           ts_filter                          
 -------------------------------------------------------------
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..6be0b26117 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -240,6 +240,7 @@ SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3':
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
 
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
 SELECT unnest('base hidden rebel spaceship strike'::tsvector);
@@ -252,6 +253,7 @@ SELECT tsvector_to_array('base hidden rebel spaceship strike'::tsvector);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
 
@@ -262,6 +264,7 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
 
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
 SELECT ts_filter('base hidden rebel spaceship strike'::tsvector, '{a}');
#2Ranier Vilela
ranier.vf@gmail.com
In reply to: Jean-Christophe Arnu (#1)
Re: Empty string in lexeme for tsvector

Em sex., 24 de set. de 2021 às 05:47, Jean-Christophe Arnu <jcarnu@gmail.com>
escreveu:

Hello Hackers,

This is my second proposal for a patch, so I hope not to make "rookie"
mistakes.

This proposal patch is based on a simple use case :

If one creates a table this way
CREATE TABLE tst_table AS (SELECT
array_to_tsvector(ARRAY['','abc','def']));

the table content is :
array_to_tsvector
-------------------
'' 'abc' 'def'
(1 row)

First it can be strange to have an empty string for tsvector lexeme but
anyway, keep going to the real point.

Once dumped, this table dump contains that empty string that can't be
restored.
tsvector_parse (./utils/adt/tsvector_parser.c) raises an error.

Thus it is not possible for data to be restored this way.

There are two ways to consider this : is it alright to have empty strings
in lexemes ?
* If so, empty strings should be correctly parsed by tsvector_parser.
* If not, one should prevent empty strings from being stored into
tsvectors.

Since "empty strings" seems not to be a valid lexeme, I undertook to
change some functions dealing with tsvector to check whether string
arguments are empty. This might be the wrong path as I'm not familiar with
tsvector usage... (OTOH, I can provide a fix patch for tsvector_parser() if
I'm wrong).

This involved changing the way functions like array_to_tsvector(),
ts_delete() and setweight() behave. As for NULL values, empty string values
are checked and an error is raised for such a value. It appears to me that
ERRCODE_ZERO_LENGTH_CHARACTER_STRING (2200F) matched this behaviour but I
may be wrong.

Since this patch changes the way functions behave, consider it as a simple
try to overcome a strange situation we've noticed on a specific use case.

This included patch manages that checks for empty strings on the pointed
out functions. It comes with modified regression tests. Patch applies along
head/master and 14_RC1.

Comments are more than welcome!

1. Would be better to add this test-and-error before tsvector_bsearch call.

+ if (lex_len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

If lex_len is equal to zero, better get out soon.

2. The second test-and-error can use lex_len, just like the first test,
I don't see the point in recalculating the size of lex_len if that's
already done.

+ if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

regards,
Ranier Vilela

#3Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Ranier Vilela (#2)
1 attachment(s)
Re: Empty string in lexeme for tsvector

Le ven. 24 sept. 2021 à 13:03, Ranier Vilela <ranier.vf@gmail.com> a écrit :

Comments are more than welcome!

1. Would be better to add this test-and-error before tsvector_bsearch call.

+ if (lex_len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

If lex_len is equal to zero, better get out soon.

2. The second test-and-error can use lex_len, just like the first test,
I don't see the point in recalculating the size of lex_len if that's
already done.

+ if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

Hello Ranier,
Thank you for your comments.
Here's a new patch file taking your comments into account.

I was just wondering if empty string eviction is done in the right place.
As you rightfully commented, lex_len is calculated later (once again for a
right purpose) and my code checks for empty strings as soon as possible.
To me, it seems to be the right thing to do (prevent further processing on
lexemes
as soon as possible) but I might omit something.

Regards

Jean-Christophe Arnu

Attachments:

empty_string_in_tsvector_v1.patchtext/x-patch; charset=US-ASCII; name=empty_string_in_tsvector_v1.patchDownload
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..00f80ffcbc 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -329,6 +329,12 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
@@ -609,6 +615,12 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
 
 		if (lex_pos >= 0)
@@ -761,13 +773,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, &dlexemes, &nulls, &nitems);
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..a258179ef0 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -1260,6 +1260,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
                    unnest                    
 ---------------------------------------------
@@ -1330,6 +1332,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1379,8 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
                           ts_filter                          
 -------------------------------------------------------------
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..6be0b26117 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -240,6 +240,7 @@ SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3':
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
 
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
 SELECT unnest('base hidden rebel spaceship strike'::tsvector);
@@ -252,6 +253,7 @@ SELECT tsvector_to_array('base hidden rebel spaceship strike'::tsvector);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
 
@@ -262,6 +264,7 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
 
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
 SELECT ts_filter('base hidden rebel spaceship strike'::tsvector, '{a}');
#4Ranier Vilela
ranier.vf@gmail.com
In reply to: Jean-Christophe Arnu (#3)
Re: Empty string in lexeme for tsvector

Em sex., 24 de set. de 2021 às 09:39, Jean-Christophe Arnu <jcarnu@gmail.com>
escreveu:

Le ven. 24 sept. 2021 à 13:03, Ranier Vilela <ranier.vf@gmail.com> a
écrit :

Comments are more than welcome!

1. Would be better to add this test-and-error before tsvector_bsearch
call.

+ if (lex_len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

If lex_len is equal to zero, better get out soon.

2. The second test-and-error can use lex_len, just like the first test,
I don't see the point in recalculating the size of lex_len if that's
already done.

+ if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

Hello Ranier,
Thank you for your comments.
Here's a new patch file taking your comments into account.

Thanks.

I was just wondering if empty string eviction is done in the right place.

As you rightfully commented, lex_len is calculated later (once again for a

right purpose) and my code checks for empty strings as soon as possible.
To me, it seems to be the right thing to do (prevent further processing on
lexemes
as soon as possible) but I might omit something.

It's always good to avoid unnecessary processing.

regards,
Ranier Vilela

#5Artur Zakirov
zaartur@gmail.com
In reply to: Jean-Christophe Arnu (#3)
Re: Empty string in lexeme for tsvector

On Fri, Sep 24, 2021 at 2:39 PM Jean-Christophe Arnu <jcarnu@gmail.com> wrote:

Here's a new patch file taking your comments into account.

Nice catch! The patch looks good to me.
Can you also add a more general test case:

=# SELECT $$'' '1' '2'$$::tsvector;
ERROR: syntax error in tsvector: "'' '1' '2'"
LINE 1: SELECT $$'' '1' '2'$$::tsvector;

--
Artur

#6Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Artur Zakirov (#5)
1 attachment(s)
Re: Empty string in lexeme for tsvector

Le dim. 26 sept. 2021 à 15:55, Artur Zakirov <zaartur@gmail.com> a écrit :

Nice catch! The patch looks good to me.
Can you also add a more general test case:

=# SELECT $$'' '1' '2'$$::tsvector;
ERROR: syntax error in tsvector: "'' '1' '2'"
LINE 1: SELECT $$'' '1' '2'$$::tsvector;

Thank you, Artur for spotting this test.
It is now included into this patch.

--
Jean-Christophe Arnu

Attachments:

empty_string_in_tsvector_v3.patchtext/x-patch; charset=US-ASCII; name=empty_string_in_tsvector_v3.patchDownload
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..00f80ffcbc 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -329,6 +329,12 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
@@ -609,6 +615,12 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
 
 		if (lex_pos >= 0)
@@ -761,13 +773,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, &dlexemes, &nulls, &nitems);
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..f8bf9c6051 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -85,6 +85,10 @@ SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
  'a':3A,4B 'b':2A 'ba':1237
 (1 row)
 
+SELECT $$'' '1' '2'$$::tsvector;
+ERROR:  syntax error in tsvector: "'' '1' '2'"
+LINE 1: SELECT $$'' '1' '2'$$::tsvector;
+               ^
 --Base tsquery test
 SELECT '1'::tsquery;
  tsquery 
@@ -1260,6 +1264,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
                    unnest                    
 ---------------------------------------------
@@ -1330,6 +1336,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1383,8 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
                           ts_filter                          
 -------------------------------------------------------------
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..79a7777407 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -17,6 +17,7 @@ SELECT $$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector;
 SELECT tsvectorin(tsvectorout($$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector));
 SELECT '''w'':4A,3B,2C,1D,5 a:8';
 SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
+SELECT $$'' '1' '2'$$::tsvector;
 
 --Base tsquery test
 SELECT '1'::tsquery;
@@ -240,6 +241,7 @@ SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3':
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
 
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
 SELECT unnest('base hidden rebel spaceship strike'::tsvector);
@@ -252,6 +254,7 @@ SELECT tsvector_to_array('base hidden rebel spaceship strike'::tsvector);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
 
@@ -262,6 +265,7 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
 
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
 SELECT ts_filter('base hidden rebel spaceship strike'::tsvector, '{a}');
#7Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Jean-Christophe Arnu (#6)
1 attachment(s)
Re: Empty string in lexeme for tsvector

Le dim. 26 sept. 2021 à 22:41, Jean-Christophe Arnu <jcarnu@gmail.com> a
écrit :

Le dim. 26 sept. 2021 à 15:55, Artur Zakirov <zaartur@gmail.com> a écrit :

Nice catch! The patch looks good to me.
Can you also add a more general test case:

=# SELECT $$'' '1' '2'$$::tsvector;
ERROR: syntax error in tsvector: "'' '1' '2'"
LINE 1: SELECT $$'' '1' '2'$$::tsvector;

Thank you, Artur for spotting this test.
It is now included into this patch.

Two more things :

* I updated the documentation for array_to_tsvector(), ts_delete() and
setweight() functions (so here's a new patch);
* I should mention François Ferry from Logilab who first reported the
backup/restore problem that led to this patch.

I think this should be ok, now the doc is up to date.

Kind regards.
--
Jean-Christophe Arnu

Attachments:

empty_string_in_tsvector_v4.patchtext/x-patch; charset=US-ASCII; name=empty_string_in_tsvector_v4.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbe..01f0b870ca 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12896,7 +12896,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
        </para>
        <para>
         Converts an array of lexemes to a <type>tsvector</type>.
-        The given strings are used as-is without further processing.
+        The given strings are used as-is. Some checks are performed
+        on array elements. Empty strings and <literal>NULL</literal> values
+        will cause an error to be raised.
        </para>
        <para>
         <literal>array_to_tsvector('{fat,cat,rat}'::text[])</literal>
@@ -13079,6 +13081,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Assigns the specified <parameter>weight</parameter> to elements
         of the <parameter>vector</parameter> that are listed
         in <parameter>lexemes</parameter>.
+        Some checks are performed on <parameter>lexemes</parameter>.
+        Empty strings and <literal>NULL</literal> values
+        will cause an error to be raised.
        </para>
        <para>
         <literal>setweight('fat:2,4 cat:3 rat:5,6B'::tsvector, 'A', '{cat,rat}')</literal>
@@ -13256,6 +13261,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Removes any occurrences of the lexemes
         in <parameter>lexemes</parameter>
         from the <parameter>vector</parameter>.
+        Some checks are performed on <parameter>lexemes</parameter>.
+        Empty strings and <literal>NULL</literal> values
+        will cause an error to be raised.
        </para>
        <para>
         <literal>ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])</literal>
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..00f80ffcbc 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -329,6 +329,12 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsout, lex, lex_len);
 
 		if (lex_pos >= 0 && (j = POSDATALEN(tsout, entry + lex_pos)) != 0)
@@ -609,6 +615,12 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
+
+		if (lex_len == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
+
 		lex_pos = tsvector_bsearch(tsin, lex, lex_len);
 
 		if (lex_pos >= 0)
@@ -761,13 +773,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, &dlexemes, &nulls, &nitems);
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..f8bf9c6051 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -85,6 +85,10 @@ SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
  'a':3A,4B 'b':2A 'ba':1237
 (1 row)
 
+SELECT $$'' '1' '2'$$::tsvector;
+ERROR:  syntax error in tsvector: "'' '1' '2'"
+LINE 1: SELECT $$'' '1' '2'$$::tsvector;
+               ^
 --Base tsquery test
 SELECT '1'::tsquery;
  tsquery 
@@ -1260,6 +1264,8 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
                    unnest                    
 ---------------------------------------------
@@ -1330,6 +1336,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1383,8 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+ERROR:  lexeme array may not contain empty strings
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
                           ts_filter                          
 -------------------------------------------------------------
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..79a7777407 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -17,6 +17,7 @@ SELECT $$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector;
 SELECT tsvectorin(tsvectorout($$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector));
 SELECT '''w'':4A,3B,2C,1D,5 a:8';
 SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
+SELECT $$'' '1' '2'$$::tsvector;
 
 --Base tsquery test
 SELECT '1'::tsquery;
@@ -240,6 +241,7 @@ SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3':
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
 
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
 SELECT unnest('base hidden rebel spaceship strike'::tsvector);
@@ -252,6 +254,7 @@ SELECT tsvector_to_array('base hidden rebel spaceship strike'::tsvector);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
 
@@ -262,6 +265,7 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
 
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
 SELECT ts_filter('base hidden rebel spaceship strike'::tsvector, '{a}');
#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jean-Christophe Arnu (#7)
Re: Empty string in lexeme for tsvector

Jean-Christophe Arnu <jcarnu@gmail.com> writes:

[ empty_string_in_tsvector_v4.patch ]

I looked through this patch a bit. I don't agree with adding
these new error conditions to tsvector_setweight_by_filter and
tsvector_delete_arr. Those don't prevent bad lexemes from being
added to tsvectors, so AFAICS they can have no effect other than
breaking existing applications. In fact, tsvector_delete_arr is
one thing you could use to fix existing bad tsvectors, so making
it throw an error seems actually counterproductive.

(By the same token, I think there's a good argument for
tsvector_delete_arr to just ignore nulls, not throw an error.
That's a somewhat orthogonal issue, though.)

What I'm wondering about more than that is whether array_to_tsvector
is the only place that can inject an empty lexeme ... don't we have
anything else that can add lexemes without going through the parser?

regards, tom lane

#9Jean-Christophe Arnu
jcarnu@gmail.com
In reply to: Tom Lane (#8)
1 attachment(s)
Re: Empty string in lexeme for tsvector

Thank you Tom for your review.

Le mer. 29 sept. 2021 à 21:36, Tom Lane <tgl@sss.pgh.pa.us> a écrit :

Jean-Christophe Arnu <jcarnu@gmail.com> writes:

[ empty_string_in_tsvector_v4.patch ]

I looked through this patch a bit. I don't agree with adding
these new error conditions to tsvector_setweight_by_filter and
tsvector_delete_arr. Those don't prevent bad lexemes from being
added to tsvectors, so AFAICS they can have no effect other than
breaking existing applications. In fact, tsvector_delete_arr is
one thing you could use to fix existing bad tsvectors, so making
it throw an error seems actually counterproductive.

Agreed.
The new patch included here does not change tsvector_setweight_by_filter()
anymore. Tests and docs are also upgraded.
This patch is not ready yet.

(By the same token, I think there's a good argument for
tsvector_delete_arr to just ignore nulls, not throw an error.
That's a somewhat orthogonal issue, though.)

Nulls are now ignored in tsvector_delete_arr().

What I'm wondering about more than that is whether array_to_tsvector
is the only place that can inject an empty lexeme ... don't we have
anything else that can add lexemes without going through the parser?

I crawled the docs [1]https://www.postgresql.org/docs/14/functions-textsearch.html in order to check each tsvector output from
functions and
operators :

* The only fonctions left that may lead to empty strings seem
both json_to_tsvector() and jsonb_to_tsvector(). Both functions use
parsetext
(in ts_parse.c) that seems to behave correctly and don't create "empty
string".
* concat operator "||' allows to compute a ts_vector containing "empty
string" if
one of its operands contains itself an empty string tsvector. This seems
perfectly
correct from the operator point of view... Should we change this
behaviour to
filter out empty strings ?

I also wonder if we should not also consider changing COPY FROM behaviour
on empty string lexemes.
Current version is just crashing on empty string lexemes. Should
we allow them anyway as COPY FROM input (it seems weird not to be able
to re-import dumped data) or "skipping them" just like array_to_tsvector()
does in the patched version (that may lead to database content changes) or
finally should we not change COPY behaviour ?

I admit this is a tricky bunch of questions I'm too rookie to answer.

[1]: https://www.postgresql.org/docs/14/functions-textsearch.html

--
Jean-Christophe Arnu

Attachments:

empty_string_in_tsvector_v5.patchtext/x-patch; charset=US-ASCII; name=empty_string_in_tsvector_v5.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 78812b2dbe..74e25db03c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12896,7 +12896,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
        </para>
        <para>
         Converts an array of lexemes to a <type>tsvector</type>.
-        The given strings are used as-is without further processing.
+        The given strings are used as-is. Some checks are performed
+        on array elements. Empty strings and <literal>NULL</literal> values
+        will cause an error to be raised.
        </para>
        <para>
         <literal>array_to_tsvector('{fat,cat,rat}'::text[])</literal>
@@ -13079,6 +13081,8 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Assigns the specified <parameter>weight</parameter> to elements
         of the <parameter>vector</parameter> that are listed
         in <parameter>lexemes</parameter>.
+        Some checks are performed on <parameter>lexemes</parameter>.
+        <literal>NULL</literal> values will cause an error to be raised.
        </para>
        <para>
         <literal>setweight('fat:2,4 cat:3 rat:5,6B'::tsvector, 'A', '{cat,rat}')</literal>
@@ -13256,6 +13260,8 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Removes any occurrences of the lexemes
         in <parameter>lexemes</parameter>
         from the <parameter>vector</parameter>.
+        <literal>NULL</literal> values in <parameter>lexemes</parameter>
+        will be ignored.
        </para>
        <para>
         <literal>ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])</literal>
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..e50d8a84e2 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -602,10 +602,9 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 		int			lex_len,
 					lex_pos;
 
+		// Ignore null values
 		if (nulls[i])
-			ereport(ERROR,
-					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
-					 errmsg("lexeme array may not contain nulls")));
+			continue;
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
@@ -761,13 +760,18 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, &dlexemes, &nulls, &nitems);
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/* Reject nulls and zero length strings (maybe we should just ignore them, instead?) */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..016e901995 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -85,6 +85,10 @@ SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
  'a':3A,4B 'b':2A 'ba':1237
 (1 row)
 
+SELECT $$'' '1' '2'$$::tsvector;
+ERROR:  syntax error in tsvector: "'' '1' '2'"
+LINE 1: SELECT $$'' '1' '2'$$::tsvector;
+               ^
 --Base tsquery test
 SELECT '1'::tsquery;
  tsquery 
@@ -1259,7 +1263,17 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
 (1 row)
 
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
-ERROR:  lexeme array may not contain nulls
+        ts_delete         
+--------------------------
+ 'base' 'hidden' 'strike'
+(1 row)
+
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
+        ts_delete         
+--------------------------
+ 'base' 'hidden' 'strike'
+(1 row)
+
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
                    unnest                    
 ---------------------------------------------
@@ -1330,6 +1344,8 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1375,6 +1391,12 @@ SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
+            setweight            
+---------------------------------
+ 'a' 'asd' 'w':5,6,12B,13A 'zxc'
+(1 row)
+
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
                           ts_filter                          
 -------------------------------------------------------------
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..79a7777407 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -17,6 +17,7 @@ SELECT $$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector;
 SELECT tsvectorin(tsvectorout($$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector));
 SELECT '''w'':4A,3B,2C,1D,5 a:8';
 SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
+SELECT $$'' '1' '2'$$::tsvector;
 
 --Base tsquery test
 SELECT '1'::tsquery;
@@ -240,6 +241,7 @@ SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3':
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '']);
 
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
 SELECT unnest('base hidden rebel spaceship strike'::tsvector);
@@ -252,6 +254,7 @@ SELECT tsvector_to_array('base hidden rebel spaceship strike'::tsvector);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
 
@@ -262,6 +265,7 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
 SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '']);
 
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
 SELECT ts_filter('base hidden rebel spaceship strike'::tsvector, '{a}');
#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jean-Christophe Arnu (#9)
1 attachment(s)
Re: Empty string in lexeme for tsvector

Jean-Christophe Arnu <jcarnu@gmail.com> writes:

(By the same token, I think there's a good argument for
tsvector_delete_arr to just ignore nulls, not throw an error.
That's a somewhat orthogonal issue, though.)

Nulls are now ignored in tsvector_delete_arr().

I think setweight() with an array should handle this the same as
ts_delete() with an array, so the attached v6 does it like that.

I also wonder if we should not also consider changing COPY FROM behaviour
on empty string lexemes.
Current version is just crashing on empty string lexemes. Should
we allow them anyway as COPY FROM input (it seems weird not to be able
to re-import dumped data) or "skipping them" just like array_to_tsvector()
does in the patched version (that may lead to database content changes) or
finally should we not change COPY behaviour ?

No, I don't think so. tsvector's restriction against empty lexemes was
certainly intentional from the beginning, so I wouldn't be surprised if
we'd run into semantic difficulties if we remove it. Moreover, we're
going on fourteen years with that restriction and we've had few
complaints, so there's no field demand to loosen it. It's clearly just
an oversight that array_to_tsvector() failed to enforce the restriction.

I polished the docs and tests a bit more, too. I think the attached
is committable -- any objections?

regards, tom lane

Attachments:

empty_string_in_tsvector_v6.patchtext/x-diff; charset=us-ascii; name=empty_string_in_tsvector_v6.patchDownload
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4b49dff2ff..24447c0017 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12920,8 +12920,10 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         <returnvalue>tsvector</returnvalue>
        </para>
        <para>
-        Converts an array of lexemes to a <type>tsvector</type>.
-        The given strings are used as-is without further processing.
+        Converts an array of text strings to a <type>tsvector</type>.
+        The given strings are used as lexemes as-is, without further
+        processing.  Array elements must not be empty strings
+        or <literal>NULL</literal>.
        </para>
        <para>
         <literal>array_to_tsvector('{fat,cat,rat}'::text[])</literal>
@@ -13104,6 +13106,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Assigns the specified <parameter>weight</parameter> to elements
         of the <parameter>vector</parameter> that are listed
         in <parameter>lexemes</parameter>.
+        The strings in <parameter>lexemes</parameter> are taken as lexemes
+        as-is, without further processing.  Strings that do not match any
+        lexeme in <parameter>vector</parameter> are ignored.
        </para>
        <para>
         <literal>setweight('fat:2,4 cat:3 rat:5,6B'::tsvector, 'A', '{cat,rat}')</literal>
@@ -13265,6 +13270,8 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
        <para>
         Removes any occurrence of the given <parameter>lexeme</parameter>
         from the <parameter>vector</parameter>.
+        The <parameter>lexeme</parameter> string is treated as a lexeme as-is,
+        without further processing.
        </para>
        <para>
         <literal>ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, 'fat')</literal>
@@ -13281,6 +13288,9 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
         Removes any occurrences of the lexemes
         in <parameter>lexemes</parameter>
         from the <parameter>vector</parameter>.
+        The strings in <parameter>lexemes</parameter> are taken as lexemes
+        as-is, without further processing.  Strings that do not match any
+        lexeme in <parameter>vector</parameter> are ignored.
        </para>
        <para>
         <literal>ts_delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])</literal>
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index 9236ebcc8f..11ccb5297c 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -322,10 +322,9 @@ tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
 		int			lex_len,
 					lex_pos;
 
+		/* Ignore null array elements, they surely don't match */
 		if (nulls[i])
-			ereport(ERROR,
-					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
-					 errmsg("lexeme array may not contain nulls")));
+			continue;
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
@@ -602,10 +601,9 @@ tsvector_delete_arr(PG_FUNCTION_ARGS)
 		int			lex_len,
 					lex_pos;
 
+		/* Ignore null array elements, they surely don't match */
 		if (nulls[i])
-			ereport(ERROR,
-					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
-					 errmsg("lexeme array may not contain nulls")));
+			continue;
 
 		lex = VARDATA(dlexemes[i]);
 		lex_len = VARSIZE(dlexemes[i]) - VARHDRSZ;
@@ -761,13 +759,21 @@ array_to_tsvector(PG_FUNCTION_ARGS)
 
 	deconstruct_array(v, TEXTOID, -1, false, TYPALIGN_INT, &dlexemes, &nulls, &nitems);
 
-	/* Reject nulls (maybe we should just ignore them, instead?) */
+	/*
+	 * Reject nulls and zero length strings (maybe we should just ignore them,
+	 * instead?)
+	 */
 	for (i = 0; i < nitems; i++)
 	{
 		if (nulls[i])
 			ereport(ERROR,
 					(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
 					 errmsg("lexeme array may not contain nulls")));
+
+		if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+					 errmsg("lexeme array may not contain empty strings")));
 	}
 
 	/* Sort and de-dup, because this is required for a valid tsvector. */
diff --git a/src/test/regress/expected/tstypes.out b/src/test/regress/expected/tstypes.out
index 2601e312df..92c1c6e10b 100644
--- a/src/test/regress/expected/tstypes.out
+++ b/src/test/regress/expected/tstypes.out
@@ -85,6 +85,10 @@ SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
  'a':3A,4B 'b':2A 'ba':1237
 (1 row)
 
+SELECT $$'' '1' '2'$$::tsvector;  -- error, empty lexeme is not allowed
+ERROR:  syntax error in tsvector: "'' '1' '2'"
+LINE 1: SELECT $$'' '1' '2'$$::tsvector;
+               ^
 --Base tsquery test
 SELECT '1'::tsquery;
  tsquery 
@@ -1258,8 +1262,12 @@ SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceshi
  'base' 'hidden' 'strike'
 (1 row)
 
-SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
-ERROR:  lexeme array may not contain nulls
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '', NULL]);
+        ts_delete         
+--------------------------
+ 'base' 'hidden' 'strike'
+(1 row)
+
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
                    unnest                    
 ---------------------------------------------
@@ -1328,8 +1336,11 @@ SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
  'base' 'hidden' 'rebel' 'spaceship' 'strike'
 (1 row)
 
+-- null and empty string are disallowed, since we mustn't make an empty lexeme
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
 ERROR:  lexeme array may not contain nulls
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
+ERROR:  lexeme array may not contain empty strings
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
  array_to_tsvector 
@@ -1367,14 +1378,12 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '
  'a':1C,3C 'asd':1C 'w':5,6,12B,13A 'zxc':81C,222C,567C
 (1 row)
 
-SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '', NULL]);
             setweight            
 ---------------------------------
  'a' 'asd' 'w':5,6,12B,13A 'zxc'
 (1 row)
 
-SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
-ERROR:  lexeme array may not contain nulls
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
                           ts_filter                          
 -------------------------------------------------------------
diff --git a/src/test/regress/sql/tstypes.sql b/src/test/regress/sql/tstypes.sql
index 30c8c702f0..61e8f49c91 100644
--- a/src/test/regress/sql/tstypes.sql
+++ b/src/test/regress/sql/tstypes.sql
@@ -17,6 +17,7 @@ SELECT $$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector;
 SELECT tsvectorin(tsvectorout($$'\\as' ab\c ab\\c AB\\\c ab\\\\c$$::tsvector));
 SELECT '''w'':4A,3B,2C,1D,5 a:8';
 SELECT 'a:3A b:2a'::tsvector || 'ba:1234 a:1B';
+SELECT $$'' '1' '2'$$::tsvector;  -- error, empty lexeme is not allowed
 
 --Base tsquery test
 SELECT '1'::tsquery;
@@ -239,7 +240,7 @@ SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3':
 SELECT ts_delete('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel']);
 SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel','rebel']);
-SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', NULL]);
+SELECT ts_delete('base hidden rebel spaceship strike'::tsvector, ARRAY['spaceship','leya','rebel', '', NULL]);
 
 SELECT unnest('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D strike:3'::tsvector);
 SELECT unnest('base hidden rebel spaceship strike'::tsvector);
@@ -251,7 +252,9 @@ SELECT tsvector_to_array('base:7 hidden:6 rebel:1 spaceship:2,33A,34B,35C,36D st
 SELECT tsvector_to_array('base hidden rebel spaceship strike'::tsvector);
 
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship','strike']);
+-- null and empty string are disallowed, since we mustn't make an empty lexeme
 SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', NULL]);
+SELECT array_to_tsvector(ARRAY['base','hidden','rebel','spaceship', '']);
 -- array_to_tsvector must sort and de-dup
 SELECT array_to_tsvector(ARRAY['foo','bar','baz','bar']);
 
@@ -260,8 +263,7 @@ SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c');
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a}');
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a}');
 SELECT setweight('a:1,3A asd:1C w:5,6,12B,13A zxc:81,222A,567'::tsvector, 'c', '{a,zxc}');
-SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', '{a,zxc}');
-SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', NULL]);
+SELECT setweight('a asd w:5,6,12B,13A zxc'::tsvector, 'c', ARRAY['a', 'zxc', '', NULL]);
 
 SELECT ts_filter('base:7A empir:17 evil:15 first:11 galact:16 hidden:6A rebel:1A spaceship:2A strike:3A victori:12 won:9'::tsvector, '{a}');
 SELECT ts_filter('base hidden rebel spaceship strike'::tsvector, '{a}');