Re: tsearch2 column update produces "word too long" error

Started by Markus Wollny over 22 years ago, 3 messages, general
#1 Markus Wollny
Markus.Wollny@computec.de

Hi!

Now I really couldn't code C to save my life, but I managed to elicit
some more debugging info. It's still dumb-user-interaction as suspected,
but this is an issue I have to take into account as a basis; here's the
"patch" for ts_cfg.c:

  if (lenlemm >= MAXSTRLEN)
      ereport(ERROR,
              (errcode(ERRCODE_SYNTAX_ERROR),
!              errmsg("word is too long(%d): %s", lenlemm, lemm)));

Now when I try

UPDATE ct_com_board_message
SET ftindex=to_tsvector('default',coalesce(user_login,'') ||' '||
    coalesce(title,'') ||' '|| coalesce(text,''));

I eventually get:

ERROR: word is too long(2724):
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajjajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajjajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajjajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajjajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajjajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajjajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajjajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajjajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajjajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajajajjajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj

This is a brightly shining example of utterly wanton user-stupidity, I
think: A 2k+ string of |:ja:|. Input like that cannot be helped, though
- if he'd been a bit more imaginative, he could have used a few dozen
"Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch" in a row or
anything else; unfortunately there's no app that could automatically
whack a user if he's doing something stupid.

But on the other hand I cannot think of any reason why crap like that
should be indexed in the first place. Therefore I would like to see some
sort of option allowing me to still use tsearch2 but actually
automatically excluding anything exceeding MAXSTRLEN - so the UPDATE
might throw a NOTICE (if anything at all) but still get on with the
rest.
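
The behaviour asked for above (skip the oversized word, emit a notice, carry on) can be sketched in plain C. This is only an illustration and not the actual tsearch2 code or the patch that was later submitted: the function names are hypothetical, words are split on whitespace only, and MAXSTRLEN is assumed to be 2048 because the thread only says that words are limited to 2K.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Assumed value for illustration; the thread only mentions a 2K limit. */
#define MAXSTRLEN 2048

/* Callback invoked for every word that is short enough to be indexed. */
typedef void (*word_cb)(const char *word, size_t len, void *arg);

/*
 * Hypothetical helper: walk over text, call emit() for each
 * whitespace-delimited word shorter than MAXSTRLEN, and print a notice
 * for every longer word instead of aborting the whole statement.
 * Returns the number of words that were skipped.
 */
static size_t
index_words_skipping_long(const char *text, word_cb emit, void *arg)
{
    size_t      skipped = 0;
    const char *p = text;

    while (*p != '\0')
    {
        const char *start;
        size_t      len;

        /* skip leading whitespace */
        while (*p != '\0' && isspace((unsigned char) *p))
            p++;
        start = p;
        /* consume one word */
        while (*p != '\0' && !isspace((unsigned char) *p))
            p++;
        len = (size_t) (p - start);
        if (len == 0)
            break;              /* only trailing whitespace was left */

        if (len >= MAXSTRLEN)
        {
            /* report and skip rather than raising an error */
            fprintf(stderr, "NOTICE: word is too long (%zu), skipped\n", len);
            skipped++;
            continue;
        }
        if (emit != NULL)
            emit(start, len, arg);
    }
    return skipped;
}

/* Example callback: count the words that would be indexed. */
static void
count_word(const char *word, size_t len, void *arg)
{
    (void) word;
    (void) len;
    ++*(size_t *) arg;
}
```

Calling index_words_skipping_long(text, count_word, &kept) on a message that contains the 2724-character string would report one skipped word and still pass the remaining words on, which is the "NOTICE but get on with the rest" behaviour described above.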

An alteration like that does however exceed my limited abilities with C
by far and I don't want to mess with something I do not fully understand
and then use that mess in a production environment. Is there a way to
get around this problem with oversized words?

Kind regards

Markus


-----Original Message-----
From: Oleg Bartunov [mailto:oleg@sai.msu.su]
Sent: Friday, 21 November 2003 15:13
To: Markus Wollny
Cc: pgsql-general@postgresql.org
Subject: Re: AW: [GENERAL] tsearch2 column update produces "word too
long" error

On Fri, 21 Nov 2003, Markus Wollny wrote:

Hello!

From: Oleg Bartunov [mailto:oleg@sai.msu.su]
Sent: Friday, 21 November 2003 13:06
To: Markus Wollny
Cc: pgsql-general@postgresql.org

Word length is limited to 2K. What exactly is the word
tsearch2 complained about?
'Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'
is fine :)

This was a silly example, I know - it is a long word, but not too long
to worry a machine. The offending word will surely be much longer, but
as a matter of fact, I cannot think of any user actually typing a 2k+
string without any spaces in between. I'm not sure which word
tsearch2 complained about; it doesn't tell, and even logging did not
provide me with any more detail:

2003-11-21 14:06:44 [26497] ERROR: 42601: word is too long
LOCATION: parsetext_v2, ts_cfg.c:294
STATEMENT: UPDATE ct_com_board_message
SET
ftindex=to_tsvector('default',coalesce(user_login,'') ||' '||
coalesce(title,'') ||' '|| coalesce(text,''));

Is there some way to find the exact position?

I'm afraid you need to hack ts_cfg.c:294 yourself to print the word
that's bugging you :)

btw, don't forget to configure the dictionaries properly, so you
don't have a lot of unique words.

I won't forget that; I just wanted to run a quick first test
before diving deeper into Ispell and other issues which are as yet a bit
of a mystery to me.

Kind Regards

Markus

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Markus Wollny (#1)
Re: tsearch2 column update produces "word too

Markus,

thanks for your analysis! I think we'll submit a patch to throw a NOTICE
and skip these useless words when indexing.

Oleg
On Mon, 24 Nov 2003, Markus Wollny wrote:

[...]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#3 Teodor Sigaev
teodor@sigaev.ru
In reply to: Markus Wollny (#1)

Patch submitted to 7.5devel and REL7_4_STABLE

--
Teodor Sigaev E-mail: teodor@sigaev.ru