BUG #5219: Segfault in to_tsvector
The following bug has been logged online:
Bug reference: 5219
Logged by: Kenaniah Cerny
Email address: kenaniah@gmail.com
PostgreSQL version: 8.4.1
Operating system: Centos5.2 -- Linux 2.6.18-92.1.10.el5 #1 SMP i686 athlon i386 GNU/Linux
Description: Segfault in to_tsvector
Details:
Full backtrace: http://pgsql.privatepaste.com/5411abf8f3
The issue takes place running this query:
http://pgsql.privatepaste.com/35064cbba8
Crash is attributed to this index definition:
CREATE INDEX "anime_titles_idx_name_simple_text" ON "public"."anime_titles"
USING gin ((to_tsvector('simple'::regconfig, name)));
I believe the issue is caused by possibly non-UTF-8 data. Both the server
and the client (a PHP script using PDO's pgsql driver) are using UTF-8. The
string causing this issue is stored in the database in a text field and
looks like this:
http://s801.photobucket.com/albums/yy299/kenaniah972/?action=view&current=issue.png
After output into an HTML input field and resubmission through Firefox, the
string that is passed through to the DB looks like this:
http://s801.photobucket.com/albums/yy299/kenaniah972/?action=view&current=submitted.png
(The &# characters were manually omitted in submission)
I don't profess to know anything about encodings, but I don't think this is
valid UTF-8 input. I might be wrong. All I do know is that this causes the
to_tsvector part of the gin index to throw a segfault in the insert
statement, rather than returning an invalid UTF-8 input error or just plain
working.
"Kenaniah Cerny" <kenaniah@gmail.com> writes:
> Description: Segfault in to_tsvector
> Full backtrace: http://pgsql.privatepaste.com/5411abf8f3
This looks like the known problem that ts_stat fails on an empty
tsvector. Can you try this patch
http://archives.postgresql.org/pgsql-committers/2009-10/msg00056.php
or just pick up 8.4 branch tip from CVS?
If that does fix it, I don't think this is an encoding problem,
but rather that the name doesn't contain anything that is recognized
as a word by the textsearch configuration you're using.
regards, tom lane
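
One way to check the "no recognized words" theory is to look for names that
produce an empty tsvector. The queries below are only a sketch: they assume the
table and column from the index definition in the report
(public.anime_titles.name), and the literal '!!! ???' is a made-up sample, not
the reporter's data. Since the report describes a crash around empty results,
they are best run on a server that already has the fix applied.

```sql
-- Sketch only: a string with no word-like tokens should yield an
-- empty tsvector under the 'simple' configuration.
SELECT to_tsvector('simple'::regconfig, '!!! ???');

-- List rows whose name produces no lexemes at all.
-- length(tsvector) returns the number of lexemes in the vector.
SELECT name
FROM public.anime_titles
WHERE length(to_tsvector('simple'::regconfig, name)) = 0;
```

If the second query returns the problem row, that points to the empty-tsvector
case rather than to invalid UTF-8 input.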
Thanks,
The patch took some massaging, but took care of the issue when applied to
the 8.4.1 source.
Kenaniah Cerny
On Sat, Nov 28, 2009 at 7:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kenaniah Cerny" <kenaniah@gmail.com> writes:
> > Description: Segfault in to_tsvector
> > Full backtrace: http://pgsql.privatepaste.com/5411abf8f3
>
> This looks like the known problem that ts_stat fails on an empty
> tsvector. Can you try this patch
> http://archives.postgresql.org/pgsql-committers/2009-10/msg00056.php
> or just pick up 8.4 branch tip from CVS?
>
> If that does fix it, I don't think this is an encoding problem,
> but rather that the name doesn't contain anything that is recognized
> as a word by the textsearch configuration you're using.
>
> regards, tom lane