BUG #7793: tsearch_data thesaurus size limit

Started by David Boutin, over 13 years ago; 3 messages; list: bugs
#1 David Boutin
davios@gmail.com

The following bug has been logged on the website:

Bug reference: 7793
Logged by: David Boutin
Email address: davios@gmail.com
PostgreSQL version: 9.1.7
Operating system: Ubuntu 12.04 LTS 64bits
Description:

Hi all,

I like using thesaurus files with a dedicated text search configuration
to make searching with synonyms easier.
Today I tried to create a thesaurus of artist names (from the MusicBrainz
database), including their synonyms/aliases.

This thesaurus file is about 1M lines.

I found that it is impossible to use it with FTS: "plainto_tsquery" returns
unexpected errors, and according to the PostgreSQL log file some names even
cause a segmentation fault.
I then reduced the size of the file step by step and ended up with a file
of 65535 lines, which works fine, whereas a file of 65536 lines crashes
my queries.
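
The 65535/65536 boundary is consistent with a 16-bit counter wrapping around. The following is a minimal sketch of that arithmetic only; `count_entries_u16` is a hypothetical helper, not PostgreSQL code, and it does not establish which variable actually overflows in the thesaurus dictionary.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not PostgreSQL code): count nlines entries with a
 * 16-bit counter, the way an entry id stored in a uint16 would advance.
 */
static uint16_t
count_entries_u16(uint32_t nlines)
{
    uint16_t id = 0;
    uint32_t i;

    for (i = 0; i < nlines; i++)
        id++;               /* wraps back to 0 after reaching 65535 */
    return id;
}
```

With 65535 lines the counter still holds the true count; with one more line it wraps to 0, so anything sized or indexed by it no longer matches the file.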

Is there any way to increase this thesaurus size limit?

Many thanks in advance for your help

Kind regards
David

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Boutin (#1)
Re: BUG #7793: tsearch_data thesaurus size limit

davios@gmail.com writes:

[ thesaurus dictionary fails for more than 64K entries ]

I see a whole bunch of uses of "uint16" in
src/backend/tsearch/dict_thesaurus.c. It's not immediately clear which
of these would need to be widened to support more entries, or what the
storage cost of doing that would be. We probably should at least put in
a range check so that you get a clean failure instead of a crash though.

regards, tom lane
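
A range check of the kind suggested above could look roughly like the following. This is a minimal sketch under stated assumptions: `advance_entry_id` is a hypothetical helper, not a function in dict_thesaurus.c, and real backend code would report the failure through PostgreSQL's own error reporting rather than a boolean return.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not PostgreSQL code): advance a 16-bit entry id
 * without letting it wrap. Returns false when the id space is exhausted,
 * so the caller can fail cleanly instead of continuing with a wrapped id.
 */
static bool
advance_entry_id(uint16_t *id)
{
    if (*id == UINT16_MAX)
        return false;       /* one more entry would wrap to 0 */
    (*id)++;
    return true;
}
```

The caller would check the return value after each thesaurus entry and stop loading with an error message when it is false.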


#3 David Boutin
davios@gmail.com
In reply to: Tom Lane (#2)
Re: BUG #7793: tsearch_data thesaurus size limit

OK, thanks for your reply, Tom.

On our side, we made some changes to this file:
src/backend/tsearch/dict_thesaurus.c
We then recompiled PG 9.2.2 with this patch, and now the thesaurus works fine
with more than 64k entries, and query run time stays low, as expected.

Here are our changes to the file:

 typedef struct LexemeInfo
 {
- uint16 idsubst; /* entry's number in DictThesaurus->subst */
+ uint32 idsubst; /* entry's number in DictThesaurus->subst */
  uint16 posinsubst; /* pos info in entry */

...

 static void
-newLexeme(DictThesaurus *d, char *b, char *e, uint16 idsubst, uint16 posinsubst)
+newLexeme(DictThesaurus *d, char *b, char *e, uint32 idsubst, uint16 posinsubst)
 {
  TheLexeme  *ptr;

...

 static void
-addWrd(DictThesaurus *d, char *b, char *e, uint16 idsubst, uint16 nwrd, uint16 posinsubst, bool useasis)
+addWrd(DictThesaurus *d, char *b, char *e, uint32 idsubst, uint16 nwrd, uint16 posinsubst, bool useasis)
 {

...

 thesaurusRead(char *filename, DictThesaurus *d) {
 tsearch_readline_state trst;
- uint16 idsubst = 0;
+ uint32 idsubst = 0;
 bool useasis = false;

...

 static bool
-matchIdSubst(LexemeInfo *stored, uint16 idsubst)
+matchIdSubst(LexemeInfo *stored, uint32 idsubst)
 {
  bool res = true;
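
On the storage-cost question raised earlier in the thread, a rough way to check is to compare struct sizes before and after widening. The sketch below is illustrative only: the excerpt above elides the rest of LexemeInfo, so the extra 16-bit field and the two pointers (`extra16`, `link_a`, `link_b`) are assumptions with hypothetical names, as are the `Info16`/`Info32` types and the two size functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in with a 16-bit id; fields after posinsubst are assumed. */
typedef struct Info16
{
    uint16_t idsubst;
    uint16_t posinsubst;
    uint16_t extra16;           /* assumed additional 16-bit field */
    struct Info16 *link_a;      /* assumed pointer field */
    struct Info16 *link_b;      /* assumed pointer field */
} Info16;

/* Same stand-in with the id widened to 32 bits. */
typedef struct Info32
{
    uint32_t idsubst;
    uint16_t posinsubst;
    uint16_t extra16;           /* assumed additional 16-bit field */
    struct Info32 *link_a;      /* assumed pointer field */
    struct Info32 *link_b;      /* assumed pointer field */
} Info32;

static size_t size_with_uint16(void) { return sizeof(Info16); }
static size_t size_with_uint32(void) { return sizeof(Info32); }
```

Under that assumed layout, on common 32-bit and 64-bit ABIs the three 16-bit fields are already padded out to the pointer alignment, so widening the id fills the padding and the two sizes come out equal. Whether that holds for the real struct depends on its actual fields.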

Kind regards.
David

On Mon, Jan 7, 2013 at 1:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

davios@gmail.com writes:

[ thesaurus dictionary fails for more than 64K entries ]

I see a whole bunch of uses of "uint16" in
src/backend/tsearch/dict_thesaurus.c. It's not immediately clear which
of these would need to be widened to support more entries, or what the
storage cost of doing that would be. We probably should at least put in
a range check so that you get a clean failure instead of a crash though.

regards, tom lane