Very bad FTS performance with the Polish config
Hello
This has been discussed in #postgresql and posted to -performance a
couple days ago, but no solution has been found. The discussion can be
found here:
http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.php
I just finished implementing a "search engine" for my site and found
ts_headline extremely slow when used with a Polish tsearch
configuration, while fast with English. All of it boils down to a simple
testcase, but first some background.
I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
(2.6.21), then switched both installations to 8.3.8 (both packages
compiled from source, but provided by the distro - port/emerge). The
Polish dictionaries and config were created according to this article
(it's in Polish, but the code is self-explanatory):
http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
Now for the testcase:
text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.'
# explain analyze select ts_headline('polish', text,
plainto_tsquery('polish', 'foobar'));
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470
rows=1 loops=1)
Total runtime: 6.524 ms
(2 rows)
# explain analyze select ts_headline('english', text,
plainto_tsquery('english', 'foobar'));
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895
rows=1 loops=1)
Total runtime: 0.935 ms
(2 rows)
# explain analyze select ts_headline('simple', text,
plainto_tsquery('simple', 'foobar'));
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660
rows=1 loops=1)
Total runtime: 0.697 ms
(2 rows)
#
As you can see, the results differ by an order of magnitude between
Polish and English. While in this simple testcase it's a non-issue, in
the real world this translates into enormous overhead.
One of the queries I ran testing my site's search function took
1870ms. When I took that query and changed all ts_headline(foo) calls to
just foo, the time dropped below 100ms. That's the difference between
something completely unacceptable and something quite useful.
I can post various details about the hardware, software and specific
queries, but the testcases speak for themselves. I'm sure you can easily
reproduce my results.
I'm putting my code into production tomorrow, since I can't wait
anymore. Hints would be very much appreciated!
cheers,
Wojciech Knapik
PS. This issue is not related to the loading time of dictionaries, or
calls to ts_headline for results that won't be displayed. A few other
details can be found here
http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with
snippets of my conversations in #postgresql that lead to this testcase.
Big thanks to RhodiumToad for helping me with fts for the last couple
days ;]
Wojciech Knapik escreveu:
PS. This issue is not related to the loading time of dictionaries, or
calls to ts_headline for results that won't be displayed.
So what? Could you post the profiling of that query?
--
Euler Taveira de Oliveira
http://www.timbira.com/
Euler Taveira de Oliveira wrote:
PS. This issue is not related to the loading time of dictionaries, or
calls to ts_headline for results that won't be displayed.So what? Could you post the profiling of that query?
Polish:
http://pastie.textmate.org/private/8lhmnbvde43lfjoxc52r1q
English:
http://pastie.textmate.org/private/4iaipottrmjmfxfykz94mw
cheers,
Wojciech Knapik
PS. Sorry for the double post.
Wojciech Knapik escreveu:
Euler Taveira de Oliveira wrote:
PS. This issue is not related to the loading time of dictionaries, or
calls to ts_headline for results that won't be displayed.So what? Could you post the profiling of that query?
I was talking about gprof (--enable-profiling), oprofile [1]http://wiki.postgresql.org/wiki/Profiling_with_OProfile or similar tools.
But it seems the slow step is the sort one.
[1]: http://wiki.postgresql.org/wiki/Profiling_with_OProfile
--
Euler Taveira de Oliveira
http://www.timbira.com/
Wojciech Knapik <webmaster@wolniartysci.pl> writes:
I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
(2.6.21), then switched both installations to 8.3.8 (both packages
compiled from source, but provided by the distro - port/emerge). The
Polish dictionaries and config were created according to this article
(it's in Polish, but the code is self-explanatory):
http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
I tried to duplicate this test, but got no further than here:
u8=# CREATE TEXT SEARCH DICTIONARY polish_ispell (
TEMPLATE = ispell,
DictFile = polish,
AffFile = polish,
StopWords = polish
);
ERROR: syntax error
CONTEXT: line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": " L E C > -C,G�EM #zalec (15a)
"
u8=#
Seems there's something about the current version of the dictionary that
we don't like. I used sjp-ispell-pl-20091117-src.tar.bz2 ...
regards, tom lane
Wojciech,
your polish_english, polish configurations uses ispell language and slow,
while english configuration doesn't contains ispell. So, what's your
complains ? Try add ispell dictionary to english configuration and see
timings.
Oleg
On Wed, 18 Nov 2009, Wojciech Knapik wrote:
Hello
This has been discussed in #postgresql and posted to -performance a
couple days ago, but no solution has been found. The discussion can be
found here:
http://archives.postgresql.org/pgsql-performance/2009-11/msg00162.phpI just finished implementing a "search engine" for my site and found
ts_headline extremely slow when used with a Polish tsearch
configuration, while fast with English. All of it boils down to a simple
testcase, but first some background.I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
(2.6.21), then switched both installations to 8.3.8 (both packages
compiled from source, but provided by the distro - port/emerge). The
Polish dictionaries and config were created according to this article
(it's in Polish, but the code is self-explanatory):http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
Now for the testcase:
text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.'# explain analyze select ts_headline('polish', text,
plainto_tsquery('polish', 'foobar'));
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470
rows=1 loops=1)
Total runtime: 6.524 ms
(2 rows)# explain analyze select ts_headline('english', text,
plainto_tsquery('english', 'foobar'));
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895
rows=1 loops=1)
Total runtime: 0.935 ms
(2 rows)# explain analyze select ts_headline('simple', text,
plainto_tsquery('simple', 'foobar'));
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660
rows=1 loops=1)
Total runtime: 0.697 ms
(2 rows)#
As you can see, the results differ by an order of magnitude between
Polish and English. While in this simple testcase it's a non-issue, in
the real world this translates into enormous overhead.One of the queries I ran testing my site's search function took
1870ms. When I took that query and changed all ts_headline(foo) calls to
just foo, the time dropped below 100ms. That's the difference between
something completely unacceptable and something quite useful.I can post various details about the hardware, software and specific
queries, but the testcases speak for themselves. I'm sure you can easily
reproduce my results.I'm putting my code into production tomorrow, since I can't wait
anymore. Hints would be very much appreciated!cheers,
Wojciech KnapikPS. This issue is not related to the loading time of dictionaries, or
calls to ts_headline for results that won't be displayed. A few other
details can be found here
http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with
snippets of my conversations in #postgresql that lead to this testcase.
Big thanks to RhodiumToad for helping me with fts for the last couple
days ;]
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Tom Lane wrote:
I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0
(2.6.21), then switched both installations to 8.3.8 (both packages
compiled from source, but provided by the distro - port/emerge). The
Polish dictionaries and config were created according to this article
(it's in Polish, but the code is self-explanatory):http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/
I tried to duplicate this test, but got no further than here:
u8=# CREATE TEXT SEARCH DICTIONARY polish_ispell (
TEMPLATE = ispell,
DictFile = polish,
AffFile = polish,
StopWords = polish
);
ERROR: syntax error
CONTEXT: line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": " L E C > -C,GĹEM #zalec (15a)
"
u8=#Seems there's something about the current version of the dictionary that
we don't like. I used sjp-ispell-pl-20091117-src.tar.bz2 ...
Here are the files I used (polish.affix, polish.dict already generated):
http://wolniartysci.pl/pl.tar.gz
These should work fine. I'd be grateful if you could test and see if you
get similar results.
cheers,
Wojciech Knapik
PS. Weird, I get the emails without a reply-to set for the list..
your polish_english, polish configurations uses ispell language and slow,
while english configuration doesn't contains ispell. So, what's your
complains ? Try add ispell dictionary to english configuration and see
timings.
Oh, so this is not anomalous ? These are the expected speeds for an
ispell dictionary ? I didn't realize that. Sorry for the bother then. It
just seemed way too slow to be practical.
cheers,
Wojciech Knapik
On Wed, 18 Nov 2009, Wojciech Knapik wrote:
your polish_english, polish configurations uses ispell language and slow,
while english configuration doesn't contains ispell. So, what's your
complains ? Try add ispell dictionary to english configuration and see
timings.Oh, so this is not anomalous ? These are the expected speeds for an ispell
dictionary ? I didn't realize that. Sorry for the bother then. It just seemed
way too slow to be practical.
You can see real timings using ts_lexize() function for different dictionaries
(try several time to avoid cold-start problem) instead of ts_headline(),
which involves other factors.
On my test machine I see no real difference between very simple dictionary
and french ispell, snowball dictionaries:
dev-oleg=# select ts_lexize('simple','voila');
ts_lexize
-----------
{voila}
(1 row)
Time: 0.282 ms
dev-oleg=# select ts_lexize('simple','voila');
ts_lexize
-----------
{voila}
(1 row)
Time: 0.269 ms
dev-oleg=# select ts_lexize('french_stem','voila');
ts_lexize
-----------
{voil}
(1 row)
Time: 0.187 ms
I see no big difference in ts_headline as well:
dev-oleg=# select ts_headline('english','I can do voila', 'voila'::tsquery);
ts_headline
-----------------------
I can do <b>voila</b>
(1 row)
Time: 0.265 ms
dev-oleg=# select ts_headline('nomaofr','I can do voila', 'voila'::tsquery);
ts_headline
-----------------------
I can do <b>voila</b>
(1 row)
Time: 0.299 ms
This is 8.4.1 version of PostgreSQL.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote:
your polish_english, polish configurations uses ispell language
and slow, while english configuration doesn't contains ispell.
So, what's your complains ? Try add ispell dictionary to english
configuration and see timings.Oh, so this is not anomalous ? These are the expected speeds for an
ispell dictionary ? I didn't realize that. Sorry for the bother
then. It just seemed way too slow to be practical.You can see real timings using ts_lexize() function for different
dictionaries (try several time to avoid cold-start problem) instead
of ts_headline(), which involves other factors.On my test machine I see no real difference between very simple
dictionary and french ispell, snowball dictionaries:
ts_lexize seems to be just as fast for simple, polish_ispell and
english_stem with the 'voila' argument.
polish_ispell is in fact *faster* for the lorem ipsum text repeated a
couple times (10 ?). Which suggests that the issue is with ts_headline
iteself.
I see no big difference in ts_headline as well:
dev-oleg=# select ts_headline('english','I can do voila',
'voila'::tsquery);
ts_headline
-----------------------
I can do <b>voila</b>
(1 row)Time: 0.265 ms
Yes, for 4-word texts the results are similar.
Try that with a longer text and the difference becomes more and more
significant. For the lorem ipsum text, 'polish' is about 4 times slower,
than 'english'. For 5 repetitions of the text, it's 6 times, for 10
repetitions - 7.5 times...
This is 8.4.1 version of PostgreSQL.
An that was 8.3.8/OSX.
cheers,
Wojciech Knapik
2009/11/18 Oleg Bartunov <oleg@sai.msu.su>:
On Wed, 18 Nov 2009, Wojciech Knapik wrote:
your polish_english, polish configurations uses ispell language and slow,
while english configuration doesn't contains ispell. So, what's your
complains ? Try add ispell dictionary to english configuration and see
timings.Oh, so this is not anomalous ? These are the expected speeds for an ispell
dictionary ? I didn't realize that. Sorry for the bother then. It just
seemed way too slow to be practical.You can see real timings using ts_lexize() function for different
dictionaries
(try several time to avoid cold-start problem) instead of ts_headline(),
which involves other factors.On my test machine I see no real difference between very simple dictionary
and french ispell, snowball dictionaries:
It's depend on language (and dictionary sizes).
for czech:
postgres=# select ts_lexize('simple','vody');
ts_lexize
-----------
{vody}
(1 row)
Time: 0.785 ms
postgres=# select ts_lexize('cspell','vody');
ts_lexize
-----------
{voda}
(1 row)
Time: 1.041 ms
I afraid so czech and polland language is very hard (with long affix file).
Regards
Pavel
Show quoted text
dev-oleg=# select ts_lexize('simple','voila');
ts_lexize
-----------
{voila}
(1 row)Time: 0.282 ms
dev-oleg=# select ts_lexize('simple','voila');
ts_lexize
-----------
{voila}
(1 row)Time: 0.269 ms
dev-oleg=# select ts_lexize('french_stem','voila');
ts_lexize
-----------
{voil}
(1 row)Time: 0.187 ms
I see no big difference in ts_headline as well:
dev-oleg=# select ts_headline('english','I can do voila', 'voila'::tsquery);
ts_headline
-----------------------
I can do <b>voila</b>
(1 row)Time: 0.265 ms
dev-oleg=# select ts_headline('nomaofr','I can do voila', 'voila'::tsquery);
ts_headline
-----------------------
I can do <b>voila</b>
(1 row)Time: 0.299 ms
This is 8.4.1 version of PostgreSQL.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, 18 Nov 2009, Wojciech Knapik wrote:
Yes, for 4-word texts the results are similar.
Try that with a longer text and the difference becomes more and more
significant. For the lorem ipsum text, 'polish' is about 4 times slower, than
'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions -
7.5 times...
Again, I see nothing unclear here, since dictionaries (as specified
in configuration) apply to ALL words in document. The more words in
document, the more overhead.
You can pass not all document to ts_headline, but just part, to have
predicted performance. This is useful in any case.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote:
Yes, for 4-word texts the results are similar.
Try that with a longer text and the difference becomes more and more
significant. For the lorem ipsum text, 'polish' is about 4 times
slower, than 'english'. For 5 repetitions of the text, it's 6 times,
for 10 repetitions - 7.5 times...Again, I see nothing unclear here, since dictionaries (as specified
in configuration) apply to ALL words in document. The more words in
document, the more overhead.
You're missing the point. I'm not surprised that the function takes more
time for larger input texts - that's obvious. The thing is, the
computation times rise more steeply when the Polish config is used.
Steeply enough, that the difference between the Polish and English
configs becomes enormous in practical cases.
Now this may be expected behaviour, but since I don't know if it is, I
posted to the mailing lists to find out. If you're saying this is ok and
there's nothing to fix here, then there's nothing more to discuss and we
may consider the thread closed.
If not, ts_headline deserves a closer look.
cheers,
Wojciech Knapik
ts_headline calls ts_lexize equivalent to break the text. Off course there
is algorithm to process the tokens and generate the headline. I would be
really surprised if the algorithm to generate the headline is somehow
dependent on language (as it only processes the tokens). So Oleg is right
when he says ts_lexize is something to be checked.
I will try to replicate what you are trying to do but in the meantime can
you run the same ts_headline under psql multiple times and paste the result.
-Sushant.
2009/11/19 Wojciech Knapik <webmaster@wolniartysci.pl>
Show quoted text
Oleg Bartunov wrote:
Yes, for 4-word texts the results are similar.
Try that with a longer text and the difference becomes more and more
significant. For the lorem ipsum text, 'polish' is about 4 times slower,
than 'english'. For 5 repetitions of the text, it's 6 times, for 10
repetitions - 7.5 times...Again, I see nothing unclear here, since dictionaries (as specified
in configuration) apply to ALL words in document. The more words in
document, the more overhead.You're missing the point. I'm not surprised that the function takes more
time for larger input texts - that's obvious. The thing is, the computation
times rise more steeply when the Polish config is used. Steeply enough, that
the difference between the Polish and English configs becomes enormous in
practical cases.Now this may be expected behaviour, but since I don't know if it is, I
posted to the mailing lists to find out. If you're saying this is ok and
there's nothing to fix here, then there's nothing more to discuss and we may
consider the thread closed.
If not, ts_headline deserves a closer look.cheers,
Wojciech Knapik--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
ts_headline calls ts_lexize equivalent to break the text. Off course
there is algorithm to process the tokens and generate the headline. I
would be really surprised if the algorithm to generate the headline is
somehow dependent on language (as it only processes the tokens). So Oleg
is right when he says ts_lexize is something to be checked.
ts_lexize performs well for all dictionaries, nothing to see here.
It's ts_headline that's causing the problem. But that's just IMHO.
I will try to replicate what you are trying to do but in the meantime
can you run the same ts_headline under psql multiple times and paste the
result.
All the results I pasted had the medium run time out of multiple calls.
These were certainly not some extreme corner cases.
cheers,
Wojciech Knapik
Wojciech Knapik <webmaster@wolniartysci.pl> writes:
Tom Lane wrote:
I tried to duplicate this test, but got no further than here:
ERROR: syntax error
CONTEXT: line 174 of configuration file "/home/tgl/testversion/share/postgresql/tsearch_data/polish.affix": " L E C > -C,G�EM #zalec (15a)
Here are the files I used (polish.affix, polish.dict already generated):
http://wolniartysci.pl/pl.tar.gz
Your files were the same as mine. I eventually figured out the problem
was I was using C locale, in which some of those letters aren't letters.
(I wonder whether the tsearch config file parsers could be made less
sensitive to this by avoiding t_isalpha tests.) In pl_PL.ut8 locale
I could see that the example is indeed much slower. Oleg is right that
the fundamental difference is that this Polish configuration is using
an ispell dictionary where the simple English configuration is not.
But, just for the record, here's what an oprofile profile looks like:
samples % image name symbol name
7480 20.9477 postgres RS_execute
5370 15.0386 postgres pg_utf_mblen
4138 11.5884 postgres pg_mblen
3756 10.5187 postgres mb_strchr
2880 8.0654 postgres FindWord
2754 7.7126 postgres CheckAffix
1576 4.4136 postgres NormalizeSubWord
966 2.7053 postgres FindAffixes
896 2.5092 postgres TParserGet
742 2.0780 postgres AllocSetAlloc
420 1.1762 postgres AllocSetFree
396 1.1090 postgres addHLParsedLex
384 1.0754 postgres LexizeExec
So about 55% of the time is going into affix pattern matching.
I wonder whether that couldn't be made faster. A lot of the cycles
are spent on coping with variable-length characters --- perhaps the
ispell code should convert to wchar representation before doing this?
regards, tom lane
Tom Lane wrote:
*SNIP*
So about 55% of the time is going into affix pattern matching.
I wonder whether that couldn't be made faster. A lot of the cycles
are spent on coping with variable-length characters --- perhaps the
ispell code should convert to wchar representation before doing this?
Thanks a lot for looking into this. I hope this will lead to some
improvements one day.
Unfortunately my C skills were pretty basic years ago and I haven't used
the language since, so I can't be of much help..
cheers,
Wojciech Knapik