[PATCH] tsearch parser inefficiency if text includes URLs or emails
Hi,
While playing around with/evaluating tsearch I noticed that to_tsvector is
obscenely slow for some files. After some profiling I found that this is due
to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c.
If a multibyte encoding is in use, TParserInit copies the whole remaining input
and converts it to wchar_t or pg_wchar - for every email or protocol-prefixed
URL in the document - which obviously is bad.
I solved the issue by adding a separate TParserCopyInit/TParserCopyClose pair
which reuses the already-converted strings of the original TParser - only at
different offsets.
Another approach would be to get rid of the separate parser invocations -
requiring a bunch of additional states. This seemed more complex to me, so I
wanted to get some feedback first.
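For clarity, here is roughly what the two code paths amount to - a simplified,
self-contained sketch using made-up names (ToyParser, copy_and_convert,
share_at_offset), not the actual TParser code:

#include <stdlib.h>
#include <wchar.h>

typedef struct ToyParser
{
	const char	   *str;		/* multibyte input */
	const wchar_t  *wstr;		/* wide-char form of the same input */
	size_t			lenstr;		/* remaining length in characters */
} ToyParser;

/*
 * What a fresh sub-parser has to do today: allocate and re-convert
 * everything from the current position to the end of the document,
 * i.e. O(remaining input) work for every embedded URL or email address.
 */
ToyParser *
copy_and_convert(const char *str, size_t byte_off, size_t byte_len)
{
	ToyParser  *p = calloc(1, sizeof(ToyParser));
	wchar_t	   *w = calloc(byte_len + 1, sizeof(wchar_t));

	p->str = str + byte_off;
	p->lenstr = mbstowcs(w, p->str, byte_len);	/* repeated conversion */
	p->wstr = w;
	return p;
}

/*
 * What TParserCopyInit does, roughly: point into the buffers the original
 * parser already converted, at its current byte/character offsets - O(1)
 * work per sub-parse, no allocation or reconversion of string data.
 */
ToyParser *
share_at_offset(const ToyParser *orig, size_t byte_off, size_t char_off)
{
	ToyParser  *p = calloc(1, sizeof(ToyParser));

	p->str = orig->str + byte_off;		/* byte offset into shared bytes */
	p->wstr = orig->wstr + char_off;	/* char offset into shared wide string */
	p->lenstr = orig->lenstr - char_off;
	return p;
}

The copy obviously must not outlive the parser it borrows the buffers from,
which is fine here because the sub-parsers only exist for the duration of
p_ishost/p_isURLPath.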
Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';
─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)
Time: 5835.676 ms
With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';
─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)
Time: 395.341 ms
I'll clean up the patch if it seems like a sensible solution...
Is this backpatch-worthy?
Andres
PS: I left the additional define in for the moment so that it's easier to see the
performance differences.
Attachment: reuse-strings-in-tparser-recursion.patch (text/x-patch)
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 301c1eb..14757d9 100644
*** a/src/backend/tsearch/wparser_def.c
--- b/src/backend/tsearch/wparser_def.c
***************
*** 24,29 ****
--- 24,30 ----
/* Define me to enable tracing of parser behavior */
/* #define WPARSER_TRACE */
+ #define WPARSER_REUSE
/* Output token categories */
*************** TParserInit(char *str, int len)
*** 328,333 ****
--- 329,358 ----
return prs;
}
+ static TParser *
+ TParserCopyInit(const TParser const* orig, size_t offset, int len)
+ {
+ TParser *prs = (TParser *) palloc0(sizeof(TParser));
+
+ prs->charmaxlen = orig->charmaxlen;
+ prs->usewide = orig->usewide;
+ prs->lenstr = len;
+
+ prs->str = orig->str + offset;
+ prs->pgwstr = orig->pgwstr + offset;
+ prs->wstr = orig->wstr + offset;
+
+
+ prs->state = newTParserPosition(NULL);
+ prs->state->state = TPS_Base;
+
+ #ifdef WPARSER_TRACE
+ fprintf(stderr, "parsing \"%.*s\"\n", len, orig->str + offset);
+ #endif
+
+ return prs;
+ }
+
static void
TParserClose(TParser *prs)
{
*************** TParserClose(TParser *prs)
*** 345,351 ****
--- 370,388 ----
if (prs->pgwstr)
pfree(prs->pgwstr);
#endif
+ pfree(prs);
+ }
+ static void
+ TParserCopyClose(TParser *prs)
+ {
+ while (prs->state)
+ {
+ TParserPosition *ptr = prs->state->prev;
+
+ pfree(prs->state);
+ prs->state = ptr;
+ }
pfree(prs);
}
*************** p_isignore(TParser *prs)
*** 617,623 ****
static int
p_ishost(TParser *prs)
{
! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
int res = 0;
tmpprs->wanthost = true;
--- 654,664 ----
static int
p_ishost(TParser *prs)
{
! #ifdef WPARSER_REUSE
! TParser *tmpprs = TParserCopyInit(prs, prs->state->posbyte, prs->lenstr - prs->state->posbyte);
! #else
! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
! #endif
int res = 0;
tmpprs->wanthost = true;
*************** p_ishost(TParser *prs)
*** 631,637 ****
--- 672,682 ----
prs->state->charlen = tmpprs->state->charlen;
res = 1;
}
+ #ifdef WPARSER_REUSE
+ TParserCopyClose(tmpprs);
+ #else
TParserClose(tmpprs);
+ #endif
return res;
}
*************** p_ishost(TParser *prs)
*** 639,645 ****
static int
p_isURLPath(TParser *prs)
{
! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
int res = 0;
tmpprs->state = newTParserPosition(tmpprs->state);
--- 684,694 ----
static int
p_isURLPath(TParser *prs)
{
! #ifdef WPARSER_REUSE
! TParser *tmpprs = TParserCopyInit(prs, prs->state->posbyte, prs->lenstr - prs->state->posbyte);
! #else
! TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
! #endif
int res = 0;
tmpprs->state = newTParserPosition(tmpprs->state);
*************** p_isURLPath(TParser *prs)
*** 654,660 ****
--- 703,713 ----
prs->state->charlen = tmpprs->state->charlen;
res = 1;
}
+ #ifdef WPARSER_REUSE
+ TParserCopyClose(tmpprs);
+ #else
TParserClose(tmpprs);
+ #endif
return res;
}
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
While playing around with/evaluating tsearch I noticed that to_tsvector is
obscenely slow for some files. After some profiling I found that this is
due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c. If
a multibyte encoding is in use, TParserInit copies the whole remaining
input and converts it to wchar_t or pg_wchar - for every email or
protocol-prefixed URL in the document - which obviously is bad.
I solved the issue by adding a separate TParserCopyInit/TParserCopyClose pair
which reuses the already-converted strings of the original TParser -
only at different offsets.
Another approach would be to get rid of the separate parser invocations -
requiring a bunch of additional states. This seemed more complex to me, so
I wanted to get some feedback first.
Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE
filename = '/usr/share/doc/libdrm-nouveau1/changelog';
...
(1 row)
Time: 5835.676 ms
With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE
filename = '/usr/share/doc/libdrm-nouveau1/changelog';
...
(1 row)
Time: 395.341 ms
I'll clean up the patch if it seems like a sensible solution...
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one thinking this
is an issue?
Andres
Attachment: 0001-Fix-TSearch-inefficiency-because-of-repeated-copying.patch (text/x-patch)
From cbdeb0bb636f3b7619d0a3019854809ea5565dac Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 8 Nov 2009 16:30:42 +0100
Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings
---
src/backend/tsearch/wparser_def.c | 63 ++++++++++++++++++++++++++++++++++--
1 files changed, 59 insertions(+), 4 deletions(-)
diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 301c1eb..7bbd826 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -328,6 +328,41 @@ TParserInit(char *str, int len)
return prs;
}
+/*
+ * As an alternative to a full TParserInit one can create a
+ * TParserCopy which basically is a normally TParser without a private
+ * copy of the string - instead it uses the one from another TParser.
+ * This is usefull because at some places TParsers are created
+ * recursively and the repeated copying around of the strings can
+ * cause major inefficiency.
+ * Obviously one may not close the original TParser before the copy.
+ */
+static TParser *
+TParserCopyInit(const TParser const* orig)
+{
+ TParser *prs = (TParser *) palloc0(sizeof(TParser));
+
+ prs->charmaxlen = orig->charmaxlen;
+ prs->usewide = orig->usewide;
+ prs->lenstr = orig->lenstr - orig->state->posbyte;
+
+ prs->str = orig->str + orig->state->posbyte;
+ if(orig->pgwstr)
+ prs->pgwstr = orig->pgwstr + orig->state->poschar;
+ if(orig->wstr)
+ prs->wstr = orig->wstr + orig->state->poschar;
+
+ prs->state = newTParserPosition(NULL);
+ prs->state->state = TPS_Base;
+
+#ifdef WPARSER_TRACE
+ fprintf(stderr, "parsing copy \"%.*s\"\n", len, str);
+#endif
+
+ return prs;
+}
+
+
static void
TParserClose(TParser *prs)
{
@@ -350,6 +385,26 @@ TParserClose(TParser *prs)
}
/*
+ * See TParserCopyInit
+ */
+static void
+TParserCopyClose(TParser *prs)
+{
+ while (prs->state)
+ {
+ TParserPosition *ptr = prs->state->prev;
+
+ pfree(prs->state);
+ prs->state = ptr;
+ }
+#ifdef WPARSER_TRACE
+ fprintf(stderr, "closing parser copy");
+#endif
+ pfree(prs);
+}
+
+
+/*
* Character-type support functions, equivalent to is* macros, but
* working with any possible encodings and locales. Notes:
* - with multibyte encoding and C-locale isw* function may fail
@@ -617,7 +672,7 @@ p_isignore(TParser *prs)
static int
p_ishost(TParser *prs)
{
- TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
+ TParser *tmpprs = TParserCopyInit(prs);
int res = 0;
tmpprs->wanthost = true;
@@ -631,7 +686,7 @@ p_ishost(TParser *prs)
prs->state->charlen = tmpprs->state->charlen;
res = 1;
}
- TParserClose(tmpprs);
+ TParserCopyClose(tmpprs);
return res;
}
@@ -639,7 +694,7 @@ p_ishost(TParser *prs)
static int
p_isURLPath(TParser *prs)
{
- TParser *tmpprs = TParserInit(prs->str + prs->state->posbyte, prs->lenstr - prs->state->posbyte);
+ TParser *tmpprs = TParserCopyInit(prs);
int res = 0;
tmpprs->state = newTParserPosition(tmpprs->state);
@@ -654,7 +709,7 @@ p_isURLPath(TParser *prs)
prs->state->charlen = tmpprs->state->charlen;
res = 1;
}
- TParserClose(tmpprs);
+ TParserCopyClose(tmpprs);
return res;
}
--
1.6.5.12.gd65df24
On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
While playing around with/evaluating tsearch I noticed that to_tsvector is
obscenely slow for some files. After some profiling I found that this is
due to using a separate TSParser in p_ishost/p_isURLPath in wparser_def.c. If
a multibyte encoding is in use, TParserInit copies the whole remaining
input and converts it to wchar_t or pg_wchar - for every email or
protocol-prefixed URL in the document - which obviously is bad.
I solved the issue by adding a separate TParserCopyInit/TParserCopyClose pair
which reuses the already-converted strings of the original TParser -
only at different offsets.
Another approach would be to get rid of the separate parser invocations -
requiring a bunch of additional states. This seemed more complex to me, so
I wanted to get some feedback first.
Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE
filename = '/usr/share/doc/libdrm-nouveau1/changelog';
...
(1 row)
Time: 5835.676 ms
With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE
filename = '/usr/share/doc/libdrm-nouveau1/changelog';
...
(1 row)
Time: 395.341 ms
I'll clean up the patch if it seems like a sensible solution...
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one thinking this
is an issue?
Andres
+1
As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.
Regards,
Ken
On Sunday 08 November 2009 17:41:15 Kenneth Marshall wrote:
On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one thinking this
is an issue?
Andres
+1
As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.
Please note that this is mostly an issue with rather long documents that include
either email addresses (i.e., an @) or links with protocol prefixes
(like ?+://) - so it might not give you personally a benefit :-(
Andres
Andres Freund <andres@anarazel.de> wrote:
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
While playing around with/evaluating tsearch I noticed that to_tsvector
is obscenely slow for some files. After some profiling I found that
this is due to using a separate TSParser in p_ishost/p_isURLPath in
wparser_def.c. If a multibyte encoding is in use, TParserInit copies
the whole remaining input and converts it to wchar_t or pg_wchar -
for every email or protocol-prefixed URL in the document - which
obviously is bad.
I solved the issue by adding a separate TParserCopyInit/
TParserCopyClose pair which reuses the already-converted strings of
the original TParser - only at different offsets.
Another approach would be to get rid of the separate parser
invocations - requiring a bunch of additional states. This seemed
more complex to me, so I wanted to get some feedback first.
Without patch:
Time: 5835.676 ms
With patch:
Time: 395.341 ms
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version.
I've taken this one for review, and have taken a preliminary look.
It is in context format, applies cleanly, and passes "make check".
Next I read through the code, and have the same question that Andres
posed 12 days ago. His patch massively reduces the cost of the parser
recursively calling itself for some cases, and it seems like the least
invasive way to modify the parser to solve this performance problem;
but it does beg the question of why a state machine like this should
recursively call itself when it hits certain states.
The patch is mitigating the penalty for what seems to me to be an
existing ugly kludge. Is it acceptable to address the problem in this
way, or should the much more invasive work be done to eliminate the
kludge? (Note: I personally would much rather see the performance
penalty addressed this way, and a TODO added for the more invasive
work, than to leave this alone for the next release if there's nobody
willing to tackle the problem at a more fundamental level.)
If nobody has a contrary opinion, I'll proceed with the review of this
patch and add something to the TODO page for the more invasive work.
-Kevin
Kevin Grittner wrote:
(Note: I personally would much rather see the performance
penalty addressed this way, and a TODO added for the more invasive
work, than to leave this alone for the next release if there's nobody
willing to tackle the problem at a more fundamental level.)
+1
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
It is in context format, applies cleanly, and passes "make check".
Unfortunately the latter is not saying much - I had a bug there and it was not
found by the regression tests. Perhaps I should take a stab at adding at least
some more...
It is in context format, applies cleanly, and passes "make check".
Next I read through the code, and have the same question that Andres
posed 12 days ago. His patch massively reduces the cost of the parser
recursively calling itself for some cases, and it seems like the least
invasive way to modify the parser to solve this performance problem;
but it does beg the question of why a state machine like this should
recursively call itself when it hits certain states.
I was wondering about that as well. I am not completely sure, but to me it
looks like it's just done to reduce the number of rules and states.
I have to say that that code is not exactly clear and well documented...
Andres
Andres Freund wrote:
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
It is in context format, applies cleanly, and passes "make check".
Unfortunately the latter is not saying much - I had a bug there and
it was not found by the regression tests. Perhaps I should take a
stab at adding at least some more...
Sounds like a good idea. The one thing to avoid is anything with a
long enough run time to annoy those that run it many times a day.
It is in context format, applies cleanly, and passes "make check".
Next I read through the code, and have the same question that
Andres posed 12 days ago. His patch massively reduces the cost of
the parser recursively calling itself for some cases, and it seems
like the least invasive way to modify the parser to solve this
performance problem; but it does beg the question of why a state
machine like this should recursively call itself when it hits
certain states.
I was wondering about that as well. I am not completely sure, but to
me it looks like it's just done to reduce the number of rules and
states.
I'm assuming that's the reason, but didn't dig deep enough to be sure.
I suspect to be really sure, I'd have to set it up without the
recursion and see what breaks. I can't imagine it would be anything
which couldn't be fixed by adding enough states; but perhaps they ran
into something where these types would require so many new states that
the recursion seemed like the lesser of evils.
I have to say that that code is not exactly clear and well
documented...
Yeah. I was happy with the level of documentation that you added with
your new code, but what was there before is mighty thin. If you
gleaned enough information while working on it to feel comfortable
adding documentation anywhere else, that would be a good thing.
So far the only vote is to proceed with the mitigation, which was my
preference, and apparently yours -- so I guess we're at 3 to 0 in
favor of that. I'll mark the patch as "Waiting on Author" so you can
add any comments and regression tests you feel are appropriate.
By the way, I found one typo in the comments -- it should be useful,
not usefull.
I liked what I saw so far, but need to spend more time desk-checking
for correctness, testing to confirm that it doesn't change results,
and confirming the performance improvement.
-Kevin
On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote:
Andres Freund wrote:
On Saturday 14 November 2009 01:03:33 Kevin Grittner wrote:
It is in context format, applies cleanly, and passes "make check".
Unfortunately the latter is not saying much - I had a bug there and
it was not found by the regression tests. Perhaps I should take a
stab at adding at least some more...
Sounds like a good idea. The one thing to avoid is anything with a
long enough run time to annoy those that run it many times a day.
Hm. There actually are tests exercising the part where I had a bug...
Strange.
It was a bug involving uninitialized data, so probably the regression tests
were just "lucky".
It is in context format, applies cleanly, and passes "make check".
Next I read through the code, and have the same question that
Andres posed 12 days ago. His patch massively reduces the cost of
the parser recursively calling itself for some cases, and it seems
like the least invasive way to modify the parser to solve this
performance problem; but it does beg the question of why a state
machine like this should recursively call itself when it hits
certain states.
I was wondering about that as well. I am not completely sure, but to
me it looks like it's just done to reduce the number of rules and
states.
I'm assuming that's the reason, but didn't dig deep enough to be sure.
I suspect to be really sure, I'd have to set it up without the
recursion and see what breaks. I can't imagine it would be anything
which couldn't be fixed by adding enough states; but perhaps they ran
into something where these types would require so many new states that
the recursion seemed like the lesser of evils.
This is similar to my understanding...
I have to say that that code is not exactly clear and well
documented...
Yeah. I was happy with the level of documentation that you added with
your new code, but what was there before is mighty thin. If you
gleaned enough information while working on it to feel comfortable
adding documentation anywhere else, that would be a good thing.
It definitely would be a good thing. But that would definitely be a separate
patch. And I fear my current level of knowledge is not sufficient, and I am not
sure I can make myself interested enough in that part.
So far the only vote is to proceed with the mitigation, which was my
preference, and apparently yours -- so I guess we're at 3 to 0 in
favor of that. I'll mark the patch as "Waiting on Author" so you can
add any comments and regression tests you feel are appropriate.
By the way, I found one typo in the comments -- it should be useful,
not usefull.
Ok, will update.
I liked what I saw so far, but need to spend more time desk-checking
for correctness, testing to confirm that it doesn't change results,
and confirming the performance improvement.
Thanks again for your reviewing!
Andres
Andres Freund <andres@anarazel.de> wrote:
On Saturday 14 November 2009 15:33:00 Kevin Grittner wrote:
Andres Freund wrote:
I had a bug there and it was not found by the regression tests.
Perhaps I should take a stab at adding at least some more...
Sounds like a good idea.
Hm. There actually are tests exercising the part where I had a
bug... Strange. It was a bug involving uninitialized data, so
probably the regression tests were just "lucky".
OK. I won't be looking for extra tests.
I have to say that that code is not exactly clear and well
documented...
Yeah. I was happy with the level of documentation that you added
with your new code, but what was there before is mighty thin. If
you gleaned enough information while working on it to feel
comfortable adding documentation anywhere else, that would be a
good thing.
It definitely would be a good thing. But that would definitely be a
separate patch. And I fear my current level of knowledge is not
sufficient, and I am not sure I can make myself interested
enough in that part.
Fair enough. I won't be looking for new comments for the old code.
By the way, I found one typo in the comments -- it should be
useful, not usefull.
Ok, will update.
Given how trivial that is, I'm putting this back in "Needs Review"
status, and resuming my review work. Barring surprises, I should wrap
this up whenever I can free up two or three hours.
-Kevin
After getting off to a good start, it looks like this patch is now stuck
waiting for a second review pass from Kevin right now, with no open
items for Andres to correct. Since the only issues on the table seem to
be that of code aesthetics and long-term planning for this style of
implementation rather than specific functional bits, I'm leaning toward
saying this one is ready to have a committer look at it. Any comments
from Kevin or Andres about where this is at?
--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
On Monday 07 December 2009 02:12:43 Greg Smith wrote:
After getting off to a good start, it looks like this patch is now stuck
waiting for a second review pass from Kevin right now, with no open
items for Andres to correct. Since the only issues on the table seem to
be that of code aesthetics and long-term planning for this style of
implementation rather than specific functional bits, I'm leaning toward
saying this one is ready to have a committer look at it. Any comments
from Kevin or Andres about where this is at?
I think it should be ready - the only known thing it needs is a
s/usefull/useful/.
I will take another look but I doubt I will see anything new.
Andres
Greg Smith <greg@2ndquadrant.com> wrote:
After getting off to a good start, it looks like this patch is now
stuck waiting for a second review pass from Kevin right now, with
no open items for Andres to correct. Since the only issues on the
table seem to be that of code aesthetics and long-term planning
for this style of implementation rather than specific functional
bits, I'm leaning toward saying this one is ready to have a
committer look at it. Any comments from Kevin or Andres about
where this is at?
Yeah, really the only thing I found to complain about was one
misspelled word in a comment. I am currently the hold-up, due to
fighting off a bout of some virus and having other "real world"
issues impinge. The only thing left to do, besides correcting the
spelling, is to confirm the author's performance improvements and
confirm that there is no degradation in a non-targeted situation.
Frankly, I'd be amazed if there was a performance regression,
because all it really does is pass a pointer to a new spot in an
existing input buffer rather than allocating new space and copying
the input from the desired spot to the end of the buffer. I can't
think of any situations where calculating the new address should be
slower than calculating the new address and copying from there to
the end of the buffer.
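Back-of-the-envelope, purely as an illustration: for a document of N
characters containing k embedded URLs or email addresses, the unpatched code
converts up to N characters per recursive sub-parse, i.e. on the order of
N*k character conversions in total, while the patched code converts the
document once and then only does pointer arithmetic per sub-parse. The
roughly 15x improvement Andres measured on the changelog file (5835 ms down
to 395 ms) is about what one would expect from removing that repeated work.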
-Kevin
re:
http://archives.postgresql.org/pgsql-hackers/2009-11/msg00754.php
Alvaro Herrera <alvherre@commandprompt.com> wrote:
Kevin Grittner wrote:
(Note: I personally would much rather see the performance
penalty addressed this way, and a TODO added for the more
invasive work, than to leave this alone for the next release if
there's nobody willing to tackle the problem at a more
fundamental level.)
+1
I haven't added a TODO yet because I'm not sure how to frame it.
I'm inclined to think it would be no more work to replace the current
recursively called state engine with something easier to read and
understand than to try to fix the current oddities. Perhaps
something along the lines of this?:
http://vo.astronet.ru/arxiv/dict_regex.html
I suspect we'd need to get it to use the same regexp code used
elsewhere in PostgreSQL.
Thoughts?
-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
I'm inclined to think it would be no more work to replace the current
recursively called state engine with something easier to read and
understand than to try to fix the current oddities. Perhaps
something along the lines of this?:
I suspect we'd need to get it to use the same regexp code used
elsewhere in PostgreSQL.
We're certainly not going to start carrying two regexp engines,
so yeah ;-)
I guess if this is proposed as a replacement for the existing parser,
we'd need to see some speed comparisons. I have no idea at all which
is faster.
regards, tom lane