[PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables
Hello all,
Like the title says, using "gin_fuzzy_search_limit" degrades speed when it
has a relatively low setting.
What do I mean by "relatively low"? An example would be when a table with a
GIN index has many millions of rows and a particular keyword search has
1,000,000 possible results because the keyword is very common (or it's just
that the table is so supremely large that even a somewhat common keyword
appears enough to return one million results). However, you only want to
return around 100 random results from that one million, so you set
gin_fuzzy_search_limit to 100. That limit is relatively low when you look
at the ratio of the limit value to the possible results: 100 / 1,000,000 =
0.0001. You'll find the query is very slow for such a low ratio. It isn't
so slow if gin_fuzzy_search_limit is 100 but the keyword search has only a
total of 10,000 possible results (resulting in a higher ratio of 0.1).
This would explain why the documentation says that "From experience, values
in the thousands (e.g., 5000 — 20000) work well". Queries rarely return
result sets large enough for gin_fuzzy_search_limit values between 5,000 and
20,000 to produce low ratios, so the performance issue I've observed goes
unnoticed (against 1,000,000 results, those values give relatively high
ratios of 0.005 to 0.02). However, if you want a lower
gin_fuzzy_search_limit such as 100 while also having a relatively large
table, you'll run into this slowness.
I discussed this issue and its cause in more detail in my original bug
report:
/messages/by-id/16220-1a0a4f0cb67cafdc@postgresql.org
Attached is SQL to test and observe this issue and also attached is a patch
I want to eventually submit to a commitfest.
Best regards,
Adé
Adé <ade.hey@gmail.com> writes:
> Like the title says, using "gin_fuzzy_search_limit" degrades speed when it
> has a relatively low setting.
> ...
> Attached is SQL to test and observe this issue and also attached is a patch
> I want to eventually submit to a commitfest.
I took a brief look at this. It seems like what you're actually trying
to accomplish is to ensure that entryLoadMoreItems's "stepright" path
is taken, instead of re-descending the index from the root. Okay,
I can see why that'd be a big win, but why are you tying it to the
dropItem behavior? It should apply any time we're iterating this loop
more than once. IOW, it seems to me like the logic about when to step
right is just kinda broken, and this is a band-aid rather than a full fix.
The performance hit is worse for fuzzy-search mode because it will
iterate the loop more (relative to the amount of work done elsewhere),
but there's still a potential for wasted work.
Actually, a look at the code coverage report shows that the
not-step-right path is never taken at all in our standard regression
tests. Maybe that just says bad things about the tests' coverage, but
now I'm wondering whether we could just flush that code path altogether,
and assume that we should always step right at this point.
[ cc'ing heikki and alexander, who seem to have originated that code
at 626a120656a75bf4fe64b1d0d83c23cb38d3771a. The commit message says
it saves a lot of I/O, but I'm wondering if this report disproves that.
In any case the lack of test coverage is pretty damning. ]
While we're here, what do you think about the comment in the other
code branch just above:
/* XXX: shouldn't we apply the fuzzy search limit here? */
I'm personally inclined to suspect that the fuzzy-search business has
got lots of bugs, which haven't been noticed because (a) it's so squishily
defined that one can hardly tell whether a given result is buggy or not,
and (b) nearly nobody uses it anyway (possibly not unrelated to (a)).
As somebody who evidently is using it, you're probably more motivated
to track down bugs than the rest of us.
regards, tom lane
Hi, Tom. Thanks for taking a look.
> It seems like what you're actually trying
> to accomplish is to ensure that entryLoadMoreItems's "stepright" path
> is taken, instead of re-descending the index from the root.
What I was primarily trying to do is make sure that when entryLoadMoreItems
is called, it loads new items that it didn't load previously, which fails to
happen in certain special cases. But the solution to that goal does result
in the "stepright" path being used. I explain the main goal in a segment of
my bug report this way, though it's a bit long-winded (from
/messages/by-id/16220-1a0a4f0cb67cafdc@postgresql.org
):
Since the program doesn't load all items into memory at once, it calls the
"entryLoadMoreItems" function when it needs to get another page of items to
iterate through. The "entryLoadMoreItems" function calls are passed an
"advancePast" variable as an argument. This variable decides which leaf page
in the items/posting tree more items should be retrieved from. Usually, when
iterating through all possible results, execution exits the do-while loop
responsible for iteration in order to perform some important actions
(including updating the "advancePast" variable) before returning to the loop
to iterate over more items. However, when "dropItem" returns true many times
in succession, the do-while loop cannot be exited to update the
"advancePast" variable until a non-drop finally occurs. Because
"advancePast" is never updated to a value past the items already iterated
through, the calls to "entryLoadMoreItems" made while stuck in the do-while
loop repeatedly return the same items.
Along with the issue of returning the same items, there's the issue of how
the "entryLoadMoreItems" function traverses the posting tree from the root
each time it's called while stuck in the do-while loop. This especially is
the cause of the bad performance seen for low "gin_fuzzy_search_limit"
values. In some cases, the situation is made even worse when "advancePast"
is set to a value that leads to loading a page with relatively few items
actually past "advancePast", so that "entryLoadMoreItems" must be called
again almost immediately. Because "advancePast" never gets updated, this
results in a higher than usual succession of "entryLoadMoreItems" calls (the
program loads the same page, iterates over the same relatively few items,
then loads the same page again), with each call requiring traversal from the
root of the posting tree down to the same leaf page as before.
My patch makes it so that when stuck in the do-while loop after many
successive "dropItem" calls returning true, the program instead loads
actually new items by passing the last dropped item into
"entryLoadMoreItems" instead of the "advancePast" variable, which cannot be
appropriately updated while stuck in the do-while loop. This means
"entryLoadMoreItems" will load the items ordered right after the last
dropped item. That last dropped item is also the current item ("curItem"),
so "entryLoadMoreItems" can obtain the next page of items directly by
stepping right from the current page instead of traversing from the root of
the posting tree, which is the most important fix for performance.
Regarding this:
> While we're here, what do you think about the comment in the other
> code branch just above:
> /* XXX: shouldn't we apply the fuzzy search limit here? */
> I'm personally inclined to suspect that the fuzzy-search business has
> got lots of bugs, which haven't been noticed because (a) it's so squishily
> defined that one can hardly tell whether a given result is buggy or not,
> and (b) nearly nobody uses it anyway (possibly not unrelated to (a)).
> As somebody who evidently is using it, you're probably more motivated
> to track down bugs than the rest of us.
I think the comment is correct: the limit should be applied there to stay
consistent. As the comment above it says, that code branch handles two
cases: (1) reaching the last page of a posting tree, or (2) an
"entry"/keyword with so few results that the item pointers fit in a single
page containing a posting list. If there is a chance of a dropped item on
the other pages of the posting tree, there should be a chance of dropped
items on the last page too, for consistency's sake at least. There should
likewise be a chance of dropped items when iterating the single posting
list of an entry with relatively few results. Placing "||
(entry->reduceResult == true && dropItem(entry))" at the end of the while
condition should be all that's needed to apply the fuzzy search limit there.
And I agree that usage of the fuzzy-search feature is probably extremely
rare, and the way I'm using it probably rarer still. So thank you for taking
a look at it. It's a really great feature for me, though, and I'm glad its
creator put it in.
Regards,
Ade
On Tue, Mar 10, 2020 at 7:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Adé <ade.hey@gmail.com> writes:
>> It seems like what you're actually trying
>> to accomplish is to ensure that entryLoadMoreItems's "stepright" path
>> is taken, instead of re-descending the index from the root.
> What I was primarily trying to do is make sure that when entryLoadMoreItems
> is called, it loads new items that it didn't load previously, which would
> occur in special cases. But the solution to that goal does result in the
> "stepright" path being used.
Oh, hm, now I see what you mean: as things stand, it's likely to
repeatedly (and very expensively) reload the same page we were
already on, until the random dropItem() test finally accepts some
item from that page. Ick.
I think though that the fix can be a bit less messy than you have here,
because advancePast is just a local variable in entryGetItem, so we
can overwrite it without any problem. So what I want to do is just
update it to equal entry->curItem before looping around. But shoving
that assignment into the while-condition was too ugly for my taste
(and no, I didn't like your assignment there either). So I ended up
refactoring the do-loop into a for-loop with internal break conditions,
as attached.
I also made the posting-list case handle reduction in the same way,
and for good measure changed the bitmap-result case to look similar,
which caused me to notice that it had a bug too: the "continue" case
within that loop failed to reset gotitem = false as it should,
if we'd looped around after rejecting an item due to reduceResult.
As far as I can see, that would lead to returning the previously-
rejected curItem value, which isn't fatal but it's still wrong.
So I just got rid of the gotitem variable altogether; it really
wasn't helping with either clarity or correctness.
This patch also adds a couple of test cases so that we have more
code coverage in this area. The overall coverage of ginget.c
is still mighty lame, but at least we're going through some of
these lines that we weren't before.
I'm inclined to back-patch this. Given how fuzzy the definition
of gin_fuzzy_search_limit is, it seems unlikely that anyone would
be depending on the current buggy behavior.
regards, tom lane
Attachments:
gin-fuzzy-search-limit-fix-2.patch (text/x-diff; charset=us-ascii) +86 -12
I wrote:
> I'm inclined to back-patch this. Given how fuzzy the definition
> of gin_fuzzy_search_limit is, it seems unlikely that anyone would
> be depending on the current buggy behavior.
And done. Thanks for the bug report and patch!
regards, tom lane
Great. Thanks for refactoring it further, fixing other bugs in there, and
making it cleaner too!
Regards,
Ade
On Fri, Apr 3, 2020 at 1:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: