inefficient loop in StandbyReleaseLockList()
Hi hackers,
I've seen a few cases now for v13 where the startup process on a
standby appears to be stuck on StandbyReleaseLockList(). It looks
like most of the time is spent on list_delete_first(). I believe this
is related to the recent list rewrite (1cff1b9), which has the
following note in the commit message:
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
The current form of StandbyReleaseLockList() is something like this:
while (mylist != NIL)
{
    int     i = linitial_int(mylist);
    ...
    mylist = list_delete_first(mylist);
}
For a long enough list, this is wildly inefficient. The following
form is much faster for longer lists:
foreach(lc, mylist)
{
    int     i = lfirst_int(lc);
    ...
}
list_free(mylist);
I wrote up a simple test function for each form. For a list of
500,000 integers, the first form took about a minute, while the second
form took about 6 milliseconds.
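The test functions aren't attached here, but the effect is easy to model outside the backend. Below is a standalone C sketch (not the real pg_list.h; the names only mimic it) that counts the element shifts list_delete_first() must perform on an array-based list:

```c
#include <assert.h>
#include <string.h>

/* Standalone sketch of the two loop shapes.  Since 1cff1b9, a List is
 * an expansible array, so deleting the first element must shift every
 * remaining element down by one. */
typedef struct
{
    int elements[500];
    int length;
} IntList;

static long shifted;            /* total elements moved, exposing O(n^2) */

/* old form: consume the head, then delete it */
static long
consume_delete_first(IntList *list)
{
    long    sum = 0;

    while (list->length > 0)
    {
        sum += list->elements[0];
        memmove(&list->elements[0], &list->elements[1],
                (list->length - 1) * sizeof(int));
        shifted += list->length - 1;
        list->length--;
    }
    return sum;
}

/* new form: iterate over the list, then free it all at once */
static long
consume_foreach(const IntList *list)
{
    long    sum = 0;

    for (int i = 0; i < list->length; i++)
        sum += list->elements[i];
    return sum;                 /* caller would list_free() here */
}
```

For n elements, the first form moves n(n-1)/2 elements in total while the second moves none, which matches the minute-vs-milliseconds gap above.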
I've attached a patch that converts StandbyReleaseLockList() to the
second loop form.
Nathan
Attachments:
Attachment: improve_loop.patch (application/octet-stream)
On Thu, Oct 28, 2021 at 9:07 AM Bossart, Nathan <bossartn@amazon.com> wrote:
I've seen a few cases now for v13 where the startup process on a
standby appears to be stuck on StandbyReleaseLockList(). [...] I've
attached a patch that converts StandbyReleaseLockList() to the
second loop form.
+1, deleting everything at once is much better. Deleting the elements
one by one with list_delete_first() is comparatively heavy: each call
shifts the remaining array elements, copying work that is unnecessary
here.
Regards,
Amul
At Thu, 28 Oct 2021 09:54:42 +0530, Amul Sul <sulamul@gmail.com> wrote in
On Thu, Oct 28, 2021 at 9:07 AM Bossart, Nathan <bossartn@amazon.com> wrote:
I've seen a few cases now for v13 where the startup process on a
standby appears to be stuck on StandbyReleaseLockList(). [...] For a
list of 500,000 integers, the first form took about a minute, while
the second form took about 6 milliseconds.
Nice finding!
I've attached a patch that converts StandbyReleaseLockList() to the
second loop form.

+1, deleting everything at once is much better. Deleting the elements
one by one with list_delete_first() is comparatively heavy: each call
shifts the remaining array elements, copying work that is unnecessary
here.
+1.
I found several other instances of the pattern
"while(list){list_delete_first(); /*no-break*/}" in
llvm_release_context, gistProcessEmptyingQueue, AtEOXact_Namespace and
maybe transformGraph and processState in trgm_regexp.c. We might want
to apply this technique to the first three, and maybe to the last two.
However, I'm fine with fixing only StandbyReleaseLockList(), which
actually suffers from list_delete_first().
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 28, 2021 at 03:57:51PM +0900, Kyotaro Horiguchi wrote:
I found several other instances of the pattern
"while(list){list_delete_first(); /*no-break*/}" in
llvm_release_context, gistProcessEmptyingQueue, AtEOXact_Namespace and
maybe transformGraph and processState in trgm_regexp.c. We might want
to apply this technique to the first three, and maybe to the last two.

However, I'm fine with fixing only StandbyReleaseLockList(), which
actually suffers from list_delete_first().
I can also see a large gap between one technique and the other, so
this looks like a good catch to me coming from Nathan :)
As it could indeed hurt badly the time it takes to do a shutdown or to
end recovery, we had better back-patch that down to 13 in my opinion.
transformGraph and processState seem to be worth improving on
performance grounds as well, but they look less critical than this
one; we could still do something on HEAD. Skimming through the rest of
the code, we may be able to improve some areas related to namespaces,
but that does not seem worth it in terms of code complication.
--
Michael
On 10/28/21, 12:58 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
On Thu, Oct 28, 2021 at 03:57:51PM +0900, Kyotaro Horiguchi wrote:
However, I'm fine with fixing only StandbyReleaseLockList(), which
actually suffers from list_delete_first().

I can also see a large gap between one technique and the other, so
this looks like a good catch to me coming from Nathan :)
:)
As it could indeed hurt badly the time it takes to do a shutdown or to
end recovery, we had better back-patch that down to 13 in my opinion.
+1
transformGraph and processState seem to be worth improving on
performance grounds as well, but they look less critical than this
one; we could still do something on HEAD. Skimming through the rest of
the code, we may be able to improve some areas related to namespaces,
but that does not seem worth it in terms of code complication.
I just did my own scan through uses of list_delete_first(), and I only
found a couple that might be easily transitioned to the foreach()
approach. I don't think I'm going to pick that up at the moment, but
I'd readily help review such patches if there is a demonstrable
improvement.
Nathan
Hi,
On 2021-10-28 15:57:51 +0900, Kyotaro Horiguchi wrote:
I found several other instances of the pattern
"while(list){list_delete_first(); /*no-break*/}" in
llvm_release_context, gistProcessEmptyingQueue, AtEOXact_Namespace and
maybe transformGraph and processState in trgm_regexp.c. We might want
to apply this technique to the three first, and maybe to the last two.
We should be careful with changes like this, because there's some advantages
in the while(!empty) pattern too. Iterating over the whole list doesn't work
if there's any other modifications to the list, or if there's a chance of
errors. For the latter there just needs to be a CHECK_FOR_INTERRUPTS()
somewhere...
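To sketch the hazard (process() here is a hypothetical loop body that may elog(ERROR) or pass through a CHECK_FOR_INTERRUPTS()):

```c
/* Error-tolerant: if process() errors out partway through, the list
 * that remains holds exactly the not-yet-processed elements, so the
 * loop can simply be run again. */
while (mylist != NIL)
{
    process(linitial(mylist));          /* may elog(ERROR) or CFI */
    mylist = list_delete_first(mylist);
}

/* Not error-tolerant: an error mid-iteration leaves every element
 * still in the list (already-processed ones will be seen again on a
 * retry), and list_free() is never reached. */
foreach(lc, mylist)
    process(lfirst(lc));                /* may elog(ERROR) or CFI */
list_free(mylist);
```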
Greetings,
Andres Freund
Hi,
On 2021-10-28 15:07:48 -0700, Andres Freund wrote:
On 2021-10-28 15:57:51 +0900, Kyotaro Horiguchi wrote:
I found several other instances of the pattern
"while(list){list_delete_first(); /*no-break*/}" in
llvm_release_context, gistProcessEmptyingQueue, AtEOXact_Namespace and
maybe transformGraph and processState in trgm_regexp.c. We might want
to apply this technique to the first three, and maybe to the last two.

We should be careful with changes like this, because there's some advantages
in the while(!empty) pattern too. Iterating over the whole list doesn't work
if there's any other modifications to the list, or if there's a chance of
errors. For the latter there just needs to be a CHECK_FOR_INTERRUPTS()
somewhere...
Which leads me to wonder whether the better fix would be to switch to deleting
the last element, but still use the while (!empty) style. That should convert
the O(n^2) due to 1cff1b9 back to O(n). It might or might not be faster/slower
than using foreach(), but it should be within the same ballpark.
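In code, the pop-from-the-end variant would look something like this (a sketch; it assumes llast_int() and list_delete_last() from pg_list.h):

```c
while (mylist != NIL)
{
    int     i = llast_int(mylist);
    ...
    mylist = list_delete_last(mylist);
}
```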
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
Which leads me to wonder whether the better fix would be to switch to deleting
the last element, but still use the while (!empty) style. That should convert
the O(n^2) due to 1cff1b9 back to O(n). It might or might not be faster/slower
than using foreach(), but it should be within the same ballpark.
Does it matter what order we're releasing the locks in?
regards, tom lane
On 10/28/21, 3:15 PM, "Andres Freund" <andres@anarazel.de> wrote:
Which leads me to wonder whether the better fix would be to switch to deleting
the last element, but still use the while (!empty) style. That should convert
the O(n^2) due to 1cff1b9 back to O(n). It might or might not be faster/slower
than using foreach(), but it should be within the same ballpark.
Yeah, deleting from the end of the list yields a similar improvement.
foreach() appears to be slightly faster, but the difference is
basically negligible. For a list of a million integers, foreach()
consistently takes ~12ms, deleting from the end of the list takes
~15ms, and deleting from the beginning of the list takes ~4 minutes.
Nathan
On 10/28/21, 3:25 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@anarazel.de> writes:
Which leads me to wonder whether the better fix would be to switch to deleting
the last element, but still use the while (!empty) style. That should convert
the O(n^2) due to 1cff1b9 back to O(n). It might or might not be faster/slower
than using foreach(), but it should be within the same ballpark.

Does it matter what order we're releasing the locks in?
I'm not seeing anything that indicates the ordering matters. AFAICT
either approach would work in this case. IMO changing the order is
scarier than switching to foreach(), though.
Nathan
"Bossart, Nathan" <bossartn@amazon.com> writes:
On 10/28/21, 3:25 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Does it matter what order we're releasing the locks in?
I'm not seeing anything that indicates the ordering matters. AFAICT
either approach would work in this case. IMO changing the order is
scarier than switching to foreach(), though.
Yeah, that was my feeling...
regards, tom lane
Hi,
On 2021-10-28 19:24:08 -0400, Tom Lane wrote:
"Bossart, Nathan" <bossartn@amazon.com> writes:
On 10/28/21, 3:25 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Does it matter what order we're releasing the locks in?
I'm not seeing anything that indicates the ordering matters. AFAICT
either approach would work in this case. IMO changing the order is
scarier than switching to foreach(), though.

Yeah, that was my feeling...
I suspect the reverse lock order release could be a tad faster. But I probably
wouldn't change it either - I was more thinking of some of the other cases
that delete the first element; here it's a bit harder to know whether there's
a chance of a CFI() or such.
Greetings,
Andres Freund
On Thu, Oct 28, 2021 at 04:52:48PM -0700, Andres Freund wrote:
I suspect the reverse lock order release could be a tad faster. But I probably
wouldn't change it either - I was more thinking of some of the other cases
that delete the first element; here it's a bit harder to know whether there's
a chance of a CFI() or such.
Actually, as the list of recovery locks is saved in TopMemoryContext,
wouldn't it be better to keep a per-cell deletion of the list, which
would mean that we'd better do the operation in the reverse order to
make things faster with the new list implementation? But perhaps
that's what Andres is pointing at with CFIs in the middle of
processing one list of the hash table?
--
Michael
On 10/28/21, 11:53 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
On Thu, Oct 28, 2021 at 04:52:48PM -0700, Andres Freund wrote:
I suspect the reverse lock order release could be a tad faster. But I probably
wouldn't change it either - I was more thinking of some of the other cases
that delete the first element; here it's a bit harder to know whether there's
a chance of a CFI() or such.

Actually, as the list of recovery locks is saved in TopMemoryContext,
wouldn't it be better to keep a per-cell deletion of the list, which
would mean that we'd better do the operation in the reverse order to
make things faster with the new list implementation? But perhaps
that's what Andres is pointing at with CFIs in the middle of
processing one list of the hash table?
Hm. IIUC anything bad enough to cause the startup process to break
out of the StandbyReleaseLockList() loop will also cause the entire
process to be restarted. I'm not seeing any risk of reusing a
half-released lock list. I might be misunderstanding the concern, though.
Nathan
"Bossart, Nathan" <bossartn@amazon.com> writes:
On 10/28/21, 11:53 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
Actually, as the list of recovery locks is saved in TopMemoryContext,
wouldn't it be better to keep a per-cell deletion of the list, which
would mean that we'd better do the operation in the reverse order to
make things faster with the new list implementation? But perhaps
that's what Andres is pointing at with CFIs in the middle of
processing one list of the hash table?
Hm. IIUC anything bad enough to cause the startup process to break
out of the StandbyReleaseLockList() loop will also cause the entire
process to be restarted. I'm not seeing any risk of reusing a
half-released lock list. I might be misunderstanding the concern, though.
Yeah, there's no expectation that this data structure needs to be kept
consistent after an error; and I'm not real sure that the existing
code could claim to satisfy such a requirement if we did need it.
(There's at least a short window where the caller's hash table entry
will point at an already-freed List.)
Pushed the patch as given. I've not yet reviewed other list_delete_first
callers, but I'll take a look. (I seem to remember that I did survey
them while writing 1cff1b95a, but I evidently missed that this code
could be dealing with a list long enough to be problematic.)
regards, tom lane
Hi,
On 2021-10-31 15:38:15 -0400, Tom Lane wrote:
Yeah, there's no expectation that this data structure needs to be kept
consistent after an error; and I'm not real sure that the existing
code could claim to satisfy such a requirement if we did need it.
To be clear, I was making that comment in response to the search for
other places doing the while(!empty) delete_first() style loops. Some of
them are reached via resowners etc, where reaching the same code
repeatedly is perhaps more realistic.
Greetings,
Andres Freund
I wrote:
Pushed the patch as given. I've not yet reviewed other list_delete_first
callers, but I'll take a look. (I seem to remember that I did survey
them while writing 1cff1b95a, but I evidently missed that this code
could be dealing with a list long enough to be problematic.)
I looked at the remaining list_delete_first callers.
1. Attached is a proposed patch to get rid of the calls in trgm_regexp.c.
I'm not certain that those lists could get long enough to be a problem,
given the existing complexity limits in that file (MAX_EXPANDED_STATES
etc). But I'm not certain they can't, either, and it's easy enough to
fix along the same lines as in StandbyReleaseLockList.
2. I think we almost certainly have a problem in SyncPostCheckpoint.
3. Is agg_refill_hash_table a problem? Probably; we've seen cases
with lots of batches.
4. I'm a bit worried about the uses in access/gist/, but I don't know
that code well enough to want to mess with it. It's possible the
list lengths are bounded by the index tree height, in which case it
likely doesn't matter. The logic in gistFindPath looks like a mess
anyway since it's appending to both ends of the "fifo" list in different
places (is that really necessary?).
5. Not sure about CopyMultiInsertInfoFlush ... how many buffers
could we have there?
6. llvm_release_context may not have a long enough list to be a
problem, but on the other hand, it looks easy to fix.
7. The list lengths in the parser and dependency.c, ruleutils.c,
explain.c are bounded by subquery nesting depth or plan tree depth,
so I doubt it's worth worrying about.
8. The uses in namespace.c don't seem like an issue either -- for
instance, GetOverrideSearchPath can't iterate more than twice,
and the overrideStack list shouldn't get very deep.
regards, tom lane
Attachments:
Attachment: avoid-list-delete-first-in-pg-trgm.patch (text/x-diff)
At Sun, 31 Oct 2021 16:55:01 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
I wrote:
Pushed the patch as given. I've not yet reviewed other list_delete_first
callers, but I'll take a look. (I seem to remember that I did survey
them while writing 1cff1b95a, but I evidently missed that this code
could be dealing with a list long enough to be problematic.)

I looked at the remaining list_delete_first callers.
1. Attached is a proposed patch to get rid of the calls in trgm_regexp.c.
I'm not certain that those lists could get long enough to be a problem,
given the existing complexity limits in that file (MAX_EXPANDED_STATES
etc). But I'm not certain they can't, either, and it's easy enough to
fix along the same lines as in StandbyReleaseLockList.
I may be missing something, but at the added list_free() there's a
case where keysQueue has some elements. I think the remaining
elements are useless, but AFAICS all the memory allocated there is
freed after createTrgmNFAInternal returns, before the "next cycle"
comes. Do we need to add that list_free() there?
2. I think we almost certainly have a problem in SyncPostCheckpoint.
Maybe we want list_delete_first_n() or such to remove the first n
elements in a list at once.
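Such a helper doesn't exist yet, so this is hypothetical; the point would be one shift instead of n of them. A standalone model of the idea (the field names only mimic pg_list.h; this is not the real API):

```c
#include <assert.h>
#include <string.h>

/* Standalone model of a hypothetical list_delete_first_n(): drop the
 * first n elements with a single memmove instead of n successive
 * list_delete_first() calls, turning O(n^2) shifting into O(n). */
typedef struct
{
    int elements[16];
    int length;
} IntList;

static void
list_delete_first_n(IntList *list, int n)
{
    if (n >= list->length)
    {
        list->length = 0;       /* the real thing might list_free() */
        return;
    }
    memmove(&list->elements[0], &list->elements[n],
            (list->length - n) * sizeof(int));
    list->length -= n;
}
```

SyncPostCheckpoint, which trims a prefix of its pending-unlink list each cycle, is the kind of caller that could use this.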
3. Is agg_refill_hash_table a problem? Probably; we've seen cases
with lots of batches.
I excluded it since, at a glance, I'm not sure it is in the pattern. I
would want to leave it alone, since changing the logic there seems to
make things a bit complex and the gain from removing list_delete_first
doesn't look so large.
4. I'm a bit worried about the uses in access/gist/, but I don't know
that code well enough to want to mess with it. It's possible the
list lengths are bounded by the index tree height, in which case it
likely doesn't matter. The logic in gistFindPath looks like a mess
anyway since it's appending to both ends of the "fifo" list in different
places (is that really necessary?).
From the other side, the elements are inserted by lcons, then removed
by list_delete_first. It is the worst usage of the current list
implementation as a FIFO. Couldn't we construct and iterate over a
list in the reverse order?
5. Not sure about CopyMultiInsertInfoFlush ... how many buffers
could we have there?
(I'm not sure..)
6. llvm_release_context may not have a long enough list to be a
problem, but on the other hand, it looks easy to fix.
Agreed.
7. The list lengths in the parser and dependency.c, ruleutils.c,
explain.c are bounded by subquery nesting depth or plan tree depth,
so I doubt it's worth worrying about.
Agreed.
8. The uses in namespace.c don't seem like an issue either -- for
instance, GetOverrideSearchPath can't iterate more than twice,
and the overrideStack list shouldn't get very deep.
If we didn't need the resulting list I'd be for changing it, but it
actually is needed. So I don't think we would gain much by changing
the function.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
At Sun, 31 Oct 2021 16:55:01 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
I looked at the remaining list_delete_first callers.
1. Attached is a proposed patch to get rid of the calls in trgm_regexp.c.
I'm not certain that those lists could get long enough to be a problem,
given the existing complexity limits in that file (MAX_EXPANDED_STATES
etc). But I'm not certain they can't, either, and it's easy enough to
fix along the same lines as in StandbyReleaseLockList.
I may be missing something, but at the added list_free() there's a
case where keysQueue has some elements. I think the remaining
elements are useless, but AFAICS all the memory allocated there is
freed after createTrgmNFAInternal returns, before the "next cycle"
comes. Do we need to add that list_free() there?
I was mainly trying to preserve the memory allocation behavior of the
current code, which will free the List when its last element is removed.
I agree that it *probably* doesn't matter --- but processState is
invoked multiple times during one createTrgmNFAInternal call, so I'm
not quite sure that leaking all those lists couldn't amount to anything.
It seemed prudent to ensure that things couldn't be made worse by this
patch.
3. Is agg_refill_hash_table a problem? Probably; we've seen cases
with lots of batches.
I excluded it since, at a glance, I'm not sure it is in the pattern. I
would want to leave it alone, since changing the logic there seems to
make things a bit complex and the gain from removing list_delete_first
doesn't look so large.
It looks to me like nodeAgg.c uses the hash_batches list as a stack:
it's pushing things on with lcons, and later popping them off with
list_delete_first. This is exactly the pattern that 1cff1b95a
recommended reversing. Now, it's possible that that stack never gets
deep enough for the O(N^2) cost to matter, but ...
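The reversal 1cff1b95a recommends would look roughly like this (a sketch, not the actual nodeAgg.c code; llast() and list_delete_last() are assumed from pg_list.h):

```c
/* stack at the head: each push/pop shifts the whole array since 1cff1b9 */
hash_batches = lcons(batch, hash_batches);
...
batch = linitial(hash_batches);
hash_batches = list_delete_first(hash_batches);

/* stack at the tail: push/pop touch only the last element */
hash_batches = lappend(hash_batches, batch);
...
batch = llast(hash_batches);
hash_batches = list_delete_last(hash_batches);
```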
The logic in gistFindPath looks like a mess
anyway since it's appending to both ends of the "fifo" list in different
places (is that really necessary?).
From the other side, the elements are inserted by lcons, then removed
by list_delete_first. It is the worst usage of the current list
implementation as a FIFO. Couldn't we construct and iterate over a
list in the reverse order?
Yeah; at the very least, the use of both lcons and lappend falsifies
the variable name "fifo". I wonder though if that was intentional
or just somebody's sloppy coding.
regards, tom lane
On 10/31/21, 12:39 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Yeah, there's no expectation that this data structure needs to be kept
consistent after an error; and I'm not real sure that the existing
code could claim to satisfy such a requirement if we did need it.
(There's at least a short window where the caller's hash table entry
will point at an already-freed List.)
Right.
Pushed the patch as given. I've not yet reviewed other list_delete_first
callers, but I'll take a look. (I seem to remember that I did survey
them while writing 1cff1b95a, but I evidently missed that this code
could be dealing with a list long enough to be problematic.)
Thanks!
Nathan