Odd code around ginScanToDelete
Hi,
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:
static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...
if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);
ReleaseBuffer(buffer);
}
if (isRoot)
ReleaseBuffer(buffer);
Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.
This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300
Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.
Greetings,
Andres Freund
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.
Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).
This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.
There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.
It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?
------
Regards,
Alexander Korotkov
Supabase
On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?
Here is the refactoring patch. Sorry for the delay.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v1-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patchapplication/octet-stream; name=v1-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patchDownload+111-78
Hi, Andres and Alexander!
On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?Here is the refactoring patch. Sorry for the delay.
Hi, Andres and Alexander!
I've looked into the patch v1.
Overall, it looks good to me.
Some thoughts:
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.
Could limiting the maximum recursion level be useful?
In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.
Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).
Looks like this additional Assert is not in patch v1.
In the root call of ginScanToDelete(gvs, &root); we can add Assert
checking that its return result is false:
- ginScanToDelete(gvs, &root);
+ deleted = ginScanToDelete(gvs, &root);
+. Assert(!deleted); /* Root page is never deleted */
Additionally, it could be good to rename all vacuum functions related
to posting tree pages only, to include "Posting" (e.g., ginDeletePage
-> ginDeletePostingPage). The same is for the functions only for the
entry tree. It is already named this way in many places (e.g.
ginVacuumPostingTreeLeaves). It could be good to extend this to all
relevant functions.
Several small proposals on wording:
"rightmost non-deleted page to its left" -> "closest non-deleted
sibling page to its left"
"each entry tracks the buffer of the page" -> "each entry tracks the
buffers of the page" (as two buffers are mentioned)
"must already be pinned" -> "must already have been pinned"
Best regards,
Pavel Borisov
Supabase
Hi Pavel, Alexander,
On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Andres and Alexander!
On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?Here is the refactoring patch. Sorry for the delay.
Hi, Andres and Alexander!
I've looked into the patch v1.
Overall, it looks good to me.
The refactor LGTM in general. The buffer-ownership rewrite looks
cleaner and safer overall.
Some thoughts:
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.
I think freeing myStackItem->child inside recursive calls might not be
worthwhile here. That node is intentionally reused for subsequent
siblings at the same depth, and it carries state (leftBuffer) that can
still be needed until the level is fully processed.
Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundary
Could limiting the maximum recursion level be useful?
Posting-tree depth is naturally small; a hard cap seems to add failure
risk with little practical benefit.
In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).Looks like this additional Assert is not in patch v1.
In the root call of ginScanToDelete(gvs, &root); we can add Assert checking that its return result is false: - ginScanToDelete(gvs, &root); + deleted = ginScanToDelete(gvs, &root); +. Assert(!deleted); /* Root page is never deleted */
+ 1, this is a good invariant check and improves readability
One minor nit for the comment:
The DataPageDeleteStack.buffer field comment says "valid only while
recursing into children"
this is true for internal pages, but for leaf pages the buffer is
valid until the pageWasDeleted / leftBuffer decision. The validity
window is actually "from when the caller sets it until the
post-recursion cleanup."
--
Best,
Xuneng
Hi, Xuneng
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundary
I proposed not freeing child when child iteration is complete. They
indeed can be reused. I proposed cleaning children when "my" iteration
is complete. At that time all the children iterations are completed
and not needed when we return level up.
Regards,
Pavel Borisov
Supabase
On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Xuneng
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryI proposed not freeing child when child iteration is complete. They
indeed can be reused. I proposed cleaning children when "my" iteration
is complete. At that time all the children iterations are completed
and not needed when we return level up.
This is not clear for me. We need stack items to keep track of left
pages until we scan the whole posting tree. After scanning the whole
posting tree we can free stack items as we do now.
------
Regards,
Alexander Korotkov
Supabase
On Thu, 12 Mar 2026 at 02:22, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Xuneng
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryI proposed not freeing child when child iteration is complete. They
indeed can be reused. I proposed cleaning children when "my" iteration
is complete. At that time all the children iterations are completed
and not needed when we return level up.This is not clear for me. We need stack items to keep track of left
pages until we scan the whole posting tree. After scanning the whole
posting tree we can free stack items as we do now.
Hi, Alexander!
You are right, that we can free all posting tree stack items after the
whole tree, as we do now. But I think we can also do it earlier. It
looks like all "children" items are needed and could be reused only
until iteration on "my" level ends. When function returns up the
recursion "my" level becomes "child" for a caller, and previous
"child" is not used anymore.
It's optional, maybe we can do freeing at the end of posting tree iteration.
Regards,
Pavel Borisov
On Thu, Mar 12, 2026 at 12:35 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
On Thu, 12 Mar 2026 at 02:22, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Xuneng
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryI proposed not freeing child when child iteration is complete. They
indeed can be reused. I proposed cleaning children when "my" iteration
is complete. At that time all the children iterations are completed
and not needed when we return level up.This is not clear for me. We need stack items to keep track of left
pages until we scan the whole posting tree. After scanning the whole
posting tree we can free stack items as we do now.You are right, that we can free all posting tree stack items after the
whole tree, as we do now. But I think we can also do it earlier. It
looks like all "children" items are needed and could be reused only
until iteration on "my" level ends. When function returns up the
recursion "my" level becomes "child" for a caller, and previous
"child" is not used anymore.
No matter how many levels we can go up, we can still descend and need
the leftBuffer stored at any stack level.
------
Regards,
Alexander Korotkov
Supabase
Hi, Pavel!
Thank you for your review!
On Fri, Mar 6, 2026 at 4:45 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Some thoughts:
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.Could limiting the maximum recursion level be useful?
In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.
Already answered in this thread.
Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).Looks like this additional Assert is not in patch v1.
In the root call of ginScanToDelete(gvs, &root); we can add Assert checking that its return result is false: - ginScanToDelete(gvs, &root); + deleted = ginScanToDelete(gvs, &root); +. Assert(!deleted); /* Root page is never deleted */
Done.
Additionally, it could be good to rename all vacuum functions related
to posting tree pages only, to include "Posting" (e.g., ginDeletePage
-> ginDeletePostingPage). The same is for the functions only for the
entry tree. It is already named this way in many places (e.g.
ginVacuumPostingTreeLeaves). It could be good to extend this to all
relevant functions.
Renamed as you proposed.
Several small proposals on wording:
"rightmost non-deleted page to its left" -> "closest non-deleted
sibling page to its left"
I renamed that to just "left sibling". Deleted pages are not in the
tree already. And "the rightmost page to its left" is just left
sibling.
"each entry tracks the buffer of the page" -> "each entry tracks the
buffers of the page" (as two buffers are mentioned)
I prefer to just use word "buffer" twice to make it more explicit.
"must already be pinned" -> "must already have been pinned"
Done.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v2-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patchapplication/octet-stream; name=v2-0001-Rework-ginScanToDelete-to-pass-Buffers-instead-of.patchDownload+113-78
On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Andres and Alexander!
On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?Here is the refactoring patch. Sorry for the delay.
Hi, Andres and Alexander!
I've looked into the patch v1.
Overall, it looks good to me.The refactor LGTM in general. The buffer-ownership rewrite looks
cleaner and safer overall.Some thoughts:
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.I think freeing myStackItem->child inside recursive calls might not be
worthwhile here. That node is intentionally reused for subsequent
siblings at the same depth, and it carries state (leftBuffer) that can
still be needed until the level is fully processed.
Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryCould limiting the maximum recursion level be useful?
Posting-tree depth is naturally small; a hard cap seems to add failure
risk with little practical benefit.In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).Looks like this additional Assert is not in patch v1.
In the root call of ginScanToDelete(gvs, &root); we can add Assert checking that its return result is false: - ginScanToDelete(gvs, &root); + deleted = ginScanToDelete(gvs, &root); +. Assert(!deleted); /* Root page is never deleted */+ 1, this is a good invariant check and improves readability
One minor nit for the comment:
The DataPageDeleteStack.buffer field comment says "valid only while
recursing into children"
this is true for internal pages, but for leaf pages the buffer is
valid until the pageWasDeleted / leftBuffer decision. The validity
window is actually "from when the caller sets it until the
post-recursion cleanup."
Thank you for catching this. I decided to remove this statement from
v2. It's hard to explain the life cycle of the buffer clearly in one
sentence. On the other hand, it's explained in the comments of
ginScanPostingTreeToDelete().
------
Regards,
Alexander Korotkov
Supabase
On Thu, 12 Mar 2026 at 02:52, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Andres and Alexander!
On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?Here is the refactoring patch. Sorry for the delay.
Hi, Andres and Alexander!
I've looked into the patch v1.
Overall, it looks good to me.The refactor LGTM in general. The buffer-ownership rewrite looks
cleaner and safer overall.Some thoughts:
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.I think freeing myStackItem->child inside recursive calls might not be
worthwhile here. That node is intentionally reused for subsequent
siblings at the same depth, and it carries state (leftBuffer) that can
still be needed until the level is fully processed.
Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryCould limiting the maximum recursion level be useful?
Posting-tree depth is naturally small; a hard cap seems to add failure
risk with little practical benefit.In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).Looks like this additional Assert is not in patch v1.
In the root call of ginScanToDelete(gvs, &root); we can add Assert checking that its return result is false: - ginScanToDelete(gvs, &root); + deleted = ginScanToDelete(gvs, &root); +. Assert(!deleted); /* Root page is never deleted */+ 1, this is a good invariant check and improves readability
One minor nit for the comment:
The DataPageDeleteStack.buffer field comment says "valid only while
recursing into children"
this is true for internal pages, but for leaf pages the buffer is
valid until the pageWasDeleted / leftBuffer decision. The validity
window is actually "from when the caller sets it until the
post-recursion cleanup."Thank you for catching this. I decided to remove this statement from
v2. It's hard to explain the life cycle of the buffer clearly in one
sentence. On the other hand, it's explained in the comments of
ginScanPostingTreeToDelete().
Hi, Alexander!
Patch v2 looks good to me. I also agree with Xuneng that this
refactoring improved the logic to make it look clearer.
Thank you for the explanation of buffers lifetime!
Regards,
Pavel Borisov
Hi, Pavel!
On Thu, Mar 12, 2026 at 10:41 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
On Thu, 12 Mar 2026 at 02:52, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Andres and Alexander!
On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?Here is the refactoring patch. Sorry for the delay.
Hi, Andres and Alexander!
I've looked into the patch v1.
Overall, it looks good to me.The refactor LGTM in general. The buffer-ownership rewrite looks
cleaner and safer overall.Some thoughts:
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.I think freeing myStackItem->child inside recursive calls might not be
worthwhile here. That node is intentionally reused for subsequent
siblings at the same depth, and it carries state (leftBuffer) that can
still be needed until the level is fully processed.
Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryCould limiting the maximum recursion level be useful?
Posting-tree depth is naturally small; a hard cap seems to add failure
risk with little practical benefit.In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).Looks like this additional Assert is not in patch v1.
In the root call of ginScanToDelete(gvs, &root); we can add Assert checking that its return result is false: - ginScanToDelete(gvs, &root); + deleted = ginScanToDelete(gvs, &root); +. Assert(!deleted); /* Root page is never deleted */+ 1, this is a good invariant check and improves readability
One minor nit for the comment:
The DataPageDeleteStack.buffer field comment says "valid only while
recursing into children"
this is true for internal pages, but for leaf pages the buffer is
valid until the pageWasDeleted / leftBuffer decision. The validity
window is actually "from when the caller sets it until the
post-recursion cleanup."Thank you for catching this. I decided to remove this statement from
v2. It's hard to explain the life cycle of the buffer clearly in one
sentence. On the other hand, it's explained in the comments of
ginScanPostingTreeToDelete().Patch v2 looks good to me. I also agree with Xuneng that this
refactoring improved the logic to make it look clearer.
Thank you for the explanation of buffers lifetime!
Thank you for your feedback. I'll push the patch if no objections.
------
Regards,
Alexander Korotkov
Supabase
At 2026-03-12 17:05:51, "Alexander Korotkov" <aekorotkov@gmail.com> wrote:
Hi, Pavel!
On Thu, Mar 12, 2026 at 10:41 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
On Thu, 12 Mar 2026 at 02:52, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Mar 10, 2026 at 11:19 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
On Fri, Mar 6, 2026 at 10:45 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Andres and Alexander!
On Sat, 21 Feb 2026 at 13:55, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Feb 4, 2026 at 12:32 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Feb 3, 2026 at 6:26 PM Andres Freund <andres@anarazel.de> wrote:
While looking at converting more places to UnlockReleaseBuffer(), in the
course of making UnlockReleaseBuffer() faster than the two separate
operations, I found this code:static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
DataPageDeleteStack *parent, OffsetNumber myoff)
...if (!meDelete)
{
if (BufferIsValid(me->leftBuffer))
UnlockReleaseBuffer(me->leftBuffer);
me->leftBuffer = buffer;
}
else
{
if (!isRoot)
LockBuffer(buffer, GIN_UNLOCK);ReleaseBuffer(buffer);
}if (isRoot)
ReleaseBuffer(buffer);Which sure looks like it'd release buffer twice if isRoot is set? I guess
that's not reachable, because presumably the root page will always go down the
!meDelete path. But it sure made me wonder if there's a hard to reach bug.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).This code was introduced in
commit e14641197a5
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: 2019-11-19 23:07:36 +0300Fix deadlock between ginDeletePage() and ginStepRight()
I didn't trace it further to see if it existed before that in some fashion.
Yes. I think generally this area needs to be reworked to become more
clear, and have vast more comments. It was wrong from my side trying
to fix bugs there without reworking it into something more
appropriate. I'm planning to put work on this during this week.There's another oddity here: ginScanToDelete() requires that the root page has
been locked by the caller already, but will afaict re-read the root page? But
then have code to avoid locking it again, because that would not have worked?
Seems odd.It seems a bit odd for me that caller already have locked buffer, but
passes BlockNumber making us re-read the buffer. But I'm not sure
that's the same as your point. Could you, please, elaborate more on
this?Here is the refactoring patch. Sorry for the delay.
Hi, Andres and Alexander!
I've looked into the patch v1.
Overall, it looks good to me.The refactor LGTM in general. The buffer-ownership rewrite looks
cleaner and safer overall.Some thoughts:
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.I think freeing myStackItem->child inside recursive calls might not be
worthwhile here. That node is intentionally reused for subsequent
siblings at the same depth, and it carries state (leftBuffer) that can
still be needed until the level is fully processed.
Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryCould limiting the maximum recursion level be useful?
Posting-tree depth is naturally small; a hard cap seems to add failure
risk with little practical benefit.In the comment to myStackItem before ginScanToDelete(), it might be
worth adding that after processing all pages on the current level,
myStackItem is not needed anymore.Yes, it's not possible to have meDelete set for root, because
me->leftBuffer is always InvalidBuffer for the root. So the branch
handling meDelete case should better do Assert(!isRoot).Looks like this additional Assert is not in patch v1.
In the root call of ginScanToDelete(gvs, &root); we can add Assert checking that its return result is false: - ginScanToDelete(gvs, &root); + deleted = ginScanToDelete(gvs, &root); +. Assert(!deleted); /* Root page is never deleted */+ 1, this is a good invariant check and improves readability
One minor nit for the comment:
The DataPageDeleteStack.buffer field comment says "valid only while
recursing into children"
this is true for internal pages, but for leaf pages the buffer is
valid until the pageWasDeleted / leftBuffer decision. The validity
window is actually "from when the caller sets it until the
post-recursion cleanup."Thank you for catching this. I decided to remove this statement from
v2. It's hard to explain the life cycle of the buffer clearly in one
sentence. On the other hand, it's explained in the comments of
ginScanPostingTreeToDelete().Patch v2 looks good to me. I also agree with Xuneng that this
refactoring improved the logic to make it look clearer.
Thank you for the explanation of buffers lifetime!Thank you for your feedback. I'll push the patch if no objections.
------
Regards,
Alexander Korotkov
Supabase
Hi,
Overall solid to me. I got a nitpick:
in the code comments, ginDeletePage should be ginDeletePostingPag(ginDeletePage -> ginDeletePostingPage).
There are 3 places that need to be revised.
------
Regards,
jinbinge
On Thu, Mar 12, 2026 at 6:37 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Mar 12, 2026 at 12:35 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
On Thu, 12 Mar 2026 at 02:22, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Mar 10, 2026 at 11:29 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
Hi, Xuneng
Is it worth/possible in recursive calls of ginScanToDelete() to free
allocated myStackItem->child after processing all children of the
current level, when they are not needed anymore?
Previously to this patch, palloc-ed "me" variable also was't freed at
recursion levels.Freeing/reallocating it per subtree would add churn and make the
lifetime rules harder to reason about without meaningful memory
savings (the number of nodes is bounded by tree depth, not number of
pages). We currently free the chain once after ginScanToDelete()
returns in ginVacuumPostingTree(), which matches the natural lifetime
boundaryI proposed not freeing child when child iteration is complete. They
indeed can be reused. I proposed cleaning children when "my" iteration
is complete. At that time all the children iterations are completed
and not needed when we return level up.This is not clear for me. We need stack items to keep track of left
pages until we scan the whole posting tree. After scanning the whole
posting tree we can free stack items as we do now.You are right, that we can free all posting tree stack items after the
whole tree, as we do now. But I think we can also do it earlier. It
looks like all "children" items are needed and could be reused only
until iteration on "my" level ends. When function returns up the
recursion "my" level becomes "child" for a caller, and previous
"child" is not used anymore.No matter how many levels we can go up, we can still descend and need
the leftBuffer stored at any stack level.
Yeah, the important point is that a stack item here represents
per-depth scan state, not just one recursive invocation. Returning
from one subtree does not necessarily mean that depth is finished
globally: the caller may move to a sibling and descend again, and that
later descent can still need the child level's saved leftBuffer from
the subtree we just finished. The stronger condition is that no more
pages remain to be scanned to the right at that depth; the code
already uses GinPageRightMost(page) for that when releasing the child
level's leftBuffer.
--
Best,
Xuneng