A very quick observation of dangling pointers in Postgres pathlists
Hi,
It looks like a community decision has been developing that Postgres should
separate optimisation features into 'conventional' and 'magic' classes [1]/messages/by-id/CA+TgmoaPgXYYEivQWxyVV=eYhN+T9JAgS9Xe4m7g9wVitVPF8g@mail.gmail.com. This
has raised my concern that hidden contracts about pathlists' state and ordering
could lead to subtle bugs if an extension optimisation goes too far.
I think this topic is of interest because of the growing number of features that
impact path choice, such as ‘disable node’ or pg_plan_advice. Also, emerging
techniques that involve two or more levels of plan trees, like ‘eager
aggregation’, might catch another dangling pointer hidden in path lists for a
while. Don’t forget complicated cases with FDW and Custom nodes too.
For this purpose, a tiny debugging extension module, pg_pathcheck [2]https://github.com/danolivo/pg_pathcheck, has been
invented. It uses create_upper_paths_hook and planner_shutdown_hook. The
extension walks the entire Path tree, starting from the top PlannerInfo, then
recurses into glob::subroots, traversing each RelOptInfo and each pathlist.
Also, it traverses the path→subpath subtrees to ensure that potentially quite
complex path trees are covered when implemented as a single RelOptInfo. For each
pointer it visits, it checks if the NodeTag matches a known Path type. If not,
the memory was freed (and, with CLOBBER_FREED_MEMORY, set to 0x7F) or reused for
something else.
This approach is not free of caveats. For example, most Path nodes and many Plan
nodes fall within the 128-byte gap of the minimal allocated chunk. That means
freeing one path allows the optimiser to immediately allocate another Path node
at a potentially different query tree level. I had such a case at least once in
production. It was actually hard to realise, reproduce, and fix.
Running make check-world tests with the debug module loaded at startup revealed
many cases in which RelOptInfo structures contain dangling pointers. What
exactly do we see there?
The pathlist contents at the moment of an ‘Invalid’ path detection:
* ProjectionPath, Invalid — by far the most common, on JOIN RelOptInfos.
* ProjectionPath, Invalid, SortPath.
* AggPath, Invalid.
* NestPath, Invalid
* HashPath, Invalid
* cheapest_startup_path referencing a dangling pointer, on what looks
like a join of two partitions.
* cheapest_startup_path referencing a dangling pointer on a plain base
RelOptInfo.
The best-known problematic code example causing this issue is
apply_scanjoin_target_to_paths(), and the current_rel/final_rel game from commit
0927d2f46dd. Quickly fixing it, I see some more combinations have emerged:
* UniquePath, Invalid
* MergePath, Invalid
* SubqueryScanPath, Invalid
* SetOpPath, Invalid
* GatherPath, Path, Invalid
* AppendPath, AggPath, Invalid, AggPath
* HashPath, Invalid
* AppendPath, HashPath, Invalid
These new invalid references occur outside the originally identified code path,
showing that fixing one place does not address the broader issue (maybe my fixes
were wrong?). While some claim that the cost-dominance principle ('the cheapest
path is never invalid') provides safety, I have not found any acknowledgment of
this. As the planner is expanded, undocumented rules leave the system vulnerable.
The purpose of this email is basically to highlight the issue and raise a
discussion on how to solve it. Ashutosh designed a 'smart pointer' approach,
which seems the most balanced and bulletproof way. Another approach: 'used' flag
seems less interesting as well as local memory contexts - we should always
remember about multi-children cases that need freeing unnecessary paths in-place
to reduce memory consumption. But before diving into the code and identifying
origins of these cases, I’d like to know: is it an actual problem, or is the
cost-dominance contract enough?
[1]: /messages/by-id/CA+TgmoaPgXYYEivQWxyVV=eYhN+T9JAgS9Xe4m7g9wVitVPF8g@mail.gmail.com
/messages/by-id/CA+TgmoaPgXYYEivQWxyVV=eYhN+T9JAgS9Xe4m7g9wVitVPF8g@mail.gmail.com
[2]: https://github.com/danolivo/pg_pathcheck
--
regards, Andrei Lepikhov,
pgEdge
On 17/04/2026 10:56, Andrei Lepikhov wrote:
The best-known problematic code example causing this issue is
apply_scanjoin_target_to_paths(), and the current_rel/final_rel game from commit
0927d2f46dd. Quickly fixing it, I see some more combinations have emerged:
On closer inspection, it looks like all the detected cases come from the same
issue in create_ordered_paths. The ordered_rel has the same path in its pathlist
as the input_rel. Sometimes, this path is removed and freed from ordered_rel,
which leads to a dangling pointer in the child RelOptInfo.
I've attached a patch that shows how to fix the issue. Some regression tests
change because of a hidden rule where a projection and its subpath have
different target lists. Right now, the patch always enforces a projection, even
if the target lists are the same. This is still open for discussion on whether
there's a better way to handle it.
--
regards, Andrei Lepikhov,
pgEdge
Attachments:
v0-0001-Do-not-put-one-path-into-different-pathlists.patchtext/plain; charset=UTF-8; name=v0-0001-Do-not-put-one-path-into-different-pathlists.patchDownload+60-42
On Tue, 21 Apr 2026 at 19:29, Andrei Lepikhov <lepihov@gmail.com> wrote:
On 17/04/2026 10:56, Andrei Lepikhov wrote:
The best-known problematic code example causing this issue is
apply_scanjoin_target_to_paths(), and the current_rel/final_rel game from commit
0927d2f46dd. Quickly fixing it, I see some more combinations have emerged:On closer inspection, it looks like all the detected cases come from the same
issue in create_ordered_paths. The ordered_rel has the same path in its pathlist
as the input_rel. Sometimes, this path is removed and freed from ordered_rel,
which leads to a dangling pointer in the child RelOptInfo.I've attached a patch that shows how to fix the issue. Some regression tests
change because of a hidden rule where a projection and its subpath have
different target lists. Right now, the patch always enforces a projection, even
if the target lists are the same. This is still open for discussion on whether
there's a better way to handle it.
IMO, we should write a function like copy_path() or reparent_path(),
which creates a copy of the given Path, or the latter also would copy
then set the ->parent to the given RelOptInfo. Any time we use a path
directly from the pathlist of another RelOptInfo, we should reparent
or copy it. We could add an Assert in add_path() to check the new path
has the correct parent to help us find the places where we forget to
do this.
David
On 17.04.2026 11:56, Andrei Lepikhov wrote:
Hi,
It looks like a community decision has been developing that Postgres should
separate optimisation features into 'conventional' and 'magic' classes [1]. This
has raised my concern that hidden contracts about pathlists' state and ordering
could lead to subtle bugs if an extension optimisation goes too far.I think this topic is of interest because of the growing number of features that
impact path choice, such as ‘disable node’ or pg_plan_advice. Also, emerging
techniques that involve two or more levels of plan trees, like ‘eager
aggregation’, might catch another dangling pointer hidden in path lists for a
while. Don’t forget complicated cases with FDW and Custom nodes too.For this purpose, a tiny debugging extension module, pg_pathcheck [2], has been
invented. It uses create_upper_paths_hook and planner_shutdown_hook. The
extension walks the entire Path tree, starting from the top PlannerInfo, then
recurses into glob::subroots, traversing each RelOptInfo and each pathlist.
Also, it traverses the path→subpath subtrees to ensure that potentially quite
complex path trees are covered when implemented as a single RelOptInfo. For each
pointer it visits, it checks if the NodeTag matches a known Path type. If not,
the memory was freed (and, with CLOBBER_FREED_MEMORY, set to 0x7F) or reused for
something else.This approach is not free of caveats. For example, most Path nodes and many Plan
nodes fall within the 128-byte gap of the minimal allocated chunk. That means
freeing one path allows the optimiser to immediately allocate another Path node
at a potentially different query tree level. I had such a case at least once in
production. It was actually hard to realise, reproduce, and fix.
Hi! I raised such a problem before in this thread and proposed a patch
to delete freed refused paths from pathlist.
You can find it here [0]/messages/by-id/CAExHW5uhc5JVOUExjo24oYLLcJAyD04+BRb080sV08pO_=7w=A@mail.gmail.com of you are interested.
[0]: /messages/by-id/CAExHW5uhc5JVOUExjo24oYLLcJAyD04+BRb080sV08pO_=7w=A@mail.gmail.com
/messages/by-id/CAExHW5uhc5JVOUExjo24oYLLcJAyD04+BRb080sV08pO_=7w=A@mail.gmail.com
--
-----------
Best regards,
Alena Rybakina
On 21/04/2026 10:35, David Rowley wrote:
On Tue, 21 Apr 2026 at 19:29, Andrei Lepikhov <lepihov@gmail.com> wrote:
I've attached a patch that shows how to fix the issue. Some regression tests
change because of a hidden rule where a projection and its subpath have
different target lists. Right now, the patch always enforces a projection, even
if the target lists are the same. This is still open for discussion on whether
there's a better way to handle it.IMO, we should write a function like copy_path() or reparent_path(),
which creates a copy of the given Path, or the latter also would copy
then set the ->parent to the given RelOptInfo. Any time we use a path
directly from the pathlist of another RelOptInfo, we should reparent
or copy it. We could add an Assert in add_path() to check the new path
has the correct parent to help us find the places where we forget to
do this.
It would be great to have a copy_path() function. At the moment, I create a
limited version each time in an extension module, using
reparameterize_path_by_child as a guide since it ensures the core can handle
path copies.
Do you mean we can introduce such a copy routine to fix current issue? Here is
the problem: dangling pointers are detected only by external tools. I can't
imagine an SQL reproducer to test this machinery.
--
regards, Andrei Lepikhov,
pgEdge
On Tue, 21 Apr 2026 at 20:54, Andrei Lepikhov <lepihov@gmail.com> wrote:
On 21/04/2026 10:35, David Rowley wrote:
IMO, we should write a function like copy_path() or reparent_path(),
which creates a copy of the given Path, or the latter also would copy
then set the ->parent to the given RelOptInfo. Any time we use a path
directly from the pathlist of another RelOptInfo, we should reparent
or copy it. We could add an Assert in add_path() to check the new path
has the correct parent to help us find the places where we forget to
do this.It would be great to have a copy_path() function. At the moment, I create a
limited version each time in an extension module, using
reparameterize_path_by_child as a guide since it ensures the core can handle
path copies.
Do you mean we can introduce such a copy routine to fix current issue? Here is
the problem: dangling pointers are detected only by external tools. I can't
imagine an SQL reproducer to test this machinery.
I had anticipated that we'd only fix in master as we'd probably need a
new callback in CustomPathMethods.
David
On 21/04/2026 10:35, David Rowley wrote:
On Tue, 21 Apr 2026 at 19:29, Andrei Lepikhov <lepihov@gmail.com> wrote:
I've attached a patch that shows how to fix the issue. Some regression tests
change because of a hidden rule where a projection and its subpath have
different target lists. Right now, the patch always enforces a projection, even
if the target lists are the same. This is still open for discussion on whether
there's a better way to handle it.IMO, we should write a function like copy_path() or reparent_path(),
which creates a copy of the given Path, or the latter also would copy
then set the ->parent to the given RelOptInfo. Any time we use a path
directly from the pathlist of another RelOptInfo, we should reparent
or copy it. We could add an Assert in add_path() to check the new path
has the correct parent to help us find the places where we forget to
do this.
I've attached the patch so we can keep the discussion going.
I used a shallow copy since re-parenting is not that obvious. As I see it, the
parent pointer does not indicate ownership. Instead, it points to the source
operator (RelOptInfo). For instance, createplan.c uses it to get the relid of
the scan operation.
Should the parent point to the pathlist's owner? Possibly, but right now I am
not sure how introducing such an explicit contract would affect the optimiser.
This issue can't be explicitly reproduced with the current optimiser without
deep code intervention. So, if tests are needed, I propose a minor debug-only
check inside the plan-building code: with best_path, we can scan RelOptInfo's
pathlist and partial_pathlist to detect dangling pointers. It seems not stable
enough, so I just left the patch without a test infrastructure.
--
regards, Andrei Lepikhov,
pgEdge
Attachments:
0001-Fix-dangling-Path-pointer-in-create_ordered_paths.patchtext/plain; charset=UTF-8; name=0001-Fix-dangling-Path-pointer-in-create_ordered_paths.patchDownload+166-3
On 27/04/2026 10:19, Andrei Lepikhov wrote:
On 21/04/2026 10:35, David Rowley wrote:
IMO, we should write a function like copy_path() or reparent_path(),
which creates a copy of the given Path, or the latter also would copy
then set the ->parent to the given RelOptInfo. Any time we use a path
directly from the pathlist of another RelOptInfo, we should reparent
or copy it. We could add an Assert in add_path() to check the new path
has the correct parent to help us find the places where we forget to
do this.I've attached the patch so we can keep the discussion going.
While playing with random path choices [1]https://github.com/danolivo/pg-chaos-test, I found additional cases where a
path is assigned to two different RelOptInfos. See the attachment for a modified
patch.
[1]: https://github.com/danolivo/pg-chaos-test
--
regards, Andrei Lepikhov,
pgEdge