BUG #19026: ResourceOwnerForget can't find owner for invalid plancache
The following bug has been logged on the website:
Bug reference: 19026
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 18beta3
Operating system: Ubuntu 24.04
Description:
The following script:
create function part_hashint4_noop(value int4, seed int8)
returns int8 as $$
create table t();
select value + seed;
$$ language sql strict immutable parallel safe;
create operator class part_test_int4_ops for type int4 using hash as
function 2 part_hashint4_noop(int4, int8);
create table pt(i int) partition by hash (i part_test_int4_ops);
create table p1 partition of pt for values with (modulus 4, remainder 0);
insert into pt values (1);
insert into pt values (1);
triggers an expected error and then an internal one:
ERROR: CREATE TABLE is not allowed in a non-volatile function
CONTEXT: SQL function "part_hashint4_noop" statement 1
ERROR: XX000: plancache reference 0x643ae41a7d98 is not owned by resource
owner Portal
CONTEXT: SQL function "part_hashint4_noop" during startup
LOCATION: ResourceOwnerForget, resowner.c:618
I also encountered a gibberish owner name (with more complicated queries)
like:
2025-08-19 16:32:22.878 EEST|postgres|regress002|68a47cdb.2d8921|ERROR:
plancache reference 0x5d56f0675c50 is not owned by resource owner
?^O^^?H??^H?????H??^H??^O^^?AUATUSH??^HI??H?w(H?=??
Reproduced starting from 0313c5dc6.
On Wed, Aug 20, 2025 at 4:01 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
The following bug has been logged on the website:
Bug reference: 19026
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 18beta3
Operating system: Ubuntu 24.04
Description:The following script:
create function part_hashint4_noop(value int4, seed int8)
returns int8 as $$
create table t();
select value + seed;
$$ language sql strict immutable parallel safe;create operator class part_test_int4_ops for type int4 using hash as
function 2 part_hashint4_noop(int4, int8);create table pt(i int) partition by hash (i part_test_int4_ops);
create table p1 partition of pt for values with (modulus 4, remainder 0);insert into pt values (1);
insert into pt values (1);triggers an expected error and then an internal one:
ERROR: CREATE TABLE is not allowed in a non-volatile function
CONTEXT: SQL function "part_hashint4_noop" statement 1ERROR: XX000: plancache reference 0x643ae41a7d98 is not owned by resource
owner Portal
CONTEXT: SQL function "part_hashint4_noop" during startup
LOCATION: ResourceOwnerForget, resowner.c:618
The problem is in init_execution_state() we store the
On error resource owner will be released, but that references are
still maintained in the plan and cowner in fcache, but on error the
transaction will be aborted and this resource owner will be released.
So next time when we try to clean up, it will access invalid memory.
So a simple fix would be to cleanup on error as attached POC.
--
Regards,
Dilip Kumar
Google
Attachments:
release_plan_on_error.patchapplication/octet-stream; name=release_plan_on_error.patchDownload+18-0
Hi!
First of all, Alexander, thank you for the excellent bug report!
On Wed, 20 Aug 2025 at 16:21, Dilip Kumar <dilipbalaut@gmail.com> wrote:
ERROR: XX000: plancache reference 0x643ae41a7d98 is not owned by resource
owner Portal
CONTEXT: SQL function "part_hashint4_noop" during startup
LOCATION: ResourceOwnerForget, resowner.c:618The problem is in init_execution_state() we store the
On error resource owner will be released, but that references are
still maintained in the plan and cowner in fcache, but on error the
transaction will be aborted and this resource owner will be released.
So next time when we try to clean up, it will access invalid memory.
So a simple fix would be to cleanup on error as attached POC.--
Regards,
Dilip Kumar
As for the patch: should we add some regression test for this?
Also, I'm not terribly sure what this fix does is the right thing to do.
Doesn’t it break some layer of abstraction here? My understanding is
that on transaction rollback, all resources should be freed in the
ResourceOwnerRelease
function and friends. Namely, the ReleaseCachedPlan call we make
before elog(ERROR) is breaking logic. Am I wrong?
For this sql cache there exists sql_exec_error_callback, which is
probably a better place to clean up on error.
How about attached?
--
Best regards,
Kirill Reshke
Attachments:
v1-0001-Forget-cached-plan-on-transaction-rollback-in-SQL.patchapplication/octet-stream; name=v1-0001-Forget-cached-plan-on-transaction-rollback-in-SQL.patchDownload+2-1
On Wed, Aug 20, 2025 at 5:13 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
Hi!
First of all, Alexander, thank you for the excellent bug report!
On Wed, 20 Aug 2025 at 16:21, Dilip Kumar <dilipbalaut@gmail.com> wrote:
ERROR: XX000: plancache reference 0x643ae41a7d98 is not owned by resource
owner Portal
CONTEXT: SQL function "part_hashint4_noop" during startup
LOCATION: ResourceOwnerForget, resowner.c:618The problem is in init_execution_state() we store the
On error resource owner will be released, but that references are
still maintained in the plan and cowner in fcache, but on error the
transaction will be aborted and this resource owner will be released.
So next time when we try to clean up, it will access invalid memory.
So a simple fix would be to cleanup on error as attached POC.--
Regards,
Dilip KumarAs for the patch: should we add some regression test for this?
Also, I'm not terribly sure what this fix does is the right thing to do.
Doesn’t it break some layer of abstraction here? My understanding is
that on transaction rollback, all resources should be freed in the
ResourceOwnerRelease
function and friends. Namely, the ReleaseCachedPlan call we make
before elog(ERROR) is breaking logic. Am I wrong?
For this sql cache there exists sql_exec_error_callback, which is
probably a better place to clean up on error.How about attached?
Thanks, right it's better to clean up in sql_exec_error_callback. But
we should still release the cached plan before setting it to NULL in
order to leak the cache plan as I was doing in my patch. So I have
merged the logic of both the ideas and came up with this.
I will add a test for this and send it in sometime.
--
Regards,
Dilip Kumar
Google
Attachments:
v2-0001-Fix-Release-cached-plan-and-clear-function-cache-.patchapplication/octet-stream; name=v2-0001-Fix-Release-cached-plan-and-clear-function-cache-.patchDownload+7-1
On Wed, 20 Aug 2025 at 18:25, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 20, 2025 at 5:13 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
How about attached?
Thanks, right it's better to clean up in sql_exec_error_callback. But
we should still release the cached plan before setting it to NULL in
order to leak the cache plan as I was doing in my patch.
I am under the impression that we don’t. Resource owner will release
all cached plans in the rollback tx handler.
This can be confirmed via gdb session: print fcache->cplan in
sql_exec_error_callback and after this plan in ReleaseCachedPlan in
this repro.
These two pointers will be equal
--
Best regards,
Kirill Reshke
On Wed, Aug 20, 2025 at 7:05 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
On Wed, 20 Aug 2025 at 18:25, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 20, 2025 at 5:13 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
How about attached?
Thanks, right it's better to clean up in sql_exec_error_callback. But
we should still release the cached plan before setting it to NULL in
order to leak the cache plan as I was doing in my patch.I am under the impression that we don’t. Resource owner will release
all cached plans in the rollback tx handler.
This can be confirmed via gdb session: print fcache->cplan in
sql_exec_error_callback and after this plan in ReleaseCachedPlan in
this repro.
These two pointers will be equal
That's a valid point. Thanks. I think we can add the same test case
to what is given in this example, maybe in plpgsql.sql or maybe some
other file which is more relevant. So we can continue with the v1
patch only you have sent, maybe you can add a one liner comment to it.
--
Regards,
Dilip Kumar
Google
Dilip Kumar <dilipbalaut@gmail.com> writes:
That's a valid point. Thanks. I think we can add the same test case
to what is given in this example, maybe in plpgsql.sql or maybe some
other file which is more relevant. So we can continue with the v1
patch only you have sent, maybe you can add a one liner comment to it.
I do not like either of these proposed patches. Dilip's suffers from
an extremely myopic idea of which error reports could trigger the
problem, while Kirill's is abusing (IMV) the purpose of an error
context callback. Those exist to add detail to an error report, not
to clean up state outside the error system, and elog.c doesn't exactly
guarantee that they will be invoked.
The reason this broke at 0313c5dc6 is that that enabled
SQLFunctionCaches to be re-used for the life of the associated
FmgrInfo, and when we are talking about an opclass support procedure,
that FmgrInfo is in the relcache so it is likely to last for the
life of the session. So the presented test case causes us to error
out of execution of the SQL function during the first INSERT, but
its SQLFunctionCache still exists and has fcache->cplan != NULL,
even though error cleanup would've released the reference count
already. When we come back to this point in the second INSERT,
init_execution_state is fooled into trying to release the
already-released cplan.
In practice, fcache->cplan will never be not-null after successful
completion of a SQL function, so one idea is to simply clear it
unconditionally as soon as we know we're starting a fresh execution,
more or less as in alternative-1 attached. However that leaves me
a bit unsatisfied, because it doesn't protect against the case of
erroring out of a set-returning function: if we come in and see
eslist != NULL, we'll pick right back up attempting to execute
plans that probably aren't there anymore. I think that that case
is unreachable today because we don't allow any opclass support
functions to be SRFs, and AFAIK there are no other cases where an
FmgrInfo would be re-used after a failed query. Still, I'm inclined
to go with something more like alternative-2, which feels a little
more future-proof.
regards, tom lane
I wrote:
In practice, fcache->cplan will never be not-null after successful
completion of a SQL function, so one idea is to simply clear it
unconditionally as soon as we know we're starting a fresh execution,
more or less as in alternative-1 attached. However that leaves me
a bit unsatisfied, because it doesn't protect against the case of
erroring out of a set-returning function: if we come in and see
eslist != NULL, we'll pick right back up attempting to execute
plans that probably aren't there anymore. I think that that case
is unreachable today because we don't allow any opclass support
functions to be SRFs, and AFAIK there are no other cases where an
FmgrInfo would be re-used after a failed query. Still, I'm inclined
to go with something more like alternative-2, which feels a little
more future-proof.
After closer inspection: Alexander's test case doesn't expose the full
scope of the problem. If the function suffers an error at run-time
rather than early in setup, then we will fall out with fcache->eslist
being non-null as well as fcache->cplan. Then on the next call,
init_sql_fcache will believe that it is resuming execution of a
set-returning function, and we'll merrily try to execute an executor
state tree that's not there anymore.
So we have to do something more like my alternative-2 than any of
the other proposals, and after reviewing the code I believe it
had better reset the tstore and shutdown_reg fields as well.
Hence, v3 attached, now with regression test case.
regards, tom lane
Attachments:
bug-19026-alternative-3.patchtext/x-diff; charset=us-ascii; name=bug-19026-alternative-3.patchDownload+66-1
On Wed, 20 Aug 2025 at 21:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The reason this broke at 0313c5dc6 is that that enabled
SQLFunctionCaches to be re-used for the life of the associated
FmgrInfo, and when we are talking about an opclass support procedure,
that FmgrInfo is in the relcache so it is likely to last for the
life of the session. So the presented test case causes us to error
out of execution of the SQL function during the first INSERT, but
its SQLFunctionCache still exists and has fcache->cplan != NULL,
even though error cleanup would've released the reference count
already. When we come back to this point in the second INSERT,
init_execution_state is fooled into trying to release the
already-released cplan.
Thank you for this explanation. I understood how things got buggy and
how they should be fixed.
The v3 LGTM.
--
Best regards,
Kirill Reshke