BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

Started by PG Bug reporting form, 27 days ago, 9 messages, bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 19480
Logged by: Andrzej Doros
Email address: adoros@starfishstorage.com
PostgreSQL version: 17.9
Operating system: Ubuntu 22.04.5 LTS (x86_64), kernel 5.15, glibc 2.
Description:

PostgreSQL version: 17.9 (production crash), confirmed identical on 17.10
OS: Ubuntu 22.04.5 LTS, x86_64, kernel 5.15, glibc 2.35
Package: postgresql-plpython3-17 from pgdg apt repository

DESCRIPTION
-----------

A PL/Python set-returning function (SRF) crashes the backend with SIGSEGV when
another session executes CREATE OR REPLACE FUNCTION (or ALTER FUNCTION) on the
same function while the SRF is mid-iteration.

This is a use-after-free. srfstate->savedargs is allocated inside proc->mcxt by
PLy_function_save_args() (plpy_exec.c:503). On each per-call SRF invocation,
plpython3_call_handler calls PLy_procedure_get(), which may call
PLy_procedure_delete(old_proc) -> MemoryContextDelete(old_proc->mcxt) if the
function's pg_proc row has changed (different xmin or ctid). After that,
srfstate->savedargs is a dangling pointer; it is not cleared. The next
PLy_function_restore_args() reads freed memory:

if (srfstate->savedargs)    /* non-NULL dangling pointer */
    PLy_function_restore_args(proc, srfstate->savedargs);    /* reads freed mem */

Inside PLy_function_restore_args (plpy_exec.c:551):

for (i = 0; i < savedargs->nargs; i++)    /* nargs from freed memory */
{
    if (proc->argnames[i] && ...)
        PyDict_SetItemString(..., proc->argnames[i], ...);

When savedargs->nargs is garbage (e.g. 2056017128 in two production core dumps),
proc->argnames[i] for large i reads an invalid pointer, which is passed to
PyDict_SetItemString -> PyUnicode_FromString -> strlen -> SIGSEGV.
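
The failure mode can be illustrated with a toy model in plain Python. This is
only a sketch, not PostgreSQL code: Arena is a hypothetical stand-in for a
MemoryContext, and filling freed memory with 0x7F bytes is an assumption that
mimics a memory-clobbering debug build. In a production build the garbage is
whatever happens to be allocated there next.

```python
import struct


class Arena:
    """Hypothetical stand-in for a MemoryContext: a byte buffer with bump allocation."""

    def __init__(self, size=64):
        self.buf = bytearray(size)
        self.used = 0

    def alloc(self, nbytes):
        offset = self.used          # the "pointer" is just an offset into buf
        self.used += nbytes
        return offset

    def delete(self):
        # Assumption: freed memory is overwritten with 0x7F bytes.
        self.buf[:] = b"\x7f" * len(self.buf)


proc_mcxt = Arena()
savedargs = proc_mcxt.alloc(4)                        # plays srfstate->savedargs
struct.pack_into("<i", proc_mcxt.buf, savedargs, 1)   # savedargs->nargs = 1

nargs_before = struct.unpack_from("<i", proc_mcxt.buf, savedargs)[0]
proc_mcxt.delete()                                    # plays PLy_procedure_delete(old_proc)
nargs_after = struct.unpack_from("<i", proc_mcxt.buf, savedargs)[0]  # stale read

argnames = ["flavour"]
in_bounds = nargs_after <= len(argnames)   # False: the loop would run past argnames[]
```

The stale "pointer" is still non-NULL after the delete, so a NULL check alone
cannot detect the problem.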

CRASH STACK (two identical core dumps from production, PG 17.9, Ubuntu 22.04)
------------------------------------------------------------------------------

#0 __strlen_evex()
#1 PyUnicode_FromString(u=0x69ffff0000)
#2 PyDict_SetItemString(...)
#3 PLy_function_restore_args(proc=..., savedargs=...)
#4 PLy_exec_function(...)
#5 plpython3_call_handler(...)
#6 fmgr_security_definer(...)
#7 ExecMakeTableFunctionResult(...)

State from the newer core dump:

proc->proname = "tags_report_plpython"
proc->nargs = 1
proc->argnames[0]= "flavour"
savedargs->nargs = 2056017128 <- should be 1; contains garbage
savedargs->namedargs[0] = 'tags' <- still valid (not yet overwritten)
i = 4 <- loop has iterated far past argnames[]

TRIGGER CONDITION
-----------------

The pg_proc invalidation reaches Session A's backend when
AcceptInvalidationMessages() is called. This happens when Session A's Python
code calls plpy.execute() with a statement that acquires a NEW relation lock
(e.g. CREATE TEMP TABLE, any table not previously locked in this statement).
Simply calling plpy.execute("SELECT 1") is not sufficient because the lock on
pg_proc is already held and subsequent requests are served from the per-process
lock table without invoking AcceptInvalidationMessages.

In production the trigger is autovacuum on pg_proc (which moves the tuple's
ctid) or any concurrent DDL from another session. Long-running SRFs (hours)
are much more likely to hit this window.
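
The trigger condition described above can be sketched as a toy model in plain
Python. It only mirrors the report's description; the Backend class, its
fields, and the message string are hypothetical and greatly simplify the real
lock manager and invalidation queue.

```python
class Backend:
    """Toy backend: invalidations are absorbed only when a lock is newly acquired."""

    def __init__(self):
        self.pending_invalidations = []   # messages queued by other sessions
        self.local_locks = set()          # relations already locked in this statement
        self.seen = []                    # invalidations this backend has processed

    def accept_invalidation_messages(self):
        self.seen.extend(self.pending_invalidations)
        self.pending_invalidations.clear()

    def lock_relation(self, relname):
        if relname in self.local_locks:
            return False                  # served from the local lock table, no catch-up
        self.local_locks.add(relname)
        self.accept_invalidation_messages()
        return True


session_a = Backend()
session_a.lock_relation("pg_proc")        # taken early in the statement
session_a.pending_invalidations.append("pg_proc: repro_srf replaced")  # Session B's DDL

session_a.lock_relation("pg_proc")        # lock already held: nothing is processed
seen_after_relock = list(session_a.seen)

session_a.lock_relation("_rt_0")          # new relation lock, like CREATE TEMP TABLE
seen_after_new_lock = list(session_a.seen)
```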

STEPS TO REPRODUCE
------------------

Requires two concurrent sessions and PostgreSQL with plpython3u.

Session A — start and leave running:

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION repro_srf(flavour VARCHAR)
RETURNS TABLE (i BIGINT) AS $$
import time
for i in range(100):
    # CREATE TEMP TABLE acquires a new relation lock each iteration,
    # which causes AcceptInvalidationMessages to be called.
    plpy.execute(f"CREATE TEMP TABLE _rt_{i} (x int)")
    plpy.execute(f"DROP TABLE _rt_{i}")
    time.sleep(0.3)
    yield i
$$ LANGUAGE plpython3u VOLATILE;

SELECT count(*) FROM repro_srf('test');

Session B — while Session A is running (after ~2 seconds):

CREATE OR REPLACE FUNCTION repro_srf(flavour VARCHAR)
RETURNS TABLE (i BIGINT) AS $$
import time
for i in range(100):
    plpy.execute(f"CREATE TEMP TABLE _rt_{i} (x int)")
    plpy.execute(f"DROP TABLE _rt_{i}")
    time.sleep(0.3)
    yield i
$$ LANGUAGE plpython3u VOLATILE;

NOTE: In a minimal test without memory pressure, the freed savedargs memory
is often not overwritten quickly enough to produce a crash: savedargs->nargs
accidentally retains its correct value of 1 and restore_args succeeds. Under
production load (long-running SRF, many Python allocations), the freed region
is overwritten and the crash occurs.

The crash can be triggered deterministically with gdb by setting
savedargs->nargs to a large value immediately after PLy_procedure_delete fires
(see gdb script below). This produces the identical crash stack seen in
production.

GDB CONFIRMATION (PostgreSQL 17.10)
-------------------------------------

The following gdb session was used to confirm the exact sequence:

(gdb) b PLy_procedure_delete
(gdb) commands 1

printf "DELETE proname=%s mcxt=%p\n", proc->proname, proc->mcxt
set $corrupt_next = 1
c
end

(gdb) b PLy_function_restore_args
(gdb) commands 2

if $corrupt_next
set {int}((long)savedargs + 24) = 2056017128
set $corrupt_next = 0
end
c
end

Output:

DELETE proname=repro_srf mcxt=0x5686641e1b20
[PLy_function_restore_args fires with savedargs=0x5686641e28e8]
[nargs set to 2056017128]
Program received signal SIGSEGV, Segmentation fault.
__strlen_avx2 ()

PostgreSQL log:
server process (PID 366) was terminated by signal 11: Segmentation fault
all server processes terminated; reinitializing

AFFECTED CODE
-------------

src/pl/plpython/plpy_exec.c, lines 503-506:
PLy_function_save_args allocates savedargs in proc->mcxt

src/pl/plpython/plpy_exec.c, lines 117-119:
PLy_function_restore_args is called with potentially dangling savedargs
(no check whether proc was rebuilt since savedargs was created)

src/pl/plpython/plpy_procedure.c, line 405 (PLy_procedure_delete):
MemoryContextDelete(proc->mcxt) frees savedargs without nulling
srfstate->savedargs

PROPOSED FIX
------------

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
Change PLy_function_save_args to accept a MemoryContext parameter and pass
funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
references are valid regardless of which MemoryContext holds the struct.

Option B — detect proc rebuild and discard stale savedargs:
After PLy_procedure_get returns a new proc, check whether it differs from the
proc that created srfstate->savedargs. If so, discard savedargs
(PLy_function_drop_args or simply set to NULL) and skip the restore.
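
The two options can be compared in a small plain-Python sketch. It is not the
attached patch or any PostgreSQL API: Context and SavedArgs are hypothetical
stand-ins, and a deleted context is modeled by a flag that makes later reads
raise instead of returning garbage.

```python
class Context:
    """Hypothetical stand-in for a MemoryContext."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def delete(self):
        self.alive = False


class SavedArgs:
    def __init__(self, owner, nargs):
        self.owner = owner      # context the struct was allocated in
        self.nargs = nargs

    def read_nargs(self):
        if not self.owner.alive:
            raise RuntimeError("use-after-free: owner context was deleted")
        return self.nargs


# Option A: allocate savedargs in the SRF-lifetime context instead of proc->mcxt.
proc_mcxt = Context("proc->mcxt")
multi_call_ctx = Context("funcctx->multi_call_memory_ctx")
saved_a = SavedArgs(owner=multi_call_ctx, nargs=1)
proc_mcxt.delete()                      # function replaced mid-iteration
nargs_option_a = saved_a.read_nargs()   # still readable

# Option B: remember which proc built savedargs; drop them if the proc was rebuilt.
old_proc, new_proc = object(), object()
srfstate = {"savedargs": SavedArgs(owner=Context("old proc->mcxt"), nargs=1),
            "built_by": old_proc}
srfstate["savedargs"].owner.delete()    # old proc deleted inside PLy_procedure_get
if srfstate["built_by"] is not new_proc:
    srfstate["savedargs"] = None        # discard instead of restoring
restored = srfstate["savedargs"] is not None
```

In C, an identity check for Option B would need more care than this sketch
suggests, since an allocator may hand out the old address again; that caveat
is an observation about the sketch, not something stated in the report.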

#2 Matheus Alcantara
matheusssilv97@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
Change PLy_function_save_args to accept a MemoryContext parameter and pass
funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
references are valid regardless of which MemoryContext holds the struct.

Option B — detect proc rebuild and discard stale savedargs:
After PLy_procedure_get returns a new proc, check whether it differs from the
proc that created srfstate->savedargs. If so, discard savedargs
(PLy_function_drop_args or simply set to NULL) and skip the restore.

Hi, thank you for the very detailed bug report. I've managed to
reproduce the issue on master.

Option A seems to fix the issue (see attached patch), but while playing with
this I found another issue that I think is related:

CREATE OR REPLACE FUNCTION trigger_stack_overflow(x BIGINT)
RETURNS TABLE(i BIGINT) AS $$
import time
plpy.execute(f"CREATE TEMP TABLE _rt_{x} (x int)")
plpy.execute(f"DROP TABLE _rt_{x}")
time.sleep(0.3)
plpy.execute("SELECT trigger_stack_overflow(1)")
yield x
$$ LANGUAGE plpython3u VOLATILE;

Run SELECT trigger_stack_overflow(1) in one session, execute the
CREATE OR REPLACE above in another session, and wait for the first session
to crash with this stack trace:
frame #3: 0x000000010554a694 postgres`ExceptionalCondition(conditionName="proc->calldepth > 0", fileName="../src/pl/plpython/plpy_exec.c", lineNumber=701) at assert.c:65:2
frame #4: 0x0000000105e41984 plpython3.dylib`PLy_global_args_pop(proc=0x000000014b03cf00) at plpy_exec.c:701:2
frame #5: 0x0000000105e40d94 plpython3.dylib`PLy_exec_function(fcinfo=0x000000011e077738, proc=0x000000014b03cf00) at plpy_exec.c:264:3

The expected output from the first session should be something like
this:

ERROR: 54001: error fetching next item from iterator
DETAIL: spiexceptions.StatementTooComplex: error fetching next item from iterator
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.

This is because when PLy_procedure_delete() is executed from
PLy_procedure_get(), it also destroys information related to recursive
functions, such as "calldepth", "argstack" and "globals". That causes the
assertion failure Assert(proc->calldepth > 0) in PLy_global_args_pop() when
it is executed in the PG_CATCH block of PLy_exec_function(), or
EXC_BAD_ACCESS when accessing "argstack" or "globals".

Although changing the memory context where savedargs is allocated fixes
the reported issue, I think the long-term fix is to preserve the
necessary execution information during PLyProcedure re-creation. I'm
still studying the code to see if and how this can be implemented.
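
The loss of recursion bookkeeping can be sketched in plain Python. This is a
toy model, not the plpython source: the Proc class and its method names are
hypothetical, and destroy() only imitates the effect of deleting the procedure
while an outer call still holds a pointer to it.

```python
class Proc:
    """Toy procedure entry carrying recursion bookkeeping."""

    def __init__(self):
        self.calldepth = 0
        self.argstack = []

    def args_push(self, args):
        self.calldepth += 1
        self.argstack.append(args)

    def args_pop(self):
        if not self.calldepth > 0:
            raise AssertionError("proc->calldepth > 0")
        self.calldepth -= 1
        return self.argstack.pop()

    def destroy(self):
        # Stand-in for the procedure being deleted: the bookkeeping is wiped.
        self.calldepth = 0
        self.argstack = []


proc = Proc()
proc.args_push({"x": 1})        # outer call enters the function
proc.destroy()                  # nested call sees a changed pg_proc row and deletes proc

try:
    proc.args_pop()             # outer call's error cleanup still uses the old pointer
    failure = None
except AssertionError as exc:
    failure = str(exc)
```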

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Attachments:

0001-plpython-Use-correct-memory-context-for-savedargs.patch (text/plain; charset=utf-8, +20 -8)
#3 Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Matheus Alcantara (#2)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On 25/05/26 19:26, Matheus Alcantara wrote:

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
Change PLy_function_save_args to accept a MemoryContext parameter and pass
funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
references are valid regardless of which MemoryContext holds the struct.

Option B — detect proc rebuild and discard stale savedargs:
After PLy_procedure_get returns a new proc, check whether it differs from the
proc that created srfstate->savedargs. If so, discard savedargs
(PLy_function_drop_args or simply set to NULL) and skip the restore.

Hi, thank you for the very detailed bug report. I've managed to
reproduce the issue on master.

Option A seems to fix the issue (see attached patch), but while playing with
this I found another issue that I think is related:

...

This is because when PLy_procedure_delete() is executed from
PLy_procedure_get(), it also destroys information related to recursive
functions, such as "calldepth", "argstack" and "globals". That causes the
assertion failure Assert(proc->calldepth > 0) in PLy_global_args_pop() when
it is executed in the PG_CATCH block of PLy_exec_function(), or
EXC_BAD_ACCESS when accessing "argstack" or "globals".

Although changing the memory context where savedargs is allocated fixes
the reported issue, I think the long-term fix is to preserve the
necessary execution information during PLyProcedure re-creation. I'm
still studying the code to see if and how this can be implemented.

This is proving tricky to debug. I'm not able to reproduce the
issue with asserts disabled, not even with the steps shared in the bug
report.

Andrzej, could you please confirm whether you hit this failure with asserts
enabled? And if so, could you please check whether it also
happens with asserts disabled?

Also, version 17.10 was released some weeks ago; can you also test
against this new minor release?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matheus Alcantara (#2)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A seems to fix the issue (see attached patch), but while playing with
this I found another issue that I think is related:
...
This is because when PLy_procedure_delete() is executed from
PLy_procedure_get(), it also destroys information related to recursive
functions, such as "calldepth", "argstack" and "globals". That causes the
assertion failure Assert(proc->calldepth > 0) in PLy_global_args_pop() when
it is executed in the PG_CATCH block of PLy_exec_function(), or
EXC_BAD_ACCESS when accessing "argstack" or "globals".

Yeah. The bigger picture though is: if we are re-entrantly calling
either a recursive function or a SRF, we should not destroy any of the
existing state, nor do we want to replace the function body. The only
way to have sane behavior is to keep executing the same function body
until the execution instance (recursion level or continued SRF) is
done. So these concerns about associated state are only part of the
problem.

plpgsql ran into this years ago, and its solution has been to maintain
a reference count on each function parsetree and not destroy an
obsoleted parsetree till the reference count goes to zero. I've had
in the back of my head that the other PLs need to do likewise, but it
hasn't gotten to the front of the to-do list, mainly because the other
PLs are much less used and so field complaints about this have been
rare. I had hoped also that the language interpreters underlying the
other PLs might solve some of this for us, but it's unclear to what
extent they help. Certainly it's not cool to be clobbering our own
execution state that's outside the language interpreter.

We might want to go as far as converting the other PLs to use the
utils/cache/funccache.c infrastructure, but perhaps there is a
less invasive fix. Certainly, a fix based on funccache.c could not
be back-patched. (On the other hand, given the rarity of complaints,
perhaps a HEAD-only fix is acceptable.)
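
The reference-counting idea described above can be sketched in plain Python.
This is a toy model, not plpgsql or funccache.c code: the class names and the
freed flag are hypothetical, and "freeing" is only recorded, not performed.

```python
class CachedFunction:
    def __init__(self, body):
        self.body = body
        self.use_count = 0
        self.obsolete = False
        self.freed = False


class FunctionCache:
    """Toy cache: an obsolete entry is freed only once nobody is executing it."""

    def __init__(self, body):
        self.current = CachedFunction(body)

    def acquire(self):
        self.current.use_count += 1
        return self.current

    def release(self, func):
        func.use_count -= 1
        if func.obsolete and func.use_count == 0:
            func.freed = True

    def replace(self, body):
        old = self.current
        old.obsolete = True
        if old.use_count == 0:
            old.freed = True
        self.current = CachedFunction(body)


cache = FunctionCache("v1")
running = cache.acquire()        # an execution instance starts on body v1
cache.replace("v2")              # CREATE OR REPLACE from another session
body_seen_by_running = running.body
freed_while_running = running.freed
body_seen_by_new_call = cache.acquire().body
cache.release(running)           # instance finishes; only now can v1 go away
freed_after_release = running.freed
```

The running instance keeps executing the body it started with, while new
callers pick up the replacement.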

regards, tom lane

#5 Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Tom Lane (#4)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Thu May 28, 2026 at 12:12 PM -03, Tom Lane wrote:

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A seems to fix the issue (see attached patch), but while playing with
this I found another issue that I think is related:
...
This is because when PLy_procedure_delete() is executed from
PLy_procedure_get(), it also destroys information related to recursive
functions, such as "calldepth", "argstack" and "globals". That causes the
assertion failure Assert(proc->calldepth > 0) in PLy_global_args_pop() when
it is executed in the PG_CATCH block of PLy_exec_function(), or
EXC_BAD_ACCESS when accessing "argstack" or "globals".

Yeah. The bigger picture though is: if we are re-entrantly calling
either a recursive function or a SRF, we should not destroy any of the
existing state, nor do we want to replace the function body. The only
way to have sane behavior is to keep executing the same function body
until the execution instance (recursion level or continued SRF) is
done. So these concerns about associated state are only part of the
problem.

plpgsql ran into this years ago, and its solution has been to maintain
a reference count on each function parsetree and not destroy an
obsoleted parsetree till the reference count goes to zero. I've had
in the back of my head that the other PLs need to do likewise, but it
hasn't gotten to the front of the to-do list, mainly because the other
PLs are much less used and so field complaints about this have been
rare. I had hoped also that the language interpreters underlying the
other PLs might solve some of this for us, but it's unclear to what
extent they help. Certainly it's not cool to be clobbering our own
execution state that's outside the language interpreter.

We might want to go as far as converting the other PLs to use the
utils/cache/funccache.c infrastructure, but perhaps there is a
less invasive fix. Certainly, a fix based on funccache.c could not
be back-patched. (On the other hand, given the rarity of complaints,
perhaps a HEAD-only fix is acceptable.)

I've been exploring the funccache.c approach for plpython. The main
challenge is that plpython uses SFRM_ValuePerCall for SRFs, whereas
plpgsql uses SFRM_Materialize. This means plpgsql can simply increment
use_count at the start of plpgsql_call_handler() and decrement it at the
end, since all results are produced in a single call. For plpython,
ExecMakeTableFunctionResult() calls the handler multiple times, with
use_count returning to zero between calls.

With ValuePerCall, cached_function_compile() may try to re-create an
invalid cache entry because use_count can be 0 while
ExecMakeTableFunctionResult() is in the middle of its loop. In that
case, the SRFState would be lost for the currently running plpython
function.
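
A toy model in plain Python shows why a per-call use_count is not enough in
ValuePerCall mode. The Entry and Cache classes are hypothetical, and keeping
the SRF state on the cache entry is a simplification made for this sketch, not
a description of where plpython or funccache.c store it.

```python
class Entry:
    def __init__(self, version):
        self.version = version
        self.use_count = 0
        self.srf_state = None     # iterator position for an in-progress SRF


class Cache:
    def __init__(self):
        self.entry = Entry(version=1)
        self.catalog_version = 1

    def compile(self):
        # Rebuild a stale entry, but only if nobody is using it right now.
        if self.entry.version != self.catalog_version and self.entry.use_count == 0:
            self.entry = Entry(self.catalog_version)
        return self.entry


def handler_call(cache):
    """One ValuePerCall invocation with use_count held only for its duration."""
    entry = cache.compile()
    entry.use_count += 1
    fresh_start = entry.srf_state is None
    if fresh_start:
        entry.srf_state = iter(range(3))
    value = next(entry.srf_state)
    entry.use_count -= 1          # back to zero between calls
    return value, fresh_start


cache = Cache()
first = handler_call(cache)
cache.catalog_version = 2         # function replaced between two calls
second = handler_call(cache)      # entry is rebuilt and the iteration restarts
```

Here the second call starts over at 0 instead of returning 1, because the
entry was rebuilt while use_count was zero.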

I'm still not sure how to proceed here, but it seems like we would need
some refactoring in plpython to make it work with funccache. I'm not sure
whether changing ValuePerCall to Materialize is the way to go, or whether
there's another way to fix this.

I've also tried to fix this without funccache, but it seems like we
would end up implementing something similar anyway. That might be a way
to go, but I'm also not sure if it's the best path.

Thoughts?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matheus Alcantara (#5)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Thu May 28, 2026 at 12:12 PM -03, Tom Lane wrote:

Yeah. The bigger picture though is: if we are re-entrantly calling
either a recursive function or a SRF, we should not destroy any of the
existing state, nor do we want to replace the function body. The only
way to have sane behavior is to keep executing the same function body
until the execution instance (recursion level or continued SRF) is
done. So these concerns about associated state are only part of the
problem.

I've been exploring the funccache.c approach for plpython. The main
challenge is that plpython uses SFRM_ValuePerCall for SRFs, whereas
plpgsql uses SFRM_Materialize. This means plpgsql can simply increment
use_count at the start of plpgsql_call_handler() and decrement it at the
end, since all results are produced in a single call. For plpython,
ExecMakeTableFunctionResult() calls the handler multiple times, with
use_count returning to zero between calls.

Right. I think what we have to do is maintain the increased use_count
across the whole series of SRF executions and decrement it only once
we're done. That implies that we need some out-of-band mechanism for
decrementing the use_count if the query fails to run the SRF to
completion for whatever reason (error, LIMIT, etc). The first tool
I would reach for is a context reset callback attached to the query's
executor context, but there may be a better answer. Whether we do it
like that or some other way, it might be appropriate to put
infrastructure for it into funccache.c instead of expecting every PL
that wants to use SFRM_ValuePerCall to re-invent this wheel.
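
The idea of holding use_count across the whole series of calls, and releasing
it out-of-band, can be sketched in plain Python. This is only an illustration
under stated assumptions: ExecutorContext, register_reset_callback and
start_srf are hypothetical names, not funccache.c or executor APIs.

```python
class ExecutorContext:
    """Toy memory context that runs registered callbacks when it is reset."""

    def __init__(self):
        self.callbacks = []

    def register_reset_callback(self, fn):
        self.callbacks.append(fn)

    def reset(self):
        while self.callbacks:
            self.callbacks.pop()()


class Entry:
    def __init__(self):
        self.use_count = 0


def start_srf(entry, query_ctx, rows):
    entry.use_count += 1                   # held for the whole series of calls

    def release():
        entry.use_count -= 1

    query_ctx.register_reset_callback(release)
    return iter(rows)


entry = Entry()
query_ctx = ExecutorContext()
it = start_srf(entry, query_ctx, range(100))
fetched = [next(it) for _ in range(2)]     # e.g. LIMIT 2: the SRF never runs to completion
count_mid_query = entry.use_count
query_ctx.reset()                          # query ends for whatever reason
count_after_query = entry.use_count
```

Because the release is tied to the context and not to the SRF finishing, the
count drops back to zero even when the iteration is abandoned early.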

I'm still not sure how to proceed here, but it seems like we would need
some refactoring in plpython to make it work with funccache.

plpython will certainly need some work, but I'm entirely amenable to
also changing funccache if it doesn't support this requirement well.
That module is new as of v18, so it doesn't have much claim to have
a stabilized API yet.

I've also tried to fix this without funccache, but it seems like we
would end up implementing something similar anyway.

Yeah, that was my suspicion as well. funccache.c exists because
I realized that SQL-language functions (executor/functions.c) were
going to need logic that plpgsql had had for years.

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

regards, tom lane

#7 Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Tom Lane (#6)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Mon Jun 1, 2026 at 8:26 PM -03, Tom Lane wrote:

Yeah, that was my suspicion as well. funccache.c exists because
I realized that SQL-language functions (executor/functions.c) were
going to need logic that plpgsql had had for years.

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

I didn't look at this before, but this was exactly the right call. SQL
functions handle this by maintaining a per-call-site cache struct
(SQLFunctionCache) in fn_extra that holds both the pointer to the
long-lived hash entry and the execution state. The use_count is
incremented when we first obtain the function and decremented via a
MemoryContextCallback when fn_mcxt is deleted.

I've adapted the same approach for PL/Python. The main changes are:

PLyProcedure now embeds CachedFunction as its first member and is
managed by cached_function_compile(). A new PLyProcedureCache struct
lives in fn_extra and holds the pointer to PLyProcedure plus SRF state.
For cleanup, I use a MemoryContextCallback on fn_mcxt to decrement
use_count, and an ExprContextCallback to clean up Python iterator state
when the SRF is interrupted.

Since fn_extra is now used for PLyProcedureCache, I had to remove the
SRF macros and switch to direct isDone signaling via ReturnSetInfo,
which is how SQL functions do it anyway.
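
The shape of this design can be sketched in plain Python. It is a toy model and
not the attached patch: CallSiteCache is a hypothetical stand-in for the struct
kept in fn_extra, and the IsDone enum loosely mirrors the ExprMultipleResult
and ExprEndResult values that a ValuePerCall function reports through
ReturnSetInfo.

```python
from enum import Enum


class IsDone(Enum):
    MULTIPLE_RESULT = 1   # more rows may follow
    END_RESULT = 2        # the set is exhausted


class CallSiteCache:
    """Toy per-call-site struct: pinned function body plus SRF state."""

    def __init__(self, proc_body):
        self.proc_body = proc_body   # function body pinned for this call site
        self.iterator = None         # SRF state kept between calls


def call_handler(fn_extra):
    if fn_extra.iterator is None:
        fn_extra.iterator = fn_extra.proc_body()   # first call: start the generator
    try:
        return next(fn_extra.iterator), IsDone.MULTIPLE_RESULT
    except StopIteration:
        fn_extra.iterator = None                   # clean up the iterator state
        return None, IsDone.END_RESULT


def body():
    yield from range(3)


fn_extra = CallSiteCache(body)
rows = []
while True:                                        # the caller's per-row loop
    value, is_done = call_handler(fn_extra)
    if is_done is IsDone.END_RESULT:
        break
    rows.append(value)
```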

I also fixed the validator to create a fake fcinfo with the correct
fn_oid (the function being validated), matching what PL/pgSQL does.

Patch attached.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Attachments:

v1-0001-plpython-Use-funccache.c-infrastructure-for-proce.patch (text/plain; charset=utf-8, +305 -246)
#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matheus Alcantara (#7)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Mon Jun 1, 2026 at 8:26 PM -03, Tom Lane wrote:

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

I didn't look at this before, but this was exactly the right call. SQL
functions handle this by maintaining a per-call-site cache struct
(SQLFunctionCache) in fn_extra that holds both the pointer to the
long-lived hash entry and the execution state. The use_count is
incremented when we first obtain the function and decremented via a
MemoryContextCallback when fn_mcxt is deleted.

I've adapted the same approach for PL/Python.

I've not read this patch yet but your high-level description seems
on-target.

Assuming the patch withstands review, there are three ways we could
proceed:

1. Hold it for v20.

2. Sneak it into v19.

3. Treat it as a back-patchable fix and put it into v18 as well.
(Going further back than v18 seems unreasonable because funccache.c
doesn't exist before that, so we'd have to back-patch it too.)

I do not think that #3 is really a great idea, mainly because the
failure case doesn't seem very likely to be hit in production,
and the lack of previous reports about this very ancient bug
bears that out.

I do find some attraction in #2, mainly because it would get the fix
into the field a year earlier than #1. But considering we're past
beta1 it may be too late for #2 to be reasonable either.

Looping in the RMT to see what they think...

regards, tom lane

#9 Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Tom Lane (#8)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On 05/06/26 16:11, Tom Lane wrote:

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Mon Jun 1, 2026 at 8:26 PM -03, Tom Lane wrote:

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

I didn't look at this before, but this was exactly the right call. SQL
functions handle this by maintaining a per-call-site cache struct
(SQLFunctionCache) in fn_extra that holds both the pointer to the
long-lived hash entry and the execution state. The use_count is
incremented when we first obtain the function and decremented via a
MemoryContextCallback when fn_mcxt is deleted.

I've adapted the same approach for PL/Python.

I've not read this patch yet but your high-level description seems
on-target.

Assuming the patch withstands review, there are three ways we could
proceed:

1. Hold it for v20.

2. Sneak it into v19.

3. Treat it as a back-patchable fix and put it into v18 as well.
(Going further back than v18 seems unreasonable because funccache.c
doesn't exist before that, so we'd have to back-patch it too.)

I do not think that #3 is really a great idea, mainly because the
failure case doesn't seem very likely to be hit in production,
and the lack of previous reports about this very ancient bug
bears that out.

I do find some attraction in #2, mainly because it would get the fix
into the field a year earlier than #1. But considering we're past
beta1 it may be too late for #2 to be reasonable either.

Yeah, #2 sounds like the better option to me too; otherwise we can go with
#1. Back-patching this seems complicated, so I agree #3 does not seem like
a good idea.

Looping in the RMT to see what they think...

Ok

--
Matheus Alcantara
EDB: https://www.enterprisedb.com