BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

Started by PG Bug reporting form | 2 months ago | 17 messages | bugs

noreply@postgresql.org

2 months ago

The following bug has been logged on the website:

Bug reference: 19480
Logged by: Andrzej Doros
Email address: adoros@starfishstorage.com
PostgreSQL version: 17.9
Operating system: Ubuntu 22.04.5 LTS (x86_64), kernel 5.15, glibc 2.
Description:

PostgreSQL version: 17.9 (production crash), confirmed identical on 17.10
OS: Ubuntu 22.04.5 LTS, x86_64, kernel 5.15, glibc 2.35
Package: postgresql-plpython3-17 from pgdg apt repository

DESCRIPTION
-----------

A PL/Python set-returning function (SRF) crashes the backend with SIGSEGV when
another session executes CREATE OR REPLACE FUNCTION (or ALTER FUNCTION) on the
same function while the SRF is mid-iteration.

This is a use-after-free. srfstate->savedargs is allocated inside proc->mcxt by
PLy_function_save_args() (plpy_exec.c:503). On each per-call SRF invocation,
plpython3_call_handler calls PLy_procedure_get(), which may call
PLy_procedure_delete(old_proc) -> MemoryContextDelete(old_proc->mcxt) if the
function's pg_proc row has changed (different xmin or ctid). After that,
srfstate->savedargs is a dangling pointer; it is not cleared. The next
PLy_function_restore_args() reads freed memory:

    if (srfstate->savedargs)    /* non-NULL dangling pointer */
        PLy_function_restore_args(proc, srfstate->savedargs);    /* reads freed mem */

Inside PLy_function_restore_args (plpy_exec.c:551):

    for (i = 0; i < savedargs->nargs; i++)    /* nargs from freed memory */
    {
        if (proc->argnames[i] && ...)
            PyDict_SetItemString(..., proc->argnames[i], ...);

When savedargs->nargs is garbage (e.g. 2056017128 in two production core dumps),
proc->argnames[i] for large i reads an invalid pointer, which is passed to
PyDict_SetItemString -> PyUnicode_FromString -> strlen -> SIGSEGV.

CRASH STACK (two identical core dumps from production, PG 17.9, Ubuntu 22.04)
------------------------------------------------------------------------------

#0 __strlen_evex()
#1 PyUnicode_FromString(u=0x69ffff0000)
#2 PyDict_SetItemString(...)
#3 PLy_function_restore_args(proc=..., savedargs=...)
#4 PLy_exec_function(...)
#5 plpython3_call_handler(...)
#6 fmgr_security_definer(...)
#7 ExecMakeTableFunctionResult(...)

State from the newer core dump:

proc->proname = "tags_report_plpython"
proc->nargs = 1
proc->argnames[0]= "flavour"
savedargs->nargs = 2056017128 <- should be 1; contains garbage
savedargs->namedargs[0] = 'tags' <- still valid (not yet overwritten)
i = 4 <- loop has iterated far past argnames[]

TRIGGER CONDITION
-----------------

The pg_proc invalidation reaches Session A's backend when
AcceptInvalidationMessages() is called. This happens when Session A's Python
code calls plpy.execute() with a statement that acquires a NEW relation lock
(e.g. CREATE TEMP TABLE, any table not previously locked in this statement).
Simply calling plpy.execute("SELECT 1") is not sufficient because the lock on
pg_proc is already held and subsequent requests are served from the per-process
lock table without invoking AcceptInvalidationMessages.

In production the trigger is autovacuum on pg_proc (which moves the tuple's
ctid) or any concurrent DDL from another session. Long-running SRFs (hours)
are much more likely to hit this window.

STEPS TO REPRODUCE
------------------

Requires two concurrent sessions and PostgreSQL with plpython3u.

Session A — start and leave running:

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION repro_srf(flavour VARCHAR)
RETURNS TABLE (i BIGINT) AS $$
import time
for i in range(100):
    # CREATE TEMP TABLE acquires a new relation lock each iteration,
    # which causes AcceptInvalidationMessages to be called.
    plpy.execute(f"CREATE TEMP TABLE _rt_{i} (x int)")
    plpy.execute(f"DROP TABLE _rt_{i}")
    time.sleep(0.3)
    yield i
$$ LANGUAGE plpython3u VOLATILE;

SELECT count(*) FROM repro_srf('test');

Session B — while Session A is running (after ~2 seconds):

CREATE OR REPLACE FUNCTION repro_srf(flavour VARCHAR)
RETURNS TABLE (i BIGINT) AS $$
import time
for i in range(100):
    plpy.execute(f"CREATE TEMP TABLE _rt_{i} (x int)")
    plpy.execute(f"DROP TABLE _rt_{i}")
    time.sleep(0.3)
    yield i
$$ LANGUAGE plpython3u VOLATILE;

NOTE: In a minimal test without memory pressure, the freed savedargs memory
is often not overwritten quickly enough to produce a crash:
savedargs->nargs accidentally retains its correct value of 1 and
restore_args succeeds. Under production load (long-running SRF, many Python
allocations), the freed region is overwritten and the crash occurs.

The crash can be triggered deterministically with gdb by setting
savedargs->nargs to a large value immediately after PLy_procedure_delete
fires (see gdb script below). This produces the identical crash stack seen in
production.

GDB CONFIRMATION (PostgreSQL 17.10)
-------------------------------------

The following gdb session was used to confirm the exact sequence:

(gdb) b PLy_procedure_delete
(gdb) commands 1
  printf "DELETE proname=%s mcxt=%p\n", proc->proname, proc->mcxt
  set $corrupt_next = 1
  c
end

(gdb) b PLy_function_restore_args
(gdb) commands 2
  if $corrupt_next
    set {int}((long)savedargs + 24) = 2056017128
    set $corrupt_next = 0
  end
  c
end

Output:

DELETE proname=repro_srf mcxt=0x5686641e1b20
[PLy_function_restore_args fires with savedargs=0x5686641e28e8]
[nargs set to 2056017128]
Program received signal SIGSEGV, Segmentation fault.
__strlen_avx2 ()

PostgreSQL log:
server process (PID 366) was terminated by signal 11: Segmentation fault
all server processes terminated; reinitializing

AFFECTED CODE
-------------

src/pl/plpython/plpy_exec.c, lines 503-506:
PLy_function_save_args allocates savedargs in proc->mcxt

src/pl/plpython/plpy_exec.c, lines 117-119:
PLy_function_restore_args is called with potentially dangling savedargs
(no check whether proc was rebuilt since savedargs was created)

src/pl/plpython/plpy_procedure.c, line 405 (PLy_procedure_delete):
MemoryContextDelete(proc->mcxt) frees savedargs without nulling
srfstate->savedargs

PROPOSED FIX
------------

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
Change PLy_function_save_args to accept a MemoryContext parameter and pass
funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
references are valid regardless of which MemoryContext holds the struct.

Option B — detect proc rebuild and discard stale savedargs:
After PLy_procedure_get returns a new proc, check whether it differs from the
proc that created srfstate->savedargs. If so, discard savedargs
(PLy_function_drop_args or simply set to NULL) and skip the restore.
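
Editorial sketch (not part of the report): a toy Python model of the lifetime
problem and of Option A. Arena, Handle, and run_srf are hypothetical stand-ins
for PostgreSQL's MemoryContext machinery; only the ownership relationships
follow the description above, and a checked exception stands in for what is
really an unchecked read of freed memory.

```python
class Arena:
    """Toy stand-in for a MemoryContext: owns objects until deleted."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def alloc(self, value):
        return Handle(self, value)

    def delete(self):
        # Models MemoryContextDelete(): everything allocated here is gone.
        self.alive = False


class Handle:
    """A pointer into an Arena; reading it after delete() is a use-after-free."""

    def __init__(self, arena, value):
        self._arena = arena
        self._value = value

    def read(self):
        if not self._arena.alive:
            raise RuntimeError(f"use-after-free: {self._arena.name} was deleted")
        return self._value


def run_srf(savedargs_in_multi_call_ctx):
    """Save args on the first SRF call, replace the function, then restore."""
    proc_mcxt = Arena("proc->mcxt")
    multi_call_ctx = Arena("multi_call_memory_ctx")

    # Option A changes only which context owns the saved-args struct.
    owner = multi_call_ctx if savedargs_in_multi_call_ctx else proc_mcxt
    savedargs = owner.alloc({"nargs": 1, "namedargs": ["tags"]})

    # Another session runs CREATE OR REPLACE; the next per-call invocation
    # rebuilds the procedure and deletes the old procedure's context.
    proc_mcxt.delete()

    return savedargs.read()["nargs"]
```

In this model run_srf(False) raises at the point where the real code reads
garbage, while run_srf(True) still sees nargs == 1 because the struct outlives
the rebuilt procedure.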

Matheus Alcantara

matheusssilv97@gmail.com

2 months ago

In reply to: PG Bug reporting form (#1)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
Change PLy_function_save_args to accept a MemoryContext parameter and pass
funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
references are valid regardless of which MemoryContext holds the struct.

Option B — detect proc rebuild and discard stale savedargs:
After PLy_procedure_get returns a new proc, check whether it differs from the
proc that created srfstate->savedargs. If so, discard savedargs
(PLy_function_drop_args or simply set to NULL) and skip the restore.

Hi, thank you for the very detailed bug report. I've managed to
reproduce the issue on master.

Option A seems to fix the issue (see attached patch), but while playing
with this I found another issue that I think is related:

CREATE OR REPLACE FUNCTION trigger_stack_overflow(x BIGINT)
RETURNS TABLE(i BIGINT) AS $$
import time
plpy.execute(f"CREATE TEMP TABLE _rt_{x} (x int)")
plpy.execute(f"DROP TABLE _rt_{x}")
time.sleep(0.3)
plpy.execute("SELECT trigger_stack_overflow(1)")
yield x
$$ LANGUAGE plpython3u VOLATILE;

Run SELECT trigger_stack_overflow(1), then execute the CREATE OR REPLACE
in another session and wait for the first session to crash with this
stack trace:
frame #3: 0x000000010554a694 postgres`ExceptionalCondition(conditionName="proc->calldepth > 0", fileName="../src/pl/plpython/plpy_exec.c", lineNumber=701) at assert.c:65:2
frame #4: 0x0000000105e41984 plpython3.dylib`PLy_global_args_pop(proc=0x000000014b03cf00) at plpy_exec.c:701:2
frame #5: 0x0000000105e40d94 plpython3.dylib`PLy_exec_function(fcinfo=0x000000011e077738, proc=0x000000014b03cf00) at plpy_exec.c:264:3

The expected output from the first session should be something like
this:

ERROR: 54001: error fetching next item from iterator
DETAIL: spiexceptions.StatementTooComplex: error fetching next item from iterator
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.

This happens because when PLy_procedure_get() calls PLy_procedure_delete(),
it also destroys the state that recursive calls depend on, such as
"calldepth", "argstack" and "globals". That causes either the
Assert(proc->calldepth > 0) failure in PLy_global_args_pop(), when it runs
in the PG_CATCH block of PLy_exec_function(), or an EXC_BAD_ACCESS when
"argstack" or "globals" is accessed.

Although changing the memory context where savedargs is allocated fixes
the reported issue, I think the long-term fix is to preserve the
necessary execution information when a PLyProcedure is re-created. I'm
still studying the code to see if and how this can be implemented.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Matheus Alcantara

matheusssilv97@gmail.com

about 2 months ago

In reply to: Matheus Alcantara (#2)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On 25/05/26 19:26, Matheus Alcantara wrote:

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
Change PLy_function_save_args to accept a MemoryContext parameter and pass
funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
references are valid regardless of which MemoryContext holds the struct.

Option B — detect proc rebuild and discard stale savedargs:
After PLy_procedure_get returns a new proc, check whether it differs from the
proc that created srfstate->savedargs. If so, discard savedargs
(PLy_function_drop_args or simply set to NULL) and skip the restore.

Hi, thank you for the very detailed bug report. I've managed to
reproduce the issue on master.

Option A seems to fix the issue (see attached patch), but while playing
with this I found another issue that I think is related:

...

This happens because when PLy_procedure_get() calls PLy_procedure_delete(),
it also destroys the state that recursive calls depend on, such as
"calldepth", "argstack" and "globals". That causes either the
Assert(proc->calldepth > 0) failure in PLy_global_args_pop(), when it runs
in the PG_CATCH block of PLy_exec_function(), or an EXC_BAD_ACCESS when
"argstack" or "globals" is accessed.

Although changing the memory context where savedargs is allocated fixes
the reported issue, I think the long-term fix is to preserve the
necessary execution information when a PLyProcedure is re-created. I'm
still studying the code to see if and how this can be implemented.

This is proving tricky to debug. I have not been able to reproduce the
issue with assertions disabled, not even with the steps shared in the bug
report.

Andrzej, could you please confirm whether you hit this failure with
assertions enabled? And if they are enabled, could you please check whether
it also happens with assertions disabled?

Also, version 17.10 was released some weeks ago; can you also test
against this new minor release?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

about 2 months ago

In reply to: Matheus Alcantara (#2)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A seems to fix the issue (see attached patch), but while playing
with this I found another issue that I think is related:
...
This happens because when PLy_procedure_get() calls PLy_procedure_delete(),
it also destroys the state that recursive calls depend on, such as
"calldepth", "argstack" and "globals". That causes either the
Assert(proc->calldepth > 0) failure in PLy_global_args_pop(), when it runs
in the PG_CATCH block of PLy_exec_function(), or an EXC_BAD_ACCESS when
"argstack" or "globals" is accessed.

Yeah. The bigger picture though is: if we are re-entrantly calling
either a recursive function or a SRF, we should not destroy any of the
existing state, nor do we want to replace the function body. The only
way to have sane behavior is to keep executing the same function body
until the execution instance (recursion level or continued SRF) is
done. So these concerns about associated state are only part of the
problem.

plpgsql ran into this years ago, and its solution has been to maintain
a reference count on each function parsetree and not destroy an
obsoleted parsetree till the reference count goes to zero. I've had
in the back of my head that the other PLs need to do likewise, but it
hasn't gotten to the front of the to-do list, mainly because the other
PLs are much less used and so field complaints about this have been
rare. I had hoped also that the language interpreters underlying the
other PLs might solve some of this for us, but it's unclear to what
extent they help. Certainly it's not cool to be clobbering our own
execution state that's outside the language interpreter.
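
Editorial sketch (not part of the email): the reference-counting scheme
described above, reduced to a toy Python model. The class and method names are
hypothetical, and plpgsql's real implementation differs in detail; the point is
only that an obsoleted entry is destroyed when its count reaches zero, not when
the replacement is installed.

```python
class CompiledFunction:
    """Toy model of a cached, compiled function body."""

    def __init__(self, body):
        self.body = body
        self.use_count = 0      # number of active executions
        self.obsolete = False   # set once a newer definition replaces this one
        self.destroyed = False


class FunctionCache:
    def __init__(self):
        self.current = None

    def compile(self, body):
        """Install a new body; free the old one only if nobody is running it."""
        old = self.current
        self.current = CompiledFunction(body)
        if old is not None:
            old.obsolete = True
            self._maybe_destroy(old)
        return self.current

    def acquire(self):
        """Start an execution of the current body and pin it."""
        self.current.use_count += 1
        return self.current

    def release(self, func):
        """Finish an execution; this may be the last user of an old body."""
        assert func.use_count > 0
        func.use_count -= 1
        self._maybe_destroy(func)

    @staticmethod
    def _maybe_destroy(func):
        # Destruction is deferred until the last running instance finishes.
        if func.obsolete and func.use_count == 0:
            func.destroyed = True
```

An execution that acquired "v1" keeps running "v1" after compile("v2"); new
callers get "v2", and "v1" is destroyed only by the final release().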

We might want to go as far as converting the other PLs to use the
utils/cache/funccache.c infrastructure, but perhaps there is a
less invasive fix. Certainly, a fix based on funccache.c could not
be back-patched. (On the other hand, given the rarity of complaints,
perhaps a HEAD-only fix is acceptable.)

regards, tom lane

Matheus Alcantara

matheusssilv97@gmail.com

about 2 months ago

In reply to: Tom Lane (#4)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Thu May 28, 2026 at 12:12 PM -03, Tom Lane wrote:

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A seems to fix the issue (see attached patch), but while playing
with this I found another issue that I think is related:
...
This happens because when PLy_procedure_get() calls PLy_procedure_delete(),
it also destroys the state that recursive calls depend on, such as
"calldepth", "argstack" and "globals". That causes either the
Assert(proc->calldepth > 0) failure in PLy_global_args_pop(), when it runs
in the PG_CATCH block of PLy_exec_function(), or an EXC_BAD_ACCESS when
"argstack" or "globals" is accessed.

Yeah. The bigger picture though is: if we are re-entrantly calling
either a recursive function or a SRF, we should not destroy any of the
existing state, nor do we want to replace the function body. The only
way to have sane behavior is to keep executing the same function body
until the execution instance (recursion level or continued SRF) is
done. So these concerns about associated state are only part of the
problem.

plpgsql ran into this years ago, and its solution has been to maintain
a reference count on each function parsetree and not destroy an
obsoleted parsetree till the reference count goes to zero. I've had
in the back of my head that the other PLs need to do likewise, but it
hasn't gotten to the front of the to-do list, mainly because the other
PLs are much less used and so field complaints about this have been
rare. I had hoped also that the language interpreters underlying the
other PLs might solve some of this for us, but it's unclear to what
extent they help. Certainly it's not cool to be clobbering our own
execution state that's outside the language interpreter.

We might want to go as far as converting the other PLs to use the
utils/cache/funccache.c infrastructure, but perhaps there is a
less invasive fix. Certainly, a fix based on funccache.c could not
be back-patched. (On the other hand, given the rarity of complaints,
perhaps a HEAD-only fix is acceptable.)

I've been exploring the funccache.c approach for plpython. The main
challenge is that plpython uses SFRM_ValuePerCall for SRFs, whereas
plpgsql uses SFRM_Materialize. This means plpgsql can simply increment
use_count at the start of plpgsql_call_handler() and decrement it at the
end, since all results are produced in a single call. For plpython,
ExecMakeTableFunctionResult() calls the handler multiple times, with
use_count returning to zero between calls.

With ValuePerCall, cached_function_compile() may try to re-create an
invalid cache entry because use_count can be 0 while
ExecMakeTableFunctionResult() is in the middle of its loop. In that
case, the SRFState would be lost for the currently running plpython
function.
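
Editorial sketch (not part of the email): a toy Python model of why the
per-call increment/decrement pattern protects a materialize-mode handler but
not a value-per-call one. All names are hypothetical; the two functions only
mimic the calling patterns described above.

```python
class CachedFunc:
    def __init__(self):
        self.use_count = 0


def materialize_handler(func, rows, observe):
    """All rows are produced inside a single handler call."""
    func.use_count += 1
    try:
        result = []
        for row in rows:
            observe(func.use_count)   # the function is pinned throughout
            result.append(row)
        return result
    finally:
        func.use_count -= 1


def value_per_call_handler(func, iterator):
    """One row per handler call; the pin is dropped on every return."""
    func.use_count += 1
    try:
        return next(iterator, None)   # None marks the end of the set
    finally:
        func.use_count -= 1
```

Between two value_per_call_handler() calls use_count is back at zero, so in
this model nothing stops a cache from rebuilding the entry mid-series.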

I'm still not sure how to proceed here, but it seems like we would need
some refactoring in plpython to make it work with funccache. I'm not sure
whether changing ValuePerCall to Materialize is the way to go, or whether
there's another way to fix this.

I've also tried to fix this without funccache, but it seems like we
would end up implementing something similar anyway. That might be a way
to go, but I'm also not sure if it's the best path.

Thoughts?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

about 2 months ago

In reply to: Matheus Alcantara (#5)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Thu May 28, 2026 at 12:12 PM -03, Tom Lane wrote:

Yeah. The bigger picture though is: if we are re-entrantly calling
either a recursive function or a SRF, we should not destroy any of the
existing state, nor do we want to replace the function body. The only
way to have sane behavior is to keep executing the same function body
until the execution instance (recursion level or continued SRF) is
done. So these concerns about associated state are only part of the
problem.

I've been exploring the funccache.c approach for plpython. The main
challenge is that plpython uses SFRM_ValuePerCall for SRFs, whereas
plpgsql uses SFRM_Materialize. This means plpgsql can simply increment
use_count at the start of plpgsql_call_handler() and decrement it at the
end, since all results are produced in a single call. For plpython,
ExecMakeTableFunctionResult() calls the handler multiple times, with
use_count returning to zero between calls.

Right. I think what we have to do is maintain the increased use_count
across the whole series of SRF executions and decrement it only once
we're done. That implies that we need some out-of-band mechanism for
decrementing the use_count if the query fails to run the SRF to
completion for whatever reason (error, LIMIT, etc). The first tool
I would reach for is a context reset callback attached to the query's
executor context, but there may be a better answer. Whether we do it
like that or some other way, it might be appropriate to put
infrastructure for it into funccache.c instead of expecting every PL
that wants to use SFRM_ValuePerCall to re-invent this wheel.
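
Editorial sketch (not part of the email): one way to picture holding use_count
across a whole value-per-call series, with a cleanup callback covering early
termination. Context, CachedFunc, and CallSite are hypothetical Python
stand-ins, not funccache.c's API.

```python
class Context:
    """Toy memory context that runs registered callbacks when deleted."""

    def __init__(self):
        self._callbacks = []

    def register_reset_callback(self, fn):
        self._callbacks.append(fn)

    def delete(self):
        # Run each callback exactly once, even if delete() is called again.
        callbacks, self._callbacks = self._callbacks, []
        for fn in callbacks:
            fn()


class CachedFunc:
    def __init__(self, values):
        self.values = values
        self.use_count = 0


class CallSite:
    """Per-call-site state kept for the duration of one SRF series."""

    def __init__(self, func, query_ctx):
        self.func = func
        self.query_ctx = query_ctx
        self.iterator = None

    def call(self):
        """One value-per-call invocation; returns None when exhausted."""
        if self.iterator is None:
            # First call: pin the function for the whole series, and arrange
            # for the pin to be dropped even if the series is abandoned.
            self.func.use_count += 1
            self.query_ctx.register_reset_callback(self._unpin)
            self.iterator = iter(self.func.values)
        return next(self.iterator, None)

    def _unpin(self):
        self.func.use_count -= 1
```

If the query stops after two rows (as with LIMIT), nothing calls the handler
again; deleting the context is what releases the pin in this model.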

I'm still not sure how to proceed here, but it seems like we would need
some refactoring in plpython to make it work with funccache.

plpython will certainly need some work, but I'm entirely amenable to
also changing funccache if it doesn't support this requirement well.
That module is new as of v18, so it doesn't have much claim to have
a stabilized API yet.

I've also tried to fix this without funccache, but it seems like we
would end up implementing something similar anyway.

Yeah, that was my suspicion as well. funccache.c exists because
I realized that SQL-language functions (executor/functions.c) were
going to need logic that plpgsql had had for years.

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

regards, tom lane

Matheus Alcantara

matheusssilv97@gmail.com

about 2 months ago

In reply to: Tom Lane (#6)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Mon Jun 1, 2026 at 8:26 PM -03, Tom Lane wrote:

Yeah, that was my suspicion as well. funccache.c exists because
I realized that SQL-language functions (executor/functions.c) were
going to need logic that plpgsql had had for years.

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

I didn't look at this before, but this was exactly the right call. SQL
functions handle this by maintaining a per-call-site cache struct
(SQLFunctionCache) in fn_extra that holds both the pointer to the
long-lived hash entry and the execution state. The use_count is
incremented when we first obtain the function and decremented via a
MemoryContextCallback when fn_mcxt is deleted.

I've adapted the same approach for PL/Python. The main changes are:

PLyProcedure now embeds CachedFunction as its first member and is
managed by cached_function_compile(). A new PLyProcedureCache struct
lives in fn_extra and holds the pointer to PLyProcedure plus SRF state.
For cleanup, I use a MemoryContextCallback on fn_mcxt to decrement
use_count, and an ExprContextCallback to clean up Python iterator state
when the SRF is interrupted.

Since fn_extra is now used for PLyProcedureCache, I had to remove the
SRF macros and switch to direct isDone signaling via ReturnSetInfo,
which is how SQL functions do it anyway.

I also fixed the validator to create a fake fcinfo with the correct
fn_oid (the function being validated), matching what PL/pgSQL does.

Patch attached.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

about 2 months ago

In reply to: Matheus Alcantara (#7)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Mon Jun 1, 2026 at 8:26 PM -03, Tom Lane wrote:

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

I didn't look at this before, but this was exactly the right call. SQL
functions handle this by maintaining a per-call-site cache struct
(SQLFunctionCache) in fn_extra that holds both the pointer to the
long-lived hash entry and the execution state. The use_count is
incremented when we first obtain the function and decremented via a
MemoryContextCallback when fn_mcxt is deleted.

I've adapted the same approach for PL/Python.

I've not read this patch yet but your high-level description seems
on-target.

Assuming the patch withstands review, there are three ways we could
proceed:

1. Hold it for v20.

2. Sneak it into v19.

3. Treat it as a back-patchable fix and put it into v18 as well.
(Going further back than v18 seems unreasonable because funccache.c
doesn't exist before that, so we'd have to back-patch it too.)

I do not think that #3 is really a great idea, mainly because the
failure case doesn't seem very likely to be hit in production,
and the lack of previous reports about this very ancient bug
bears that out.

I do find some attraction in #2, mainly because it would get the fix
into the field a year earlier than #1. But considering we're past
beta1 it may be too late for #2 to be reasonable either.

Looping in the RMT to see what they think...

regards, tom lane

Matheus Alcantara

matheusssilv97@gmail.com

about 2 months ago

In reply to: Tom Lane (#8)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On 05/06/26 16:11, Tom Lane wrote:

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Mon Jun 1, 2026 at 8:26 PM -03, Tom Lane wrote:

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

I didn't look at this before, but this was exactly the right call. SQL
functions handle this by maintaining a per-call-site cache struct
(SQLFunctionCache) in fn_extra that holds both the pointer to the
long-lived hash entry and the execution state. The use_count is
incremented when we first obtain the function and decremented via a
MemoryContextCallback when fn_mcxt is deleted.

I've adapted the same approach for PL/Python.

I've not read this patch yet but your high-level description seems
on-target.

Assuming the patch withstands review, there are three ways we could
proceed:

1. Hold it for v20.

2. Sneak it into v19.

3. Treat it as a back-patchable fix and put it into v18 as well.
(Going further back than v18 seems unreasonable because funccache.c
doesn't exist before that, so we'd have to back-patch it too.)

I do not think that #3 is really a great idea, mainly because the
failure case doesn't seem very likely to be hit in production,
and the lack of previous reports about this very ancient bug
bears that out.

I do find some attraction in #2, mainly because it would get the fix
into the field a year earlier than #1. But considering we're past
beta1 it may be too late for #2 to be reasonable either.

Yeah, this sounds like the better option to me too; otherwise we can go with
#1. Back-patching this seems complicated, so I agree #3 does not seem like
a good idea.

Looping in the RMT to see what they think...

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

#10

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 1 month ago

In reply to: Matheus Alcantara (#9)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On 05/06/2026 22:35, Matheus Alcantara wrote:

On 05/06/26 16:11, Tom Lane wrote:

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

On Mon Jun 1, 2026 at 8:26 PM -03, Tom Lane wrote:

Actually ... if memory serves, SQL-language functions use ValuePerCall
mode, so there probably already is a solution to this embedded in
functions.c. Did you look at that?

I didn't look at this before, but this was exactly the right call. SQL
functions handle this by maintaining a per-call-site cache struct
(SQLFunctionCache) in fn_extra that holds both the pointer to the
long-lived hash entry and the execution state. The use_count is
incremented when we first obtain the function and decremented via a
MemoryContextCallback when fn_mcxt is deleted.

I've adapted the same approach for PL/Python.

I've not read this patch yet but your high-level description seems
on-target.

Assuming the patch withstands review, there are three ways we could
proceed:

1. Hold it for v20.

2. Sneak it into v19.

3. Treat it as a back-patchable fix and put it into v18 as well.
(Going further back than v18 seems unreasonable because funccache.c
doesn't exist before that, so we'd have to back-patch it too.)

I do not think that #3 is really a great idea, mainly because the
failure case doesn't seem very likely to be hit in production,
and the lack of previous reports about this very ancient bug
bears that out.

I do find some attraction in #2, mainly because it would get the fix
into the field a year earlier than #1. But considering we're past
beta1 it may be too late for #2 to be reasonable either.

Yeah, this sounds like the better option to me too; otherwise we can go with
#1. Back-patching this seems complicated, so I agree #3 does not seem like
a good idea.

Looping in the RMT to see what they think...

It's fine to still sneak it into v19. It's better to have it earlier,
even if it means more churn during beta period.

I haven't looked closely at the patch, but since it's a bug fix it would
make sense to backpatch. If we're uncomfortable with backpatching it
now, we could commit in master now, and backpatch later when we have
more confidence.

- Heikki

#11

Nathan Bossart

nathandbossart@gmail.com

about 1 month ago

In reply to: Heikki Linnakangas (#10)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Wed, Jun 17, 2026 at 06:30:30PM +0300, Heikki Linnakangas wrote:

It's fine to still sneak it into v19. It's better to have it earlier, even
if it means more churn during beta period.

I haven't looked closely at the patch, but since it's a bug fix it would
make sense to backpatch. If we're uncomfortable with backpatching it now, we
could commit in master now, and backpatch later when we have more
confidence.

--
nathan

#12

Melanie Plageman

melanieplageman@gmail.com

about 1 month ago

In reply to: Heikki Linnakangas (#10)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Wed, Jun 17, 2026 at 11:30 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

It's fine to still sneak it into v19. It's better to have it earlier,
even if it means more churn during beta period.

I haven't looked closely at the patch, but since it's a bug fix it would
make sense to backpatch. If we're uncomfortable with backpatching it
now, we could commit in master now, and backpatch later when we have
more confidence.

Agreed.

- Melanie

#13

Tom Lane

tgl@sss.pgh.pa.us

about 1 month ago

In reply to: Heikki Linnakangas (#10)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

Heikki Linnakangas <hlinnaka@iki.fi> writes:

Looping in the RMT to see what they think...

It's fine to still sneak it into v19. It's better to have it earlier,
even if it means more churn during beta period.

OK. I haven't looked closely at the patch yet, but will proceed with
reviewing it.

I haven't looked closely at the patch, but since it's a bug fix it would
make sense to backpatch. If we're uncomfortable with backpatching it
now, we could commit in master now, and backpatch later when we have
more confidence.

I'm of the opinion that the risk-reward ratio is not great for putting
this into stable branches. The case that fails is just not something
I'd expect people to do a lot in production. So I'm content with
sneaking it into v19.

regards, tom lane

#14

Tom Lane

tgl@sss.pgh.pa.us

about 1 month ago

In reply to: Matheus Alcantara (#7)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

Patch attached.

I had been planning to wait for v20 development to open, but with
RMT approval the target is now v19 instead, so I'd like to get
this done before the end of June. I looked through the patch
and found a couple of issues immediately:

* Your refactoring to have just one PLy_procedure_get call in
plpython3_call_handler is no good. You missed the comment
block just above:

    /*
     * Push execution context onto stack. It is important that this get
     * popped again, so avoid putting anything that could throw error between
     * here and the PG_TRY.
     */
    exec_ctx = PLy_push_execution_context(!nonatomic);

+   proc = PLy_procedure_get(fcinfo, false);
+
    PG_TRY();
    {

I counsel putting those PLy_procedure_get calls back where they were.

* I also question the decision to refactor where/how is_trigger is
computed; that doesn't seem necessary to the purposes of the patch,
nor is it a clear improvement. I'd just as soon leave that
mechanism alone as much as we can. If there is an improvement to
be had, let's address that separately. (Alternative thought:
should we rely on the isTrigger/isEventTrigger bools that
funccache.c sets up for us? I'm not quite sure if getting friendly
with struct CachedFunctionHashKey is a good idea or not.)

* I find it confusing that you called "PLyProcedureCache *" variables
"pcache" in some places and "proc" in others. The latter choice seems
poor because mostly "proc" is a PLyProcedure pointer. Using "proc"
leads to constructions like "proc->proc", which I don't find
intelligible.

* The new code could do with more comments. I realize that plpython
is poorly commented in many places, but let's see if we can leave it
better than we found it.
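
For readers following along, the first point above can be illustrated outside PostgreSQL. What follows is only a minimal sketch in plain C, assuming setjmp/longjmp as a stand-in for the PG_TRY/ereport machinery. Every name in it (ctx_depth, throw_error, lookup, call_unsafe, call_safe, leftover_depth) is hypothetical and none of it is PL/Python code. It shows why a call that can throw should not sit between pushing the context and entering the guarded region: the error jumps straight past the pop.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-ins; this is not the PostgreSQL or PL/Python API. */
static jmp_buf *handler = NULL;  /* innermost active error handler */
static int ctx_depth = 0;        /* depth of a toy execution-context stack */

/* Toy "throw": jump to the innermost handler. */
static void throw_error(void)
{
    assert(handler != NULL);
    longjmp(*handler, 1);
}

/* Toy lookup that may fail, like any call that can raise an error. */
static void lookup(int fail)
{
    if (fail)
        throw_error();
}

/* Lookup sits between the push and the guarded region: an error skips the pop. */
static void call_unsafe(int fail)
{
    ctx_depth++;        /* push */
    lookup(fail);       /* may jump to the caller's handler */
    /* the guarded region would only start here */
    ctx_depth--;        /* pop, never reached on error */
}

/* Lookup sits inside the guarded region: the local handler pops, then re-throws. */
static void call_safe(int fail)
{
    jmp_buf local;
    jmp_buf *outer = handler;   /* assigned before setjmp, not modified after */

    ctx_depth++;                /* push */
    handler = &local;
    if (setjmp(local) == 0)
    {
        lookup(fail);
        handler = outer;
        ctx_depth--;            /* pop on the normal path */
    }
    else
    {
        handler = outer;
        ctx_depth--;            /* pop on the error path */
        throw_error();          /* propagate to the outer handler */
    }
}

/* Run one call under a top-level handler and report the leftover stack depth. */
static int leftover_depth(void (*call)(int), int fail)
{
    jmp_buf top;

    ctx_depth = 0;
    handler = &top;
    if (setjmp(top) == 0)
        call(fail);
    handler = NULL;
    return ctx_depth;
}
```

In this toy model, leftover_depth(call_unsafe, 1) reports one entry left on the stack, while leftover_depth(call_safe, 1) reports zero.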

regards, tom lane

#15

Matheus Alcantara

matheusssilv97@gmail.com

about 1 month ago

In reply to: Tom Lane (#14)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Wed Jun 17, 2026 at 6:56 PM -03, Tom Lane wrote:

* Your refactoring to have just one PLy_procedure_get call in
plpython3_call_handler is no good. You missed the comment
block just above:

    /*
     * Push execution context onto stack. It is important that this get
     * popped again, so avoid putting anything that could throw error between
     * here and the PG_TRY.
     */
    exec_ctx = PLy_push_execution_context(!nonatomic);

+   proc = PLy_procedure_get(fcinfo, false);
+
    PG_TRY();
    {

I counsel putting those PLy_procedure_get calls back where they were.

You're right, that was a mistake; it was not my intention to move the
call outside of PG_TRY(). I've moved the PLy_procedure_get() call back
inside the PG_TRY(). Since the new signature no longer needs a
per-call-context argument, a single call at the top of the PG_TRY block
now covers all three cases, and exec_ctx->curr_proc is set once right
after the lookup. Let me know if I misunderstood your point.

* I also question the decision to refactor where/how is_trigger is
computed; that doesn't seem necessary to the purposes of the patch,
nor is it a clear improvement. I'd just as soon leave that
mechanism alone as much as we can.

I've restored PLy_procedure_is_trigger() and the validator uses it again
exactly as before, instead of the inlined prorettype checks. The one
change that seems unavoidable to me is that the trigger type is now
determined inside the compile callback rather than passed in as a
PLyTrigType argument; that's forced by the funccache API, since
cached_function_compile() takes the FunctionCallInfo and the procedure
is created from within the callback instead of PLy_procedure_get(). Or
am I missing something?

(Alternative thought: should we rely on the isTrigger/isEventTrigger
bools that funccache.c sets up for us? I'm not quite sure if getting
friendly with struct CachedFunctionHashKey is a good idea or not.)

I left the callback using CALLED_AS_TRIGGER() / CALLED_AS_EVENT_TRIGGER()
rather than reaching into CachedFunctionHashKey. That keeps us away from
funccache.c internals and matches what plpgsql_compile_callback() does,
which seems to me the safer way to go. What do you think?

* I find it confusing that you called "PLyProcedureCache *" variables
"pcache" in some places and "proc" in others. The latter choice seems
poor because mostly "proc" is a PLyProcedure pointer. Using "proc"
leads to constructions like "proc->proc", which I don't find
intelligible.

Fixed. Definitely agree, oversight from my side.

* The new code could do with more comments. I realize that plpython
is poorly commented in many places, but let's see if we can leave it
better than we found it.

Added header comments to the new PLy_compile_callback and
PLy_delete_callback, expanded the validator comment about why the fake
fcinfo context is built, and expanded the SRF first-call-setup comment
to explain the ValuePerCall model, the per-call-site cache, and the
shutdown-callback handling.

I've also added a regression test. I'm not sure if there is a better way
to exercise this fix, but this test crashes without the patch applied.

---

On top of your points, I did another self-review pass over v2 and found
a possible pre-existing problem in v1 in the way the patch handled SRF
cleanup, which I've also fixed in v2.

The patch had switched the set-returning-function cleanup from the
original MemoryContextRegisterResetCallback to a
RegisterExprContextCallback (ShutdownPLyFunction), modeled on what
functions.c does for SQL functions. But that copies a property that I
don't think applies to PL/Python: ShutdownExprContext() does not invoke
ExprContext callbacks during an error abort (it only frees the callback
list), and functions.c is fine with that because, as its comment notes,
"transaction abort will take care of releasing executor resources."
PL/Python's resource is a Python refcount, and transaction abort does
not release those. So if a SETOF function was left partially iterated
and the surrounding query then errored, e.g.

CREATE OR REPLACE FUNCTION mysrf() RETURNS SETOF int LANGUAGE plpython3u AS $$
return [1,2,3,4,5]
$$;
SELECT mysrf() / 0;

the iterator's references were leaked for the life of the session. The
original code didn't have this problem because a memory-context reset
callback does run during abort.

Rather than reintroduce a second mechanism, v2 reuses the memory-context
callback that's already there for reference counting.
RemovePLyProcedureCache is registered on the FmgrInfo's fn_mcxt and
therefore runs on abort; it now also releases any Python state left
behind by an interrupted SRF. ShutdownPLyFunction is kept for the cases
it does handle correctly.
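
To make the difference between the two cleanup hooks concrete, here is a small self-contained sketch in plain C. It is only a toy model under stated assumptions: Callback, shutdown_expr_context, reset_memory_context, release_iterator, and refs_left_after_abort are hypothetical names, and the integer counter merely stands in for the Python reference held by a half-consumed iterator. The model encodes the behavior described above: on abort the expression-context path drops its callback list without calling anything, while the memory-context path always invokes the callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; not the PostgreSQL structures. */
typedef void (*cleanup_fn)(void *arg);

typedef struct Callback
{
    cleanup_fn  fn;
    void       *arg;
} Callback;

/* Toy expression-context shutdown: the callback runs only on a clean finish. */
static void shutdown_expr_context(Callback *cb, bool is_commit)
{
    assert(cb != NULL);
    if (cb->fn != NULL && is_commit)
        cb->fn(cb->arg);
    cb->fn = NULL;      /* the registration is dropped either way */
}

/* Toy memory-context reset: the callback runs unconditionally. */
static void reset_memory_context(Callback *cb)
{
    assert(cb != NULL);
    if (cb->fn != NULL)
        cb->fn(cb->arg);
    cb->fn = NULL;
}

/* Stand-in for dropping the reference held by an interrupted iterator. */
static void release_iterator(void *arg)
{
    int *refcount = arg;

    if (*refcount > 0)
        (*refcount)--;
}

/* Simulate an error abort and report how many references remain. */
static int refs_left_after_abort(bool use_memory_context_callback)
{
    int      refcount = 1;  /* held by the partially iterated result */
    Callback cb = { release_iterator, &refcount };

    if (use_memory_context_callback)
        reset_memory_context(&cb);
    else
        shutdown_expr_context(&cb, false);  /* abort path */
    return refcount;
}
```

In this model, refs_left_after_abort(false) leaves one reference behind (the leak described above), and refs_left_after_abort(true) leaves none.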

---

Thanks for the review! I've tried to address all your points in the
attached v2.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

#16

Tom Lane

tgl@sss.pgh.pa.us

about 1 month ago

In reply to: Matheus Alcantara (#15)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

Thanks for the review! I've tried to address all your points in the
attached v2.

Pushed after a round of review. I made some mostly-cosmetic changes,
such as rewriting comments (consolidating some stuff I thought was
duplicative). The main thing I fixed that was an actual bug was
you were careless about lifespan of variables around PG_TRY blocks.
The rule of thumb is that if a variable is modified inside PG_TRY
and then used after that block (including in the PG_CATCH) then it
has to be marked volatile. Where possible, I avoid using the
volatile marking by assigning the variable's value before PG_TRY.
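
For anyone unfamiliar with that rule: PG_TRY is built on sigsetjmp/siglongjmp, so the C standard's restriction on automatic variables modified between setjmp and longjmp applies. The following is a minimal sketch in plain C, assuming setjmp/longjmp as a stand-in for PG_TRY and an error throw; the function names are hypothetical. It shows the two safe patterns mentioned above: marking the variable volatile, or assigning it before the guarded region and not touching it inside.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;

/* Toy "throw": jump back to the setjmp point. */
static void fail_now(void)
{
    longjmp(env, 1);
}

/*
 * Pattern 1: the variable is modified inside the guarded region and read
 * after it, so it is marked volatile. Without volatile its value after
 * the longjmp would be indeterminate.
 */
static int steps_completed_volatile(void)
{
    volatile int steps = 0;

    if (setjmp(env) == 0)
    {
        steps = 1;
        steps = 2;
        fail_now();
        steps = 3;      /* not reached */
    }
    /* reached via longjmp in this sketch */
    return steps;
}

/*
 * Pattern 2: the variable gets its value before the guarded region and is
 * not modified inside it, so no volatile marking is needed.
 */
static int value_assigned_before(void)
{
    int limit = 42;

    assert(limit == 42);
    if (setjmp(env) == 0)
        fail_now();
    return limit;
}
```

Here steps_completed_volatile() returns 2 and value_assigned_before() returns 42, both well-defined under the standard.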

I've also added a regression test. I'm not sure if there is a better way
to exercise this fix, but this test crashes without the patch applied.

Kind of a hokey test, since it doesn't model the likely actual case
where the CREATE happens in another session, but this is as close as
we'll get without a much more complex test setup. I kept it, and
also added another test that exercises the early-termination path,
since code coverage showed me that ShutdownPLyFunction() wasn't being
reached.

regards, tom lane

#17

Matheus Alcantara

matheusssilv97@gmail.com

about 1 month ago

In reply to: Tom Lane (#16)

Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Sun Jun 21, 2026 at 4:40 PM -03, Tom Lane wrote:

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

Thanks for the review! I've tried to address all your points in the
attached v2.

Pushed after a round of review. I made some mostly-cosmetic changes,
such as rewriting comments (consolidating some stuff I thought was
duplicative). The main thing I fixed that was an actual bug was
you were careless about lifespan of variables around PG_TRY blocks.
The rule of thumb is that if a variable is modified inside PG_TRY
and then used after that block (including in the PG_CATCH) then it
has to be marked volatile. Where possible, I avoid using the
volatile marking by assigning the variable's value before PG_TRY.

Noted, thanks for the call.

I've also added a regression test. I'm not sure if there is a better way
to exercise this fix, but this test crashes without the patch applied.

Kind of a hokey test, since it doesn't model the likely actual case
where the CREATE happens in another session, but this is as close as
we'll get without a much more complex test setup. I kept it, and
also added another test that exercises the early-termination path,
since code coverage showed me that ShutdownPLyFunction() wasn't being
reached.

Thank you for reviewing and committing the patch!

--
Matheus Alcantara
EDB: https://www.enterprisedb.com
