BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

Started by PG Bug reporting form | 15 days ago | 4 messages | bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 19480
Logged by: Andrzej Doros
Email address: adoros@starfishstorage.com
PostgreSQL version: 17.9
Operating system: Ubuntu 22.04.5 LTS (x86_64), kernel 5.15, glibc 2.
Description:

PostgreSQL version: 17.9 (production crash), confirmed identical on 17.10
OS: Ubuntu 22.04.5 LTS, x86_64, kernel 5.15, glibc 2.35
Package: postgresql-plpython3-17 from pgdg apt repository

DESCRIPTION
-----------

A PL/Python set-returning function (SRF) crashes the backend with SIGSEGV when
another session executes CREATE OR REPLACE FUNCTION (or ALTER FUNCTION) on the
same function while the SRF is mid-iteration.

This is a use-after-free. srfstate->savedargs is allocated inside proc->mcxt by
PLy_function_save_args() (plpy_exec.c:503). On each per-call SRF invocation,
plpython3_call_handler calls PLy_procedure_get(), which may call
PLy_procedure_delete(old_proc) -> MemoryContextDelete(old_proc->mcxt) if the
function's pg_proc row has changed (different xmin or ctid). After that,
srfstate->savedargs is a dangling pointer; it is not cleared. The next
PLy_function_restore_args() reads freed memory:

    if (srfstate->savedargs)    /* non-NULL dangling pointer */
        PLy_function_restore_args(proc, srfstate->savedargs);    /* reads freed mem */

Inside PLy_function_restore_args (plpy_exec.c:551):

    for (i = 0; i < savedargs->nargs; i++)    /* nargs from freed memory */
    {
        if (proc->argnames[i] && ...)
            PyDict_SetItemString(..., proc->argnames[i], ...);

When savedargs->nargs is garbage (e.g. 2056017128 in two production core dumps),
proc->argnames[i] for large i reads an invalid pointer, which is passed to
PyDict_SetItemString -> PyUnicode_FromString -> strlen -> SIGSEGV.
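
The sequence described above can be illustrated with a small toy model. This is
a sketch only, not PostgreSQL code: the classes MemoryContext, SavedArgs, Proc
and ProcCache are hypothetical stand-ins, and the model raises an exception
where the C code would silently read whatever bytes happen to be in the freed
chunk.

```python
from dataclasses import dataclass, field


class MemoryContext:
    """Toy stand-in for a PostgreSQL memory context (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def delete(self):
        self.alive = False


@dataclass
class SavedArgs:
    owner: MemoryContext  # context the struct was allocated in
    nargs: int

    def read_nargs(self):
        # In C this read returns garbage; the model raises so the bug is visible.
        if not self.owner.alive:
            raise RuntimeError("use-after-free: savedargs outlived its context")
        return self.nargs


@dataclass
class Proc:
    xmin: int
    ctid: tuple
    mcxt: MemoryContext = field(default_factory=lambda: MemoryContext("proc"))


class ProcCache:
    """Models the lookup step: rebuild the proc if the pg_proc row changed."""

    def __init__(self):
        self.entry = None

    def get(self, xmin, ctid):
        old = self.entry
        if old is not None and (old.xmin, old.ctid) == (xmin, ctid):
            return old
        if old is not None:
            old.mcxt.delete()  # everything allocated in the old context is gone
        self.entry = Proc(xmin, ctid)
        return self.entry


cache = ProcCache()
proc = cache.get(xmin=100, ctid=(0, 1))       # first SRF call
saved = SavedArgs(owner=proc.mcxt, nargs=1)   # kept across calls in SRF state

proc = cache.get(xmin=100, ctid=(0, 1))       # row unchanged: same proc
same_row_nargs = saved.read_nargs()

proc = cache.get(xmin=101, ctid=(0, 2))       # row replaced by another session
try:
    saved.read_nargs()
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```

The model only captures the lifetime mismatch: the saved-arguments struct is
owned by a context that the cache lookup is allowed to delete between calls.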

CRASH STACK (two identical core dumps from production, PG 17.9, Ubuntu 22.04)
----------------------------------------------------------------------------

#0 __strlen_evex()
#1 PyUnicode_FromString(u=0x69ffff0000)
#2 PyDict_SetItemString(...)
#3 PLy_function_restore_args(proc=..., savedargs=...)
#4 PLy_exec_function(...)
#5 plpython3_call_handler(...)
#6 fmgr_security_definer(...)
#7 ExecMakeTableFunctionResult(...)

State from the newer core dump:

proc->proname = "tags_report_plpython"
proc->nargs = 1
proc->argnames[0]= "flavour"
savedargs->nargs = 2056017128 <- should be 1; contains garbage
savedargs->namedargs[0] = 'tags' <- still valid (not yet overwritten)
i = 4 <- loop has iterated far past argnames[]

TRIGGER CONDITION
-----------------

The pg_proc invalidation reaches Session A's backend when
AcceptInvalidationMessages() is called. This happens when Session A's Python
code calls plpy.execute() with a statement that acquires a NEW relation lock
(e.g. CREATE TEMP TABLE, any table not previously locked in this statement).
Simply calling plpy.execute("SELECT 1") is not sufficient because the lock on
pg_proc is already held and subsequent requests are served from the per-process
lock table without invoking AcceptInvalidationMessages.

In production the trigger is autovacuum on pg_proc (which moves the tuple's
ctid) or any concurrent DDL from another session. Long-running SRFs (hours)
are much more likely to hit this window.

STEPS TO REPRODUCE
------------------

Requires two concurrent sessions and PostgreSQL with plpython3u.

Session A — start and leave running:

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION repro_srf(flavour VARCHAR)
RETURNS TABLE (i BIGINT) AS $$
import time
for i in range(100):
    # CREATE TEMP TABLE acquires a new relation lock each iteration,
    # which causes AcceptInvalidationMessages to be called.
    plpy.execute(f"CREATE TEMP TABLE _rt_{i} (x int)")
    plpy.execute(f"DROP TABLE _rt_{i}")
    time.sleep(0.3)
    yield i
$$ LANGUAGE plpython3u VOLATILE;

SELECT count(*) FROM repro_srf('test');

Session B — while Session A is running (after ~2 seconds):

CREATE OR REPLACE FUNCTION repro_srf(flavour VARCHAR)
RETURNS TABLE (i BIGINT) AS $$
import time
for i in range(100):
    plpy.execute(f"CREATE TEMP TABLE _rt_{i} (x int)")
    plpy.execute(f"DROP TABLE _rt_{i}")
    time.sleep(0.3)
    yield i
$$ LANGUAGE plpython3u VOLATILE;

NOTE: In a minimal test without memory pressure, the freed savedargs memory
is often not overwritten quickly enough to produce a crash: savedargs->nargs
accidentally retains its correct value of 1 and restore_args succeeds. Under
production load (long-running SRF, many Python allocations), the freed region
is overwritten and the crash occurs.

The crash can be triggered deterministically with gdb by setting
savedargs->nargs to a large value immediately after PLy_procedure_delete fires
(see gdb script below). This produces the identical crash stack seen in
production.

GDB CONFIRMATION (PostgreSQL 17.10)
-------------------------------------

The following gdb session was used to confirm the exact sequence:

(gdb) b PLy_procedure_delete
(gdb) commands 1
  printf "DELETE proname=%s mcxt=%p\n", proc->proname, proc->mcxt
  set $corrupt_next = 1
  c
end

(gdb) b PLy_function_restore_args
(gdb) commands 2
  if $corrupt_next
    set {int}((long)savedargs + 24) = 2056017128
    set $corrupt_next = 0
  end
  c
end

Output:

DELETE proname=repro_srf mcxt=0x5686641e1b20
[PLy_function_restore_args fires with savedargs=0x5686641e28e8]
[nargs set to 2056017128]
Program received signal SIGSEGV, Segmentation fault.
__strlen_avx2 ()

PostgreSQL log:
server process (PID 366) was terminated by signal 11: Segmentation fault
all server processes terminated; reinitializing

AFFECTED CODE
-------------

src/pl/plpython/plpy_exec.c, lines 503-506:
PLy_function_save_args allocates savedargs in proc->mcxt

src/pl/plpython/plpy_exec.c, lines 117-119:
PLy_function_restore_args is called with potentially dangling savedargs
(no check whether proc was rebuilt since savedargs was created)

src/pl/plpython/plpy_procedure.c, line 405 (PLy_procedure_delete):
MemoryContextDelete(proc->mcxt) frees savedargs without nulling
srfstate->savedargs

PROPOSED FIX
------------

The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
be deleted at any per-call boundary) rather than to
funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).

Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
Change PLy_function_save_args to accept a MemoryContext parameter and pass
funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
references are valid regardless of which MemoryContext holds the struct.

Option B — detect proc rebuild and discard stale savedargs:
After PLy_procedure_get returns a new proc, check whether it differs from the
proc that created srfstate->savedargs. If so, discard savedargs
(PLy_function_drop_args or simply set to NULL) and skip the restore.
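
As a rough illustration of how the two options differ, the following toy model
compares the current behaviour with Option A and Option B. It is a sketch under
stated assumptions, not a patch: MemoryContext, read and run_srf are
hypothetical names, a "generation" counter stands in for the check of whether
the proc was rebuilt, and only the lifetime of the savedargs struct is modelled.

```python
class MemoryContext:
    """Toy stand-in for a PostgreSQL memory context (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def alloc(self, value):
        # A chunk remembers which context owns it.
        return {"ctx": self, "value": value}

    def delete(self):
        self.alive = False


def read(chunk):
    # In C a stale read returns garbage; the model raises instead.
    if not chunk["ctx"].alive:
        raise RuntimeError("dangling")
    return chunk["value"]


def run_srf(option):
    multi_call_ctx = MemoryContext("multi_call_memory_ctx")  # whole SRF lifetime
    proc_ctx = MemoryContext("proc->mcxt, generation 1")
    proc_generation = 1

    # First call: save the arguments.
    if option == "A":
        saved = multi_call_ctx.alloc({"nargs": 1})
    else:
        saved = proc_ctx.alloc({"nargs": 1})
    saved_generation = proc_generation

    # Between calls the function is replaced: the old proc context is deleted
    # and a new proc is built.
    proc_ctx.delete()
    proc_ctx = MemoryContext("proc->mcxt, generation 2")
    proc_generation = 2

    # Next call: Option B drops savedargs that belong to an older proc.
    if option == "B" and saved_generation != proc_generation:
        saved = None
    if saved is None:
        return "restore skipped"
    try:
        return f"restored nargs={read(saved)['nargs']}"
    except RuntimeError:
        return "dangling read"


results = {opt: run_srf(opt) for opt in ("current", "A", "B")}
print(results)
```

In this model Option A keeps the restore working across a rebuild, while
Option B avoids the stale read by skipping the restore. Neither says anything
about other per-proc execution state, which the model does not cover.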

#2 Matheus Alcantara
matheusssilv97@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:

> The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
> be deleted at any per-call boundary) rather than to
> funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).
>
> Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
> Change PLy_function_save_args to accept a MemoryContext parameter and pass
> funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
> references are valid regardless of which MemoryContext holds the struct.
>
> Option B — detect proc rebuild and discard stale savedargs:
> After PLy_procedure_get returns a new proc, check whether it differs from the
> proc that created srfstate->savedargs. If so, discard savedargs
> (PLy_function_drop_args or simply set to NULL) and skip the restore.

Hi, thank you for the very detailed bug report. I've managed to
reproduce the issue on master.

Option A seems to fix the issue (see attached patch), but I've found
another issue while playing with this that I think is related:

CREATE OR REPLACE FUNCTION trigger_stack_overflow(x BIGINT)
RETURNS TABLE(i BIGINT) AS $$
import time
plpy.execute(f"CREATE TEMP TABLE _rt_{x} (x int)")
plpy.execute(f"DROP TABLE _rt_{x}")
time.sleep(0.3)
plpy.execute("SELECT trigger_stack_overflow(1)")
yield x
$$ LANGUAGE plpython3u VOLATILE;

Run SELECT trigger_stack_overflow(1), then in another session execute the
CREATE OR REPLACE and wait for the first session to crash with this
stack trace:
frame #3: 0x000000010554a694 postgres`ExceptionalCondition(conditionName="proc->calldepth > 0", fileName="../src/pl/plpython/plpy_exec.c", lineNumber=701) at assert.c:65:2
frame #4: 0x0000000105e41984 plpython3.dylib`PLy_global_args_pop(proc=0x000000014b03cf00) at plpy_exec.c:701:2
frame #5: 0x0000000105e40d94 plpython3.dylib`PLy_exec_function(fcinfo=0x000000011e077738, proc=0x000000014b03cf00) at plpy_exec.c:264:3

The expected output from the first session should be something like
this:

ERROR: 54001: error fetching next item from iterator
DETAIL: spiexceptions.StatementTooComplex: error fetching next item from iterator
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.

This is because when PLy_procedure_delete() is executed from
PLy_procedure_get(), it also destroys information related to recursive
functions, such as "calldepth", "argstack" and "globals". This causes the
assertion failure Assert(proc->calldepth > 0) in PLy_global_args_pop() when
it is executed in the PG_CATCH block of PLy_exec_function(), or EXC_BAD_ACCESS
when accessing "argstack" or "globals".

Although changing the memory context where savedargs is allocated fixes
the reported issue, I think that the long-term fix is to preserve such
necessary execution information during PLyProcedure re-creation. I'm
still studying the code to see if and how this can be implemented.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Attachments:

0001-plpython-Use-correct-memory-context-for-savedargs.patch (text/plain; charset=utf-8, +20 -8)

#3 Matheus Alcantara
matheusssilv97@gmail.com
In reply to: Matheus Alcantara (#2)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

On 25/05/26 19:26, Matheus Alcantara wrote:

> On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:
>> The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
>> be deleted at any per-call boundary) rather than to
>> funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).
>>
>> Option A — allocate savedargs in funcctx->multi_call_memory_ctx:
>> Change PLy_function_save_args to accept a MemoryContext parameter and pass
>> funcctx->multi_call_memory_ctx from PLy_exec_function. The saved PyObject*
>> references are valid regardless of which MemoryContext holds the struct.
>>
>> Option B — detect proc rebuild and discard stale savedargs:
>> After PLy_procedure_get returns a new proc, check whether it differs from the
>> proc that created srfstate->savedargs. If so, discard savedargs
>> (PLy_function_drop_args or simply set to NULL) and skip the restore.
>
> Hi, thank you for the very detailed bug report. I've managed to
> reproduce the issue on master.
>
> Option A seems to fix the issue (see attached patch), but I've found
> another issue while playing with this that I think is related:
>
> ...
>
> This is because when PLy_procedure_delete() is executed from
> PLy_procedure_get(), it also destroys information related to recursive
> functions, such as "calldepth", "argstack" and "globals". This causes the
> assertion failure Assert(proc->calldepth > 0) in PLy_global_args_pop() when
> it is executed in the PG_CATCH block of PLy_exec_function(), or EXC_BAD_ACCESS
> when accessing "argstack" or "globals".
>
> Although changing the memory context where savedargs is allocated fixes
> the reported issue, I think that the long-term fix is to preserve such
> necessary execution information during PLyProcedure re-creation. I'm
> still studying the code to see if and how this can be implemented.

This is proving tricky to debug. I have not been able to reproduce the
issue with asserts disabled, not even with the steps shared in the bug
report.

Andrzej, could you please confirm whether you hit this failure with asserts
enabled? And if they are enabled, could you please check whether it also
happens with asserts disabled?

Also, version 17.10 was released some weeks ago; can you also test
against this new minor release?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matheus Alcantara (#2)
Re: BUG #19480: PL/Python SRF crashes (SIGSEGV) when function is replaced mid-iteration: use-after-free in PLy_funct

"Matheus Alcantara" <matheusssilv97@gmail.com> writes:

> On Fri May 15, 2026 at 8:11 AM -03, PG Bug reporting form wrote:
>> The root cause is that srfstate->savedargs is tied to proc->mcxt (which can
>> be deleted at any per-call boundary) rather than to
>> funcctx->multi_call_memory_ctx (which lives for the entire SRF lifetime).
>
> Option A seems to fix the issue (see attached patch), but I've found
> another issue while playing with this that I think is related:
> ...
> This is because when PLy_procedure_delete() is executed from
> PLy_procedure_get(), it also destroys information related to recursive
> functions, such as "calldepth", "argstack" and "globals". This causes the
> assertion failure Assert(proc->calldepth > 0) in PLy_global_args_pop() when
> it is executed in the PG_CATCH block of PLy_exec_function(), or EXC_BAD_ACCESS
> when accessing "argstack" or "globals".

Yeah. The bigger picture though is: if we are re-entrantly calling
either a recursive function or a SRF, we should not destroy any of the
existing state, nor do we want to replace the function body. The only
way to have sane behavior is to keep executing the same function body
until the execution instance (recursion level or continued SRF) is
done. So these concerns about associated state are only part of the
problem.

plpgsql ran into this years ago, and its solution has been to maintain
a reference count on each function parsetree and not destroy an
obsoleted parsetree till the reference count goes to zero. I've had
in the back of my head that the other PLs need to do likewise, but it
hasn't gotten to the front of the to-do list, mainly because the other
PLs are much less used and so field complaints about this have been
rare. I had hoped also that the language interpreters underlying the
other PLs might solve some of this for us, but it's unclear to what
extent they help. Certainly it's not cool to be clobbering our own
execution state that's outside the language interpreter.

We might want to go as far as converting the other PLs to use the
utils/cache/funccache.c infrastructure, but perhaps there is a
less invasive fix. Certainly, a fix based on funccache.c could not
be back-patched. (On the other hand, given the rarity of complaints,
perhaps a HEAD-only fix is acceptable.)

regards, tom lane
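
The reference-counting idea described in the last message can be sketched as a
toy model. This is an illustration only, not plpgsql's or funccache.c's actual
code: FunctionVersion, FunctionCache, acquire and release are hypothetical
names, and row_id stands in for whatever identifies the pg_proc row version.

```python
class FunctionVersion:
    """One compiled version of a function body (hypothetical model)."""

    def __init__(self, body, row_id):
        self.body = body
        self.row_id = row_id      # stand-in for the pg_proc row identity
        self.use_count = 0        # active execution instances
        self.obsolete = False
        self.destroyed = False


class FunctionCache:
    def __init__(self):
        self.current = None

    def acquire(self, body, row_id):
        """Start a new execution instance on the up-to-date version."""
        cur = self.current
        if cur is None or cur.row_id != row_id:
            if cur is not None:
                cur.obsolete = True
                self._maybe_destroy(cur)  # only frees if nobody is using it
            cur = self.current = FunctionVersion(body, row_id)
        cur.use_count += 1
        return cur

    def release(self, version):
        """End an execution instance (recursion level or finished SRF)."""
        version.use_count -= 1
        self._maybe_destroy(version)

    @staticmethod
    def _maybe_destroy(version):
        if version.obsolete and version.use_count == 0:
            version.destroyed = True


cache = FunctionCache()
running = cache.acquire("body v1", row_id=1)  # SRF starts iterating
newer = cache.acquire("body v2", row_id=2)    # function replaced; new call starts
state_mid = (running.destroyed, running.body)
print(state_mid)

cache.release(running)                        # SRF finishes
cache.release(newer)
print(running.destroyed, newer.destroyed)
```

In this model the running instance keeps executing the old body until it
finishes, and the obsolete version is destroyed only when its use count drops
to zero, while new calls pick up the replacement.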