BUG #7516: PL/Perl crash

Started by Marko Tiikkajaover 13 years ago18 messagesbugs
Jump to latest
#1Marko Tiikkaja
marko@joh.to

The following bug has been logged on the website:

Bug reference: 7516
Logged by: Marko Tiikkaja
Email address: pgmail@joh.to
PostgreSQL version: 9.1.5
Operating system: 64-bit linux
Description:

Hi,

We had a segmentation fault in PostgreSQL 9.1.5 with PL/PerlU.

The backtrace looks like this:

#0 0x00007f80be88cb04 in plperl_spi_exec_prepared (query=0x1671eb8
"0x168cd70", attr=0x0, argc=2, argv=0x19b4e40) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/pl/plperl/plperl.c:3373
#1 0x00007f80be88e4f0 in XS__spi_exec_prepared (my_perl=<optimized out>,
cv=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/pl/plperl/SPI.xs:141
#2 0x00007f80be5b67ff in Perl_pp_entersub () from /usr/lib/libperl.so.5.14
#3 0x00007f80be5adc96 in Perl_runops_standard () from
/usr/lib/libperl.so.5.14
#4 0x00007f80be54959e in Perl_call_sv () from /usr/lib/libperl.so.5.14
#5 0x00007f80be885c55 in plperl_call_perl_func (desc=0x189a010,
fcinfo=0x188c490) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/pl/plperl/plperl.c:2017
#6 0x00007f80be88a17c in plperl_func_handler (fcinfo=0x188c490) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/pl/plperl/plperl.c:2156
#7 plperl_call_handler (fcinfo=0x188c490) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/pl/plperl/plperl.c:1671
#8 0x0000000000707147 in fmgr_security_definer (fcinfo=0x188c490) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/utils/fmgr/fmgr.c:975
#9 0x0000000000575e45 in ExecMakeFunctionResult (fcache=0x188c420,
econtext=0x188c230, isNull=0x188cec8 "id", isDone=0x188cfe0)
at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/executor/execQual.c:1917
#10 0x0000000000578512 in ExecTargetList (isDone=0x7fff051891ec,
itemIsDone=0x188cfe0, isnull=0x188cec8 "id", values=0x188ceb0,
econtext=0x188c230, targetlist=0x188cfb0)
at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/executor/execQual.c:5210
#11 ExecProject (projInfo=<optimized out>, isDone=0x7fff051891ec) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/executor/execQual.c:5425
#12 0x0000000000588b4a in ExecResult (node=0x188c120) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/executor/nodeResult.c:155
#13 0x00000000005713e8 in ExecProcNode (node=0x188c120) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/executor/execProcnode.c:367
#14 0x000000000056e662 in ExecutePlan (dest=0x15ae900, direction=<optimized
out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT,
planstate=0x188c120, estate=0x188c010)
at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/executor/execMain.c:1440
15 standard_ExecutorRun (queryDesc=0x1464830, direction=<optimized out>,
count=0) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/executor/execMain.c:314
#16 0x0000000000642ee7 in PortalRunSelect (portal=0x1421710,
forward=<optimized out>, count=0, dest=0x15ae900) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/tcop/pquery.c:943
#17 0x00000000006443d0 in PortalRun (portal=0x1421710,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x15ae900,
altdest=0x15ae900, completionTag=0x7fff05189670 "")
at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/tcop/pquery.c:787
#18 0x0000000000640fba in exec_execute_message (max_rows=<optimized out>,
portal_name=0x15ae4f0 "") at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/tcop/postgres.c:1965
#19 PostgresMain (argc=<optimized out>, argv=<optimized out>,
username=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/tcop/postgres.c:4025
#20 0x0000000000602813 in BackendRun (port=0x145b990) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/postmaster/postmaster.c:3617
#21 BackendStartup (port=0x145b990) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/postmaster/postmaster.c:3302
#22 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/postmaster/postmaster.c:1466
#23 0x0000000000603281 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/postmaster/postmaster.c:1127
#24 0x000000000045a440 in main (argc=5, argv=0x1416170) at
/build/buildd/postgresql-9.1-9.1.5/build/../src/backend/main/main.c:199

It seems to have happened when a PL/PerlU executed a prepared statement
which calls another PL/PerlU function.

So far we have not been able to reproduce the problem. Any thoughts?
Someone seen something similar?

Regards,
Marko Tiikkaja

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#1)
Re: BUG #7516: PL/Perl crash

pgmail@joh.to writes:

We had a segmentation fault in PostgreSQL 9.1.5 with PL/PerlU.
...
It seems to have happened when a PL/PerlU executed a prepared statement
which calls another PL/PerlU function.

Hm. Is it possible that the prepared statement recursively called the
*same* plperl function? And that somebody did a CREATE OR REPLACE on
that function meanwhile? It looks to me like validate_plperl_function()
for the inner invocation could pull the rug out from under the outer
invocation's prodesc pointer. Although even granting all that, you'd
have to be quite unlucky to get a crash right there, because free'ing
the prodesc wouldn't ordinarily cause the memory to disappear. This
might be the wrong theory. You seem to have debug symbols for that
core dump --- can you tell which of the dereferences in line 3373
caused the crash?

regards, tom lane

#3Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#2)
Re: BUG #7516: PL/Perl crash

Hi,

On 03/09/2012 18:06, Tom Lane wrote:

pgmail@joh.to writes:

We had a segmentation fault in PostgreSQL 9.1.5 with PL/PerlU.
...
It seems to have happened when a PL/PerlU executed a prepared statement
which calls another PL/PerlU function.

Hm. Is it possible that the prepared statement recursively called the
*same* plperl function? And that somebody did a CREATE OR REPLACE on
that function meanwhile?

No, that doesn't seem possible in this case. The function calls some
prepared statements repeatedly, but no recursion should occur.

This
might be the wrong theory. You seem to have debug symbols for that
core dump --- can you tell which of the dereferences in line 3373
caused the crash?

The current_call_data->prodesc->fn_readonly one.

Please let me know if I can be of more assistance.

Regards,
Marko Tiikkaja

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#3)
Re: BUG #7516: PL/Perl crash

Marko Tiikkaja <pgmail@joh.to> writes:

On 03/09/2012 18:06, Tom Lane wrote:

This might be the wrong theory. You seem to have debug symbols for that
core dump --- can you tell which of the dereferences in line 3373
caused the crash?

The current_call_data->prodesc->fn_readonly one.
Please let me know if I can be of more assistance.

Hm, can you show us the contents of the *current_call_data struct?

Another thing that seems a bit risky is that plperl_func_handler sets up
current_call_data as a pointer to a mostly-zeroed struct and then does
assorted things that could throw error. That would leave a dangling
value of current_call_data ...

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: BUG #7516: PL/Perl crash

I wrote:

Another thing that seems a bit risky is that plperl_func_handler sets up
current_call_data as a pointer to a mostly-zeroed struct and then does
assorted things that could throw error. That would leave a dangling
value of current_call_data ...

Meh, scratch that --- I missed that the caller of plperl_func_handler is
where there's a PG_TRY block that's responsible for restoring the old
value of current_call_data. It would still be interesting to nose
around and see what's in *current_call_data, as well as any of the
pointed-to data structures that exist.

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#3)
Re: BUG #7516: PL/Perl crash

Marko Tiikkaja <pgmail@joh.to> writes:

On 03/09/2012 18:06, Tom Lane wrote:

Hm. Is it possible that the prepared statement recursively called the
*same* plperl function? And that somebody did a CREATE OR REPLACE on
that function meanwhile?

No, that doesn't seem possible in this case. The function calls some
prepared statements repeatedly, but no recursion should occur.

Hm. Well, I can definitely reproduce a segfault using this example:

CREATE OR REPLACE FUNCTION self_modify(INTEGER) RETURNS INTEGER AS $$
spi_exec_query('CREATE OR REPLACE FUNCTION self_modify(INTEGER) RETURNS INTEGER AS \'return $_[0] * 3;\' LANGUAGE plperl;');
spi_exec_query('select self_modify(42) AS a');
return $_[0] * 2;
$$ LANGUAGE plperl;

select self_modify(42);

The crash is 100% reproducible if you tweak plperl like this:

diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index bfa805d..307874c 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -2393,7 +2393,7 @@ validate_plperl_function(plperl_proc_ptr *proc_ptr, HeapTu
            activate_interpreter(oldinterp);
        }
        free(prodesc->proname);
-       free(prodesc);
+       memset(prodesc, 0, sizeof(plperl_proc_desc));
    }

return false;

Without that it's probabilistic, since the subsequent malloc in
compile_plperl_function is likely to realloc the exact same space.

So I think we need to institute a reference counting scheme for
the plperl_proc_desc records ...

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: BUG #7516: PL/Perl crash

I wrote:

Hm. Well, I can definitely reproduce a segfault using this example:

CREATE OR REPLACE FUNCTION self_modify(INTEGER) RETURNS INTEGER AS $$
spi_exec_query('CREATE OR REPLACE FUNCTION self_modify(INTEGER) RETURNS INTEGER AS \'return $_[0] * 3;\' LANGUAGE plperl;');
spi_exec_query('select self_modify(42) AS a');
return $_[0] * 2;
$$ LANGUAGE plperl;

select self_modify(42);

So I think we need to institute a reference counting scheme for
the plperl_proc_desc records ...

The attached patch fixes the problem I'm seeing. I am not sure whether
it fixes what you saw; the crash you showed is in the right place, but
unless there was a recursive call to a pl/perl function, I don't see
how the existing code could have freed the prodesc too soon.

regards, tom lane

#8Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#7)
Re: BUG #7516: PL/Perl crash

On 10/09/2012 00:37, Tom Lane wrote:

The attached patch fixes the problem I'm seeing. I am not sure whether
it fixes what you saw; the crash you showed is in the right place, but
unless there was a recursive call to a pl/perl function, I don't see
how the existing code could have freed the prodesc too soon.

Joel Jacobson managed to narrow it down to this test case, which crashes
consistently on Ubuntu 12.04 both with and without your patch. I,
however, wasn't able to reproduce the problem on my OS X Mountain Lion.
I'll try to get some more information about it tomorrow, but in the
mean time if you can reproduce the problem or think of something, I'll
post the test case.

Regards,
Marko Tiikkaja

Attachments:

perlbug.sqltext/plain; charset=windows-1252; name=perlbug.sqlDownload
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#8)
Re: BUG #7516: PL/Perl crash

Marko Tiikkaja <pgmail@joh.to> writes:

Joel Jacobson managed to narrow it down to this test case, which crashes
consistently on Ubuntu 12.04 both with and without your patch. I,
however, wasn't able to reproduce the problem on my OS X Mountain Lion.

Doesn't reproduce for me either ...

regards, tom lane

#10Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#9)
Re: BUG #7516: PL/Perl crash

On 9/12/12 1:50 AM, Tom Lane wrote:

Marko Tiikkaja <pgmail@joh.to> writes:

Joel Jacobson managed to narrow it down to this test case, which crashes
consistently on Ubuntu 12.04 both with and without your patch. I,
however, wasn't able to reproduce the problem on my OS X Mountain Lion.

Doesn't reproduce for me either ...

Ok, I can reproduce it on my Ubuntu virtual machine.

The problem looks like this:

#0 free_plperl_function (prodesc=0x24195a0) at plperl.c:2428
#1 0x00007ff47babc045 in plperl_call_handler (fcinfo=0x247c160) at
plperl.c:1710
#2 0x00000000005663fa in ExecMakeFunctionResult (fcache=0x247c0f0,
econtext=0x247bf00, isNull=0x247ca78 "", isDone=0x247cb90) at
execQual.c:1917
#3 0x0000000000568942 in ExecTargetList (isDone=0x7fff1402fc0c,
itemIsDone=0x247cb90, isnull=0x247ca78 "", values=0x247ca60,
econtext=0x247bf00, targetlist=0x247cb60) at execQual.c:5210
#4 ExecProject (projInfo=<optimized out>, isDone=0x7fff1402fc0c) at
execQual.c:5425
#5 0x0000000000578aea in ExecResult (node=0x247bdf0) at nodeResult.c:155
#6 0x0000000000561b18 in ExecProcNode (node=0x247bdf0) at
execProcnode.c:367
#7 0x000000000055ef2a in ExecutePlan (dest=0xad4600,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x247bdf0, estate=0x247bce0) at
execMain.c:1440
#8 standard_ExecutorRun (queryDesc=0x244f6d0, direction=<optimized
out>, count=0) at execMain.c:314
#9 0x000000000058192d in _SPI_pquery (tcount=<optimized out>,
fire_triggers=1 '\001', queryDesc=<optimized out>) at spi.c:2110
#10 _SPI_execute_plan (paramLI=0x245e1f0, snapshot=<optimized out>,
crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=1 '\001',
tcount=0, plan=<optimized out>) at spi.c:1922
#11 0x0000000000581e57 in SPI_execute_plan_with_paramlist
(plan=0x2476c30, params=0x245e1f0, read_only=0 '\000', tcount=0) at
spi.c:423
#12 0x00007ff47b0c7075 in exec_run_select (estate=0x7fff14030270,
expr=0x24569f0, maxtuples=0, portalP=0x0) at pl_exec.c:4573
#13 0x00007ff47b0ca7b4 in exec_stmt_perform (estate=0x7fff14030270,
stmt=<optimized out>) at pl_exec.c:1413
#14 exec_stmt (stmt=0x2456ab0, estate=0x7fff14030270) at pl_exec.c:1289
#15 exec_stmts (estate=0x7fff14030270, stmts=<optimized out>) at
pl_exec.c:1248
#16 0x00007ff47b0cce59 in exec_stmt_block (estate=0x7fff14030270,
block=0x2457448) at pl_exec.c:1047
#17 0x00007ff47b0ca798 in exec_stmt (stmt=0x2457448,
estate=0x7fff14030270) at pl_exec.c:1281
#18 exec_stmts (estate=0x7fff14030270, stmts=<optimized out>) at
pl_exec.c:1248
#19 0x00007ff47b0ccfcc in exec_stmt_block (estate=0x7fff14030270,
block=0x2457508) at pl_exec.c:1186
#20 0x00007ff47b0cda37 in plpgsql_exec_function (func=0x243a140,
fcinfo=0x7fff14030580) at pl_exec.c:324
#21 0x00007ff47b0c26d3 in plpgsql_call_handler (fcinfo=0x7fff14030580)
at pl_handler.c:122
#22 0x0000000000566d4d in ExecMakeTableFunctionResult
(funcexpr=0x2443368, econtext=0x24428b0, expectedDesc=0x2443110,
randomAccess=0 '\000') at execQual.c:2146
#23 0x0000000000578381 in FunctionNext (node=0x24427a0) at
nodeFunctionscan.c:65
#24 0x0000000000568dde in ExecScanFetch (recheckMtd=0x578300
<FunctionRecheck>, accessMtd=0x578310 <FunctionNext>, node=0x24427a0) at
execScan.c:82
#25 ExecScan (node=0x24427a0, accessMtd=0x578310 <FunctionNext>,
recheckMtd=0x578300 <FunctionRecheck>) at execScan.c:132
#26 0x0000000000561a78 in ExecProcNode (node=0x24427a0) at
execProcnode.c:416
#27 0x000000000055ef2a in ExecutePlan (dest=0xad4600,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x24427a0, estate=0x2442690) at
execMain.c:1440
#28 standard_ExecutorRun (queryDesc=0x243f700, direction=<optimized
out>, count=0) at execMain.c:314
#29 0x000000000058192d in _SPI_pquery (tcount=<optimized out>,
fire_triggers=1 '\001', queryDesc=<optimized out>) at spi.c:2110
#30 _SPI_execute_plan (paramLI=0x243f670, snapshot=<optimized out>,
crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=1 '\001',
tcount=0, plan=<optimized out>) at spi.c:1922
#31 0x0000000000581f39 in SPI_execute_plan (plan=0x23ee750,
Values=0x243df38, Nulls=0x24376b0 " RCHAR", read_only=0 '\000',
tcount=0) at spi.c:391
#32 0x00007ff47babce2d in plperl_spi_exec_prepared (query=0x2437690
"0x22930e0", attr=0x0, argc=2, argv=0x243df18) at plperl.c:3425
#33 0x00007ff47babe810 in XS__spi_exec_prepared (my_perl=<optimized
out>, cv=<optimized out>) at SPI.xs:141
#34 0x00007ff47b7e67ff in Perl_pp_entersub (my_perl=0x237d800) at
pp_hot.c:3046
#35 0x00007ff47b7ddc96 in Perl_runops_standard (my_perl=0x237d800) at
run.c:41
#36 0x00007ff47b77959e in Perl_call_sv (my_perl=0x237d800, sv=0x24017f8,
flags=10) at perl.c:2647
#37 0x00007ff47bab5f62 in plperl_call_perl_func (desc=0x24195a0,
fcinfo=0x242fa90) at plperl.c:2056
#38 0x00007ff47baba45b in plperl_func_handler (fcinfo=0x242fa90) at
plperl.c:2196
#39 plperl_call_handler (fcinfo=0x242fa90) at plperl.c:1705
#40 0x00000000005663fa in ExecMakeFunctionResult (fcache=0x242fa20,
econtext=0x242f830, isNull=0x24303a8 "", isDone=0x24304c0) at
execQual.c:1917
#41 0x0000000000568942 in ExecTargetList (isDone=0x7fff140312fc,
itemIsDone=0x24304c0, isnull=0x24303a8 "", values=0x2430390,
econtext=0x242f830, targetlist=0x2430490) at execQual.c:5210
#42 ExecProject (projInfo=<optimized out>, isDone=0x7fff140312fc) at
execQual.c:5425
#43 0x0000000000578aea in ExecResult (node=0x242f720) at nodeResult.c:155
#44 0x0000000000561b18 in ExecProcNode (node=0x242f720) at
execProcnode.c:367
#45 0x000000000055ef2a in ExecutePlan (dest=0x2429a80,
direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
operation=CMD_SELECT, planstate=0x242f720, estate=0x242f610) at
execMain.c:1440
#46 standard_ExecutorRun (queryDesc=0x22ae450, direction=<optimized
out>, count=0) at execMain.c:314
#47 0x0000000000629317 in PortalRunSelect (portal=0x22abc30,
forward=<optimized out>, count=0, dest=0x2429a80) at pquery.c:943
#48 0x000000000062a6a0 in PortalRun (portal=0x22abc30,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2429a80,
altdest=0x2429a80, completionTag=0x7fff14031740 "") at pquery.c:787
#49 0x000000000062698c in exec_simple_query (query_string=0x233b200
"SELECT func0003();") at postgres.c:1020
#50 PostgresMain (argc=<optimized out>, argv=<optimized out>,
username=<optimized out>) at postgres.c:3968
#51 0x00000000005eba29 in BackendRun (port=0x22b5c70) at postmaster.c:3617
#52 BackendStartup (port=0x22b5c70) at postmaster.c:3302
#53 ServerLoop () at postmaster.c:1466
#54 0x00000000005ec34c in PostmasterMain (argc=<optimized out>,
argv=0x228b540) at postmaster.c:1127
#55 0x0000000000453de0 in main (argc=1, argv=0x228b540) at main.c:199

What happens is that free_plperl_function() for some reason gets called
with the prodesc of func0003. Later on, func0003 wants to get rid of
his prodesc and I get a crash. What's weird about this is that
current_call_data->prodesc actually points to the correct prodesc (the
one of func0005), but still somehow something different is passed to
free_plperl_function().

Anything I can do to help further, let me know.

Regards,
Marko Tiikkaja

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#10)
Re: BUG #7516: PL/Perl crash

Marko Tiikkaja <pgmail@joh.to> writes:

On 9/12/12 1:50 AM, Tom Lane wrote:

Marko Tiikkaja <pgmail@joh.to> writes:

Joel Jacobson managed to narrow it down to this test case, which crashes
consistently on Ubuntu 12.04 both with and without your patch. I,
however, wasn't able to reproduce the problem on my OS X Mountain Lion.

Doesn't reproduce for me either ...

Ok, I can reproduce it on my Ubuntu virtual machine.

Hm, I wonder if it's Ubuntu-specific? What Perl version is that exactly?

What happens is that free_plperl_function() for some reason gets called
with the prodesc of func0003. Later on, func0003 wants to get rid of
his prodesc and I get a crash. What's weird about this is that
current_call_data->prodesc actually points to the correct prodesc (the
one of func0005), but still somehow something different is passed to
free_plperl_function().

The only theory that comes to mind is that current_call_data is somehow
getting aliased (free'd and realloc'd). I don't see how that could
happen, but it occurs to me that it's kinda dumb to be palloc'ing it
in the first place. Its lifetime is exactly that of the call, so it
would be simpler and more foolproof to make it a local variable.

I've pushed a HEAD patch to do that, and I wonder if you could check
whether it changes anything.

regards, tom lane

#12Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#11)
Re: BUG #7516: PL/Perl crash

On 13/09/2012 19:48, Tom Lane wrote:

Marko Tiikkaja <pgmail@joh.to> writes:

On 9/12/12 1:50 AM, Tom Lane wrote:

Hm, I wonder if it's Ubuntu-specific? What Perl version is that exactly?

We've reproduced it on both 5.14.2 and 5.16.1.

What happens is that free_plperl_function() for some reason gets called
with the prodesc of func0003. Later on, func0003 wants to get rid of
his prodesc and I get a crash. What's weird about this is that
current_call_data->prodesc actually points to the correct prodesc (the
one of func0005), but still somehow something different is passed to
free_plperl_function().

The only theory that comes to mind is that current_call_data is somehow
getting aliased (free'd and realloc'd). I don't see how that could
happen, but it occurs to me that it's kinda dumb to be palloc'ing it
in the first place. Its lifetime is exactly that of the call, so it
would be simpler and more foolproof to make it a local variable.

I've pushed a HEAD patch to do that, and I wonder if you could check
whether it changes anything.

Will look at that tomorrow, thanks.

Regards,
Marko Tiikkaja

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#12)
Re: BUG #7516: PL/Perl crash

Marko Tiikkaja <pgmail@joh.to> writes:

On 13/09/2012 19:48, Tom Lane wrote:

Hm, I wonder if it's Ubuntu-specific? What Perl version is that exactly?

We've reproduced it on both 5.14.2 and 5.16.1.

Meh. I've managed to reproduce it on the fifth system I tried. I don't
think it's got anything to do with the Perl version, but with the gcc
version (which is 4.7.0 on this machine). Apparently, very recent gcc
versions are willing to throw away this line of code:

PG_CATCH();
{
-----> current_call_data = save_call_data;
activate_interpreter(oldinterp);
PG_RE_THROW();
}
PG_END_TRY();

current_call_data = save_call_data;
activate_interpreter(oldinterp);
return retval;

Apparently the reasoning is that current_call_data is a static and
therefore the compiler can see everything that can happen to it and
therefore this store into current_call_data is dead code, since the
store six lines below will replace it. Sigh. I shall file a bug,
but I've found that the current crop of gcc maintainers are quite
likely to reject such reports.

We could fix the immediate problem by marking current_call_data volatile
(I've tested that this makes the problem go away on F17), but I think
what we'd better do instead is mark pg_re_throw() as noreturn.
Hopefully that will also prevent this misoptimization, as well as
similar ones that might be happening elsewhere.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: BUG #7516: PL/Perl crash

I wrote:

Apparently the reasoning is that current_call_data is a static and
therefore the compiler can see everything that can happen to it and
therefore this store into current_call_data is dead code, since the
store six lines below will replace it. Sigh. I shall file a bug,
but I've found that the current crop of gcc maintainers are quite
likely to reject such reports.

Filed at https://bugzilla.redhat.com/show_bug.cgi?id=857236

On the good side: developing a minimal test case showed me that the code
is incorrectly compiled only if perl.h has been included. (WTF? No,
I don't know why.) So at least this gcc bug should only be affecting
plperl.c and not the rest of postgres.

We could fix the immediate problem by marking current_call_data volatile
(I've tested that this makes the problem go away on F17), but I think
what we'd better do instead is mark pg_re_throw() as noreturn.
Hopefully that will also prevent this misoptimization, as well as
similar ones that might be happening elsewhere.

But, of course, pg_re_throw() already is marked noreturn.

A probably less costly solution than marking current_call_data volatile
is just to make it not-static.

regards, tom lane

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#14)
Re: BUG #7516: PL/Perl crash

I wrote:

A probably less costly solution than marking current_call_data volatile
is just to make it not-static.

And on still further investigation, the patch I just applied to HEAD
seems to make it go away too. Bizarre as can be. If that holds up for
you, I think back-patching that change is less ugly than making the
variable non-static.

regards, tom lane

#16Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#15)
Re: BUG #7516: PL/Perl crash

On 9/13/12 11:58 PM, Tom Lane wrote:

I wrote:

A probably less costly solution than marking current_call_data volatile
is just to make it not-static.

And on still further investigation, the patch I just applied to HEAD
seems to make it go away too. Bizarre as can be. If that holds up for
you, I think back-patching that change is less ugly than making the
variable non-static.

It does, indeed. I can't reproduce the bug on my end with that patch
applied.

Regards,
Marko Tiikkaja

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Tiikkaja (#16)
Re: BUG #7516: PL/Perl crash

Marko Tiikkaja <pgmail@joh.to> writes:

On 9/13/12 11:58 PM, Tom Lane wrote:

And on still further investigation, the patch I just applied to HEAD
seems to make it go away too. Bizarre as can be. If that holds up for
you, I think back-patching that change is less ugly than making the
variable non-static.

It does, indeed. I can't reproduce the bug on my end with that patch
applied.

Here's what Jakub Jelinek wrote in the RH bugzilla:

That is a glibc bug. While in setjmp.h the __sigsetjmp prototype is properly
marked as non-leaf:
extern int __sigsetjmp (struct __jmp_buf_tag __env[1], int __savemask)
__THROWNL;
in pthread.h it is incorrectly marked as leaf:
extern int __sigsetjmp (struct __jmp_buf_tag *__env, int __savemask) __THROW;
When pthread.h is fixed, the testcase works properly.

So that explains why including perl.h is relevant. Would you like to
try modifying your copy of pthread.h to confirm it's the same situation
on Ubuntu?

It's probably pure luck that my no-palloc patch dodges the problem, but
anyway it seems like a good enough work-around until the glibc issue is
fixed. I'll go ahead and back-patch that.

regards, tom lane

#18Marko Tiikkaja
marko@joh.to
In reply to: Tom Lane (#17)
Re: BUG #7516: PL/Perl crash

On 9/14/12 4:26 PM, Tom Lane wrote:

So that explains why including perl.h is relevant. Would you like to
try modifying your copy of pthread.h to confirm it's the same situation
on Ubuntu?

Yup, that fixes it.

Regards,
Marko Tiikkaja