[PATCH] plpython function causes server panic

Started by Hao Zhangover 2 years ago26 messageshackers
Jump to latest
#1Hao Zhang
zhrt1446384557@gmail.com

Hi hackers,
I found a problem when executing the plpython function:
After the plpython function returns an error, in the same session, if we
continue to execute
plpython function, the server panic will be caused.

*Reproduce*
preparation

SET max_parallel_workers_per_gather=4;
SET parallel_setup_cost=1;
SET min_parallel_table_scan_size ='4kB';

CREATE TABLE t(i int);
INSERT INTO t SELECT generate_series(1, 10000);

CREATE EXTENSION plpython3u;
CREATE OR REPLACE FUNCTION test_func() RETURNS SETOF int AS
$$
plpy.execute("select pg_backend_pid()")

for i in range(0, 5):
yield (i)

$$ LANGUAGE plpython3u parallel safe;

execute the function twice in the same session

postgres=# SELECT test_func() from t where i>10 and i<100;
ERROR: error fetching next item from iterator
DETAIL: Exception: cannot start subtransactions during a parallel
operation
CONTEXT: Traceback (most recent call last):
PL/Python function "test_func"

postgres=# SELECT test_func();
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

*Analysis*

- There is an SPI call in test_func(): plpy.execute().
- Then the server will start a subtransaction by
PLy_spi_subtransaction_begin(); BUT! The Python processor does not know
whether an error happened during PLy_spi_subtransaction_begin().
- If there is an error that occurs in PLy_spi_subtransaction_begin(),
the SPI call will be terminated but the python error indicator won't be set
and the PyObject won't be free.
- Then the next plpython UDF in the same session will fail due to the
wrong Python environment.

*Solution*
Use try-catch to catch the error that occurs in
PLy_spi_subtransaction_begin(), and set the python error indicator.

With Regards
Hao Zhang

Attachments:

v1-0001-Fix-bug-plpython-function-causes-server-panic.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Fix-bug-plpython-function-causes-server-panic.patchDownload+42-14
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hao Zhang (#1)
Re: [PATCH] plpython function causes server panic

Hao Zhang <zhrt1446384557@gmail.com> writes:

I found a problem when executing the plpython function:
After the plpython function returns an error, in the same session, if we
continue to execute
plpython function, the server panic will be caused.

Thanks for the report! I see the problem is that we're not expecting
BeginInternalSubTransaction to fail. However, I'm not sure I like
this solution, mainly because it's only covering a fraction of the
problem. There are similarly unsafe usages in plperl, pltcl, and
very possibly a lot of third-party PLs. I wonder if there's a way
to deal with this issue without changing these API assumptions.

The only readily-reachable error case in BeginInternalSubTransaction
is this specific one about IsInParallelMode, which was added later
than the original design and evidently with not a lot of thought or
testing. The comment for it speculates about whether we could get
rid of it, so I wonder if our thoughts about this ought to go in that
direction.

In any case, if we do proceed along the lines of catching errors
from BeginInternalSubTransaction, I think your patch is a bit shy
of a load because it doesn't do all the same things that other callers
of PLy_spi_exception_set do. Looking at that made me wonder why
the PLy_spi_exceptions lookup business was being duplicated by every
caller rather than being done once in PLy_spi_exception_set. So
0001 attached is a quick refactoring patch to remove that code
duplication, and then 0002 is your patch adapted to that.

I also attempted to include a test case in 0002, but I'm not very
satisfied with that. Your original test case seemed pretty expensive
for the amount of code coverage it adds, so I tried to make it happen
with debug_parallel_query instead. That does exercise the new code,
but it does not exhibit the crash if run against unpatched code.
That's because with this test case the error is only thrown in worker
processes not the leader, so we don't end up with corrupted Python
state in the leader. That result also points up that the original
test case isn't very reliable for this either: you have to have
parallel_leader_participation on, and you have to have the leader
process at least one row, which makes it pretty timing-sensitive.
On top of all that, the test would become useless if we do eventually
get rid of the !IsInParallelMode restriction. So I'm kind of inclined
to not bother with a test case if this gets to be committed in this
form.

Thoughts anyone?

regards, tom lane

Attachments:

v2-0001-simplify-PLy_spi_exception_set-API.patchtext/x-diff; charset=us-ascii; name=v2-0001-simplify-PLy_spi_exception_set-API.patchDownload+14-38
v2-0002-fix-plpython-subtrans-start-failure.patchtext/x-diff; charset=us-ascii; name=v2-0002-fix-plpython-subtrans-start-failure.patchDownload+95-13
#3Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: [PATCH] plpython function causes server panic

Hi,

On 2023-12-01 20:04:15 -0500, Tom Lane wrote:

Hao Zhang <zhrt1446384557@gmail.com> writes:

I found a problem when executing the plpython function:
After the plpython function returns an error, in the same session, if we
continue to execute
plpython function, the server panic will be caused.

Thanks for the report! I see the problem is that we're not expecting
BeginInternalSubTransaction to fail. However, I'm not sure I like
this solution, mainly because it's only covering a fraction of the
problem. There are similarly unsafe usages in plperl, pltcl, and
very possibly a lot of third-party PLs. I wonder if there's a way
to deal with this issue without changing these API assumptions.

There are plenty other uses, but it's not clear to me that they are similarly
affected by BeginInternalSubTransaction raising an error? It e.g. doesn't
immediately look like plperl's usage would be affected in a similar way?

Greetings,

Andres Freund

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: [PATCH] plpython function causes server panic

Andres Freund <andres@anarazel.de> writes:

On 2023-12-01 20:04:15 -0500, Tom Lane wrote:

Thanks for the report! I see the problem is that we're not expecting
BeginInternalSubTransaction to fail. However, I'm not sure I like
this solution, mainly because it's only covering a fraction of the
problem. There are similarly unsafe usages in plperl, pltcl, and
very possibly a lot of third-party PLs. I wonder if there's a way
to deal with this issue without changing these API assumptions.

There are plenty other uses, but it's not clear to me that they are similarly
affected by BeginInternalSubTransaction raising an error? It e.g. doesn't
immediately look like plperl's usage would be affected in a similar way?

Why not? We'd be longjmp'ing out from inside the Perl interpreter.
Maybe Perl is so robust it doesn't care, but I'd be surprised if this
can't break it. The same for Tcl.

I think that plpgsql indeed doesn't care because it has no non-PG
interpreter state to worry about. But it's in the minority I fear.

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: [PATCH] plpython function causes server panic

I wrote:

The only readily-reachable error case in BeginInternalSubTransaction
is this specific one about IsInParallelMode, which was added later
than the original design and evidently with not a lot of thought or
testing. The comment for it speculates about whether we could get
rid of it, so I wonder if our thoughts about this ought to go in that
direction.

After thinking a bit more I wonder why we need that error check at all.
Why isn't it sufficient to rely on GetNewTransactionId()'s check that
throws an error if a parallelized subtransaction tries to obtain an XID?
I don't see why we'd need to "synchronize transaction state" about
anything that never acquires an XID.

regards, tom lane

#6Hao Zhang
zhrt1446384557@gmail.com
In reply to: Tom Lane (#5)
Re: [PATCH] plpython function causes server panic

Thanks for your reply. These patches look good to me!

The only readily-reachable error case in BeginInternalSubTransaction
is this specific one about IsInParallelMode, which was added later
than the original design and evidently with not a lot of thought or
testing. The comment for it speculates about whether we could get
rid of it, so I wonder if our thoughts about this ought to go in that
direction.

IMHO, there are other error reports in the function
BeginInternalSubTransaction(), like
```
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory"),
errdetail("Failed on request of size %zu in memory context
\"%s\".",
size, context->name)));
```
we cannot avoid this crash by just getting rid of IsInParallelMode().

And in my test, the server won't crash in the plperl test.

With regards,
Hao Zhang

Tom Lane <tgl@sss.pgh.pa.us> 于2023年12月2日周六 09:51写道:

Show quoted text

I wrote:

The only readily-reachable error case in BeginInternalSubTransaction
is this specific one about IsInParallelMode, which was added later
than the original design and evidently with not a lot of thought or
testing. The comment for it speculates about whether we could get
rid of it, so I wonder if our thoughts about this ought to go in that
direction.

After thinking a bit more I wonder why we need that error check at all.
Why isn't it sufficient to rely on GetNewTransactionId()'s check that
throws an error if a parallelized subtransaction tries to obtain an XID?
I don't see why we'd need to "synchronize transaction state" about
anything that never acquires an XID.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hao Zhang (#6)
Re: [PATCH] plpython function causes server panic

Hao Zhang <zhrt1446384557@gmail.com> writes:

The only readily-reachable error case in BeginInternalSubTransaction
is this specific one about IsInParallelMode, which was added later
than the original design and evidently with not a lot of thought or
testing. The comment for it speculates about whether we could get
rid of it, so I wonder if our thoughts about this ought to go in that
direction.

IMHO, there are other error reports in the function
BeginInternalSubTransaction(), like

Sure, but all the other ones are extremely hard to hit, which is why
we didn't bother to worry about them to begin with. If we want to
make this more formally bulletproof, my inclination would be to
(a) get rid of the IsInParallelMode restriction and then (b) turn
the function into a critical section, so that any other error gets
treated as a PANIC. Maybe at some point we'd be willing to make a
variant of BeginInternalSubTransaction that has a different API and
can manage such cases without a PANIC, but that seems far down the
road to me, and certainly not something to be back-patched.

The main reason for my caution here is that, by catching an error
and allowing Python (or Perl, or something else) code to decide
what to do next, we are very dependent on that code doing the right
thing. This is already a bit of a leap of faith for run-of-the-mill
errors. For errors in transaction startup or shutdown, I think it's
a bigger leap than I care to make. We're pretty well hosed if we
can't make the transaction machinery work, so imagining that we can
clean up after such an error and march merrily onwards seems mighty
optimistic.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: [PATCH] plpython function causes server panic

I wrote:

Hao Zhang <zhrt1446384557@gmail.com> writes:

IMHO, there are other error reports in the function
BeginInternalSubTransaction(), like

Sure, but all the other ones are extremely hard to hit, which is why
we didn't bother to worry about them to begin with. If we want to
make this more formally bulletproof, my inclination would be to
(a) get rid of the IsInParallelMode restriction and then (b) turn
the function into a critical section, so that any other error gets
treated as a PANIC.

Here's a draft patch along this line. Basically the idea is that
subtransactions used for error control are now legal in parallel
mode (including in parallel workers) so long as they don't try to
acquire their own XIDs. I had to clean up some error handling
in xact.c, but really this is a pretty simple patch.

Rather than a true critical section (ie PANIC on failure), it seemed
to me to be enough to force FATAL exit if BeginInternalSubTransaction
fails. Given the likelihood that our transaction state is messed up
if we get a failure partway through, it's not clear to me that we
could do much better than that even if we were willing to make an API
change for BeginInternalSubTransaction.

I haven't thought hard about what new test cases we might want to
add for this. It gets through check-world as is, meaning that
nobody has made any test cases exercising the previous restrictions
either. There might be more documentation work to be done, too.

regards, tom lane

Attachments:

v1-allow-subxacts-in-parallel-workers.patchtext/x-diff; charset=us-ascii; name=v1-allow-subxacts-in-parallel-workers.patchDownload+44-42
#9Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: [PATCH] plpython function causes server panic

On Fri, Dec 29, 2023 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Here's a draft patch along this line. Basically the idea is that
subtransactions used for error control are now legal in parallel
mode (including in parallel workers) so long as they don't try to
acquire their own XIDs. I had to clean up some error handling
in xact.c, but really this is a pretty simple patch.

I agree with the general direction. A few comments:

- Isn't it redundant to test if IsInParallelMode() ||
IsParallelWorker()? We can't be in a parallel worker without also
being in parallel mode, except during the worker startup sequence.

- I don't think the documentation changes are entirely accurate. The
whole point of the patch is to allow parallel workers to make changes
to the transaction state, but the documentation says you can't. Maybe
we should just delete "change the transaction state" entirely from the
list of things that you're not allowed to do, since "write to the
database" is already listed separately; or maybe we should replace it
with something like "assign new transaction IDs or command IDs,"
although that's kind of low-level. I don't think we should just delete
the "even temporarily" bit, as you've done.

- While I like the new comments in BeginInternalSubTransaction(), I
think the changes in ReleaseCurrentSubTransaction() and
RollbackAndReleaseCurrentSubTransaction() need more thought. For one
thing, it's got to be wildly optimistic to claim that we would have
caught *anything* that's forbidden in parallel mode; that would
require solving the halting problem. I'd rather have no comment at all
here than one making such an ambitious claim, and I think that might
be a fine way to go. But if we do have a comment, I think it should be
more narrowly focused e.g. "We do not check for parallel mode here.
It's permissible to start and end subtransactions while in parallel
mode, as long as no new XIDs or command IDs are assigned." One
additional thing that might (or might not) be worth mentioning or
checking for here is that the leader shouldn't try to reduce the
height of the transaction state stack to anything less than what it
was when the parallel operation started; if it wants to do that, it
needs to clean up the parallel workers and exit parallel mode first.
Similarly a worker shouldn't ever end the toplevel transaction except
during backend cleanup.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: [PATCH] plpython function causes server panic

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Dec 29, 2023 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Here's a draft patch along this line. Basically the idea is that
subtransactions used for error control are now legal in parallel
mode (including in parallel workers) so long as they don't try to
acquire their own XIDs. I had to clean up some error handling
in xact.c, but really this is a pretty simple patch.

I agree with the general direction. A few comments:

Thanks for looking at this! I was hoping you'd review it, because
I thought there was a pretty significant chance that I'd missed some
fundamental reason it couldn't work. I feel better now about it
being worth pursuing.

I consider the patch draft quality at this point: I didn't spend
much effort on docs or comments, and none on test cases. I'll
work on those issues and come back with a v2.

regards, tom lane

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: [PATCH] plpython function causes server panic

Robert Haas <robertmhaas@gmail.com> writes:

I agree with the general direction. A few comments:

- Isn't it redundant to test if IsInParallelMode() ||
IsParallelWorker()? We can't be in a parallel worker without also
being in parallel mode, except during the worker startup sequence.

Hmm. The existing code in AssignTransactionId and
CommandCounterIncrement tests both, so I figured that the conservative
course was to make DefineSavepoint and friends test both. Are you
saying AssignTransactionId and CommandCounterIncrement are wrong?
If you're saying you don't believe that these routines are reachable
during parallel worker start, that could be true, but I'm not sure
I want to make that assumption. In any case, surely the xxxSavepoint
routines are not hot enough to make it an interesting
micro-optimization. (Perhaps it is worthwhile in AssignTransactionId
and CCI, but changing those seems like a job for another patch.)

regards, tom lane

#12Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: [PATCH] plpython function causes server panic

On Fri, Mar 22, 2024 at 1:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I agree with the general direction. A few comments:

- Isn't it redundant to test if IsInParallelMode() ||
IsParallelWorker()? We can't be in a parallel worker without also
being in parallel mode, except during the worker startup sequence.

Hmm. The existing code in AssignTransactionId and
CommandCounterIncrement tests both, so I figured that the conservative
course was to make DefineSavepoint and friends test both. Are you
saying AssignTransactionId and CommandCounterIncrement are wrong?
If you're saying you don't believe that these routines are reachable
during parallel worker start, that could be true, but I'm not sure
I want to make that assumption. In any case, surely the xxxSavepoint
routines are not hot enough to make it an interesting
micro-optimization. (Perhaps it is worthwhile in AssignTransactionId
and CCI, but changing those seems like a job for another patch.)

Yeah, that's all fair enough. I went back and looked at the history of
this and found 94b4f7e2a635c3027a23b07086f740615b56aa64.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: [PATCH] plpython function causes server panic

Robert Haas <robertmhaas@gmail.com> writes:

- I don't think the documentation changes are entirely accurate. The
whole point of the patch is to allow parallel workers to make changes
to the transaction state, but the documentation says you can't. Maybe
we should just delete "change the transaction state" entirely from the
list of things that you're not allowed to do, since "write to the
database" is already listed separately; or maybe we should replace it
with something like "assign new transaction IDs or command IDs,"
although that's kind of low-level. I don't think we should just delete
the "even temporarily" bit, as you've done.

Fair enough. In the attached v2, I wrote "change the transaction
state (other than by using a subtransaction for error recovery)";
what do you think of that?

I dug around in the docs and couldn't really find anything about
parallel-query transaction limitations other than this bit in
parallel.sgml and the more or less copy-pasted text in
create_function.sgml; did you have any other spots in mind?
(I did find the commentary in README.parallel, but that's not
exactly user-facing.)

- While I like the new comments in BeginInternalSubTransaction(), I
think the changes in ReleaseCurrentSubTransaction() and
RollbackAndReleaseCurrentSubTransaction() need more thought.

Yah. After studying the code a bit more, I realized that what
I'd done would cause IsInParallelMode() to start returning false
during a subtransaction within parallel mode, which is surely not
what we want. That state has to be heritable into subtransactions
in some fashion. The attached keeps the current semantics of
parallelModeLevel and adds a bool parallelChildXact field that is
true if any outer transaction level has nonzero parallelModeLevel.
That's possibly more general than we need today, but it seems like
a reasonably clean definition.

One additional thing that might (or might not) be worth mentioning or
checking for here is that the leader shouldn't try to reduce the
height of the transaction state stack to anything less than what it
was when the parallel operation started; if it wants to do that, it
needs to clean up the parallel workers and exit parallel mode first.
Similarly a worker shouldn't ever end the toplevel transaction except
during backend cleanup.

I think these things are already dealt with. However, one thing
worth questioning is that CommitSubTransaction() will just silently
kill any workers started during the current subxact, and likewise
CommitTransaction() zaps workers without complaint. Shouldn't these
instead throw an error about how you didn't close parallel mode,
and then the corresponding Abort function does the cleanup?
I did not change that behavior here, but it seems dubious.

v2 attached works a bit harder on the comments and adds a simplistic
test case. I feel that I don't want to incorporate the plpython
crash that started this thread, as it's weird and dependent on
Python code outside our control (though I have checked that we
don't crash on that anymore).

regards, tom lane

Attachments:

v2-allow-subxacts-in-parallel-workers.patchtext/x-diff; charset=us-ascii; name=v2-allow-subxacts-in-parallel-workers.patchDownload+154-73
#14Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: [PATCH] plpython function causes server panic

On Fri, Mar 22, 2024 at 4:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Fair enough. In the attached v2, I wrote "change the transaction
state (other than by using a subtransaction for error recovery)";
what do you think of that?

I think that's pretty good. I wonder if there are some bizarre cases
where the patch would allow slightly more than that ... who is to say
that you must pop the subtransaction you pushed? But that sort of
pedantry is probably not worth worrying about for purposes of the
documentation, especially because such a thing might not be a very
good idea anyway.

I dug around in the docs and couldn't really find anything about
parallel-query transaction limitations other than this bit in
parallel.sgml and the more or less copy-pasted text in
create_function.sgml; did you have any other spots in mind?
(I did find the commentary in README.parallel, but that's not
exactly user-facing.)

I don't have anything else in mind at the moment.

I think these things are already dealt with. However, one thing
worth questioning is that CommitSubTransaction() will just silently
kill any workers started during the current subxact, and likewise
CommitTransaction() zaps workers without complaint. Shouldn't these
instead throw an error about how you didn't close parallel mode,
and then the corresponding Abort function does the cleanup?
I did not change that behavior here, but it seems dubious.

I'm not sure. I definitely knew when I wrote this code that we often
emit warnings about resources that aren't cleaned up at (sub)commit
time rather than just silently releasing them, and I feel like the
fact that I didn't implement that behavior here was probably a
deliberate choice to avoid some problem. But I have no memory of what
that problem was, and it is entirely possible that it was eliminated
at some later phase of development. I think that decision was made
quite early, before much of anything was working.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#14)
Re: [PATCH] plpython function causes server panic

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 22, 2024 at 4:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think these things are already dealt with. However, one thing
worth questioning is that CommitSubTransaction() will just silently
kill any workers started during the current subxact, and likewise
CommitTransaction() zaps workers without complaint. Shouldn't these
instead throw an error about how you didn't close parallel mode,
and then the corresponding Abort function does the cleanup?
I did not change that behavior here, but it seems dubious.

I'm not sure. I definitely knew when I wrote this code that we often
emit warnings about resources that aren't cleaned up at (sub)commit
time rather than just silently releasing them, and I feel like the
fact that I didn't implement that behavior here was probably a
deliberate choice to avoid some problem.

Ah, right, it's reasonable to consider this an end-of-xact resource
leak, which we generally handle with WARNING not ERROR. And I see
that AtEOXact_Parallel and AtEOSubXact_Parallel already do

if (isCommit)
elog(WARNING, "leaked parallel context");

However, the calling logic seems a bit shy of a load, in that it
trusts IsInParallelMode() completely to decide whether to check for
leaked parallel contexts. So we'd miss the case where somebody did
ExitParallelMode without having cleaned up workers. It's not like
AtEOXact_Parallel and AtEOSubXact_Parallel cost a lot when they have
nothing to do, so I think we should call them unconditionally, and
separately from that issue a warning if parallelModeLevel isn't zero
(and we're committing).

regards, tom lane

Attachments:

v3-allow-subxacts-in-parallel-workers.patchtext/x-diff; charset=us-ascii; name=v3-allow-subxacts-in-parallel-workers.patchDownload+181-88
#16Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: [PATCH] plpython function causes server panic

On Sat, Mar 23, 2024 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, the calling logic seems a bit shy of a load, in that it
trusts IsInParallelMode() completely to decide whether to check for
leaked parallel contexts. So we'd miss the case where somebody did
ExitParallelMode without having cleaned up workers. It's not like
AtEOXact_Parallel and AtEOSubXact_Parallel cost a lot when they have
nothing to do, so I think we should call them unconditionally, and
separately from that issue a warning if parallelModeLevel isn't zero
(and we're committing).

I wasn't worried about this case when I wrote this code. The general
flow that I anticipated was that somebody would run a query, and
ExecMain.c would enter parallel mode, and then maybe eventually reach
some SQL-callable C function that hadn't gotten the memo about
parallel query but had been mistakenly labelled as PARALLEL RESTRICTED
or PARALLEL SAFE when it wasn't really, and so the goal was for core
functions that such a function might reasonably attempt to call to
notice that something bad was happening.

But if the user puts a call to ExitParallelMode() inside such a
function, it's hard to imagine what goal they have other than to
deliberately circumvent the safeguards. And they're always going to be
able to do that somehow, if they're coding in C. So I'm not convinced
that the sanity checks you've added are really going to do anything
other than burn a handful of CPU cycles. If there's some plausible
case in which they protect us against a user who has legitimately made
an error, fine; but if we're just wandering down the slippery slope of
believing we can defend against malicious C code, we absolutely should
not do that, not even a little bit. The first CPU instruction we burn
in the service of a hopeless cause is already one too many.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#16)
Re: [PATCH] plpython function causes server panic

Robert Haas <robertmhaas@gmail.com> writes:

On Sat, Mar 23, 2024 at 12:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, the calling logic seems a bit shy of a load, in that it
trusts IsInParallelMode() completely to decide whether to check for
leaked parallel contexts. So we'd miss the case where somebody did
ExitParallelMode without having cleaned up workers.

But if the user puts a call to ExitParallelMode() inside such a
function, it's hard to imagine what goal they have other than to
deliberately circumvent the safeguards. And they're always going to be
able to do that somehow, if they're coding in C. So I'm not convinced
that the sanity checks you've added are really going to do anything
other than burn a handful of CPU cycles. If there's some plausible
case in which they protect us against a user who has legitimately made
an error, fine; but if we're just wandering down the slippery slope of
believing we can defend against malicious C code, we absolutely should
not do that, not even a little bit. The first CPU instruction we burn
in the service of a hopeless cause is already one too many.

By that logic, we should rip out every Assert in the system, as well
as all of the (extensive) resource leak checking that already happens
during CommitTransaction. We've always felt that those leak checks
were worth the cost to help us find bugs --- which they have done and
still do from time to time. I don't see why this case is different,
especially when the added cost compared to HEAD is not much more than
one C function call.

Or in other words: the point is not about stopping malicious C code,
it's about recognizing that we make mistakes.

regards, tom lane

#18Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#17)
Re: [PATCH] plpython function causes server panic

On Mon, Mar 25, 2024 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

By that logic, we should rip out every Assert in the system, as well
as all of the (extensive) resource leak checking that already happens
during CommitTransaction. We've always felt that those leak checks
were worth the cost to help us find bugs --- which they have done and
still do from time to time. I don't see why this case is different,
especially when the added cost compared to HEAD is not much more than
one C function call.

Well, I explained why *I* thought it was different, but obviously you
don't agree.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#18)
Re: [PATCH] plpython function causes server panic

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Mar 25, 2024 at 11:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

... I don't see why this case is different,
especially when the added cost compared to HEAD is not much more than
one C function call.

Well, I explained why *I* thought it was different, but obviously you
don't agree.

After mulling it over for awhile, I still think the extra checking
is appropriate, especially since this patch is enlarging the set of
things that can happen in parallel mode. How do you want to proceed?

regards, tom lane

#20Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#19)
Re: [PATCH] plpython function causes server panic

On Wed, Mar 27, 2024 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

After mulling it over for awhile, I still think the extra checking
is appropriate, especially since this patch is enlarging the set of
things that can happen in parallel mode. How do you want to proceed?

I sort of assumed you were going to commit the patch as you had it.
I'm not a huge fan of that, but I don't think that's it's catastrophe,
either. It pains me a bit to add CPU cycles that I consider
unnecessary to a very frequently taken code path, but as you say, it's
not a lot of CPU cycles, so maybe nobody will ever notice. I actually
really wish we could find some way of making subtransactions
significantly lighter-wait, because I think the cost of spinning up
and tearing down a trivial subtransaction is a real performance
problem, but fixing that is probably a pretty hard problem whether
this patch gets committed or not.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#25)