pgsql: Add pg_terminate_backend() to allow terminating only a single

Started by Bruce Momjian about 18 years ago · 35 messages · pgsql-hackers
#1 Bruce Momjian
bruce@momjian.us

Log Message:
-----------
Add pg_terminate_backend() to allow terminating only a single session.

Modified Files:
--------------
pgsql/doc/src/sgml:
func.sgml (r1.430 -> r1.431)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/func.sgml?r1=1.430&r2=1.431)
runtime.sgml (r1.411 -> r1.412)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/runtime.sgml?r1=1.411&r2=1.412)
pgsql/src/backend/tcop:
postgres.c (r1.548 -> r1.549)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/tcop/postgres.c?r1=1.548&r2=1.549)
pgsql/src/backend/utils/adt:
misc.c (r1.59 -> r1.60)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/adt/misc.c?r1=1.59&r2=1.60)
pgsql/src/include/catalog:
pg_proc.h (r1.489 -> r1.490)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/catalog/pg_proc.h?r1=1.489&r2=1.490)
pgsql/src/include/storage:
proc.h (r1.104 -> r1.105)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/storage/proc.h?r1=1.104&r2=1.105)
pgsql/src/include/utils:
builtins.h (r1.312 -> r1.313)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/utils/builtins.h?r1=1.312&r2=1.313)

#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#1)
Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

Bruce Momjian wrote:

Log Message:
-----------
Add pg_terminate_backend() to allow terminating only a single session.

I wonder if it's OK to grab an LWLock in a signal handler.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3 Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#2)
Re: Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

Alvaro Herrera wrote:

Bruce Momjian wrote:

Log Message:
-----------
Add pg_terminate_backend() to allow terminating only a single session.

I wonder if it's OK to grab an LWLock in a signal handler.

I am not in a signal handler there --- pg_terminate_backend() happens
for the person requesting the termination, not the terminated backend
that gets the signal.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#4 Gregory Stark
stark@enterprisedb.com
In reply to: Bruce Momjian (#3)
Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

"Bruce Momjian" <bruce@momjian.us> writes:

Log Message:
-----------
Add pg_terminate_backend() to allow terminating only a single session.

I'm interested in this because I was already looking for a solution to the
"out of signals" problem we have.

I think we could expand this by having a bunch of boolean flags, one each for
different conditions including the sinval processing conditions, interrupt,
info, and terminate. (Any more?)

The two things we would have to check to be sure of are:

1) Do we care about how many times events are processed? Ie, if you press
interrupt twice is it important that that be handled as an interrupt twice? It
doesn't work that way currently for interrupt but are any of the other
conditions sensitive to this? I don't think so.

2) Do we care what order things happen in? Ie, if you send an info request and
then a cancel request is it ok if the cancel is handled first. I don't see why
not myself. And if it's a terminate request we *clearly* don't want to bother
handling any other events first.

It seems to me we could replace all of the above with either SIGINT or USR1
and have a bunch of boolean flags in MyProc. I'm not sure of the implication
for sinval processing of having to get a whole bunch of LWLocks though.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

#5 Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#4)
pg_terminate_backend() idea

Gregory Stark wrote:

Add pg_terminate_backend() to allow terminating only a single session.

I'm interested in this because I was already looking for a solution to the
"out of signals" problem we have.

I think we could expand this by having a bunch of boolean flags, one each for
different conditions including the sinval processing conditions, interrupt,
info, and terminate. (Any more?)

The two things we would have to check to be sure of are:

1) Do we care about how many times events are processed? Ie, if you press
interrupt twice is it important that that be handled as an interrupt twice? It
doesn't work that way currently for interrupt but are any of the other
conditions sensitive to this? I don't think so.

Not that I can think of.

2) Do we care what order things happen in? Ie, if you send an info request and
then a cancel request is it ok if the cancel is handled first. I don't see why
not myself. And if it's a terminate request we *clearly* don't want to bother
handling any other events first.

I don't think we care.

It seems to me we could replace all of the above with either SIGINT or USR1
and have a bunch of boolean flags in MyProc. I'm not sure of the implication
for sinval processing of having to get a whole bunch of LWLocks though.

Keep in mind PGPROC->terminate was easier because once it was set it was
never cleared. If you need to clear the flag and keep going the code is
going to need to be a little more sophisticated to avoid lost interrupts.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
Re: Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

Bruce Momjian <bruce@momjian.us> writes:

Alvaro Herrera wrote:

I wonder if it's OK to grab an LWLock in a signal handler.

I am not in a signal handler there --- pg_terminate_backend() happens
for the person requesting the termination, not the terminated backend
that gets the signal.

A more interesting question is what makes you think that taking
ProcArrayLock here has any value whatsoever.

regards, tom lane

#7 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#6)
Re: Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Alvaro Herrera wrote:

I wonder if it's OK to grab an LWLock in a signal handler.

I am not in a signal handler there --- pg_terminate_backend() happens
for the person requesting the termination, not the terminated backend
that gets the signal.

A more interesting question is what makes you think that taking
ProcArrayLock here has any value whatsoever.

I took the lock so I would be sure the PGPROC array was the matching pid
and not some other pid that was created between my check and the setting
of the variable.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#7)
Re: Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

A more interesting question is what makes you think that taking
ProcArrayLock here has any value whatsoever.

I took the lock so I would be sure the PGPROC array was the matching pid
and not some other pid that was created between my check and the setting
of the variable.

Unfortunately, that doesn't work at all, even disregarding the fact that
you forgot to make proc.c clear the flag when setting up a new process's
PGPROC.

The race condition would go away if ProcArrayRemove zeroed the pid
field, but I'm afraid that that might break something else; in
particular I think we still need to be able to manipulate LWLocks after
ProcArrayRemove, so I suspect this is not safe either.

I think the only really safe way to do it is to make a new procarray.c
function that searches like BackendPidGetProc, but hands the proc back
with the ProcArrayLock still held. Or maybe we should just redefine
BackendPidGetProc that way.

There are other race conditions in this patch too; notably that lots of
places feel free to clear QueryCancelPending on-the-fly, thus possibly
letting them disregard a terminate signal.

Even assuming that we can fix the garden-variety bugs like these,
there's still a fundamental problem that an uncooperative user function
(particularly one in plperl/pltcl/plpython) can indefinitely delay
response to pg_terminate_backend. Maybe that's okay, seeing that it can
similarly hold off or disregard QueryCancel, but I'm not sure the people
asking for this are really gonna be satisfied. (One thing we should
consider is making ERRCODE_ADMIN_SHUTDOWN unconditionally untrappable
by plpgsql exception blocks, which'd at least fix the issue for plpgsql
functions.)

I'm also thinking that this doesn't fix the problem that we're relying
on die() to not leave shared memory in a bad state. All it does is
restrict things so that we only try that from the main loop rather than
at any random CHECK_FOR_INTERRUPTS. That certainly reduces the odds of
hitting a bug of this type, but I don't see that it can be said to make
them zero.

All in all, this patch wasn't ready to apply without review. I'm
currently looking to see whether it's salvageable or not.

regards, tom lane

#9 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#8)
Re: Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

A more interesting question is what makes you think that taking
ProcArrayLock here has any value whatsoever.

I took the lock so I would be sure the PGPROC array was the matching pid
and not some other pid that was created between my check and the setting
of the variable.

Unfortunately, that doesn't work at all, even disregarding the fact that
you forgot to make proc.c clear the flag when setting up a new process's
PGPROC.

I assume the PGPROC was cleared with zeros on start. I can fix that.

The race condition would go away if ProcArrayRemove zeroed the pid
field, but I'm afraid that that might break something else; in
particular I think we still need to be able to manipulate LWLocks after
ProcArrayRemove, so I suspect this is not safe either.

Oh, good point. The PGPROC might be dead already but the pid is still
there. Don't we mark it dead somehow? Is it only because it isn't in
ProcArrayStruct that it is dead?

I think the only really safe way to do it is to make a new procarray.c
function that searches like BackendPidGetProc, but hands the proc back
with the ProcArrayLock still held. Or maybe we should just redefine
BackendPidGetProc that way.

Yea, we could do that and it would clean up my code.
BackendPidGetProc() doesn't get called in many places. I was a little
worried about looping over ProcArrayStruct while holding an exclusive
lock.

There are other race conditions in this patch too; notably that lots of
places feel free to clear QueryCancelPending on-the-fly, thus possibly
letting them disregard a terminate signal.

True.

Even assuming that we can fix the garden-variety bugs like these,
there's still a fundamental problem that an uncooperative user function
(particularly one in plperl/pltcl/plpython) can indefinitely delay
response to pg_terminate_backend. Maybe that's okay, seeing that it can
similarly hold off or disregard QueryCancel, but I'm not sure the people
asking for this are really gonna be satisfied. (One thing we should
consider is making ERRCODE_ADMIN_SHUTDOWN unconditionally untrappable
by plpgsql exception blocks, which'd at least fix the issue for plpgsql
functions.)

They are going to be more satisfied than they are now. This is the
first listed Admin TODO item. Every other database allows this simple
capability and we should support it. Cancel is not the same as exit.

I'm also thinking that this doesn't fix the problem that we're relying
on die() to not leave shared memory in a bad state. All it does is
restrict things so that we only try that from the main loop rather than
at any random CHECK_FOR_INTERRUPTS. That certainly reduces the odds of
hitting a bug of this type, but I don't see that it can be said to make
them zero.

Well, OK, can we use proc_exit() instead? That seems fine too and
probably better than die().

All in all, this patch wasn't ready to apply without review. I'm
currently looking to see whether it's salvageable or not.

OK.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#8)
Re: Re: [COMMITTERS] pgsql: Add pg_terminate_backend() to allow terminating only a single

I wrote:

All in all, this patch wasn't ready to apply without review. I'm
currently looking to see whether it's salvageable or not.

After further study, I've concluded that it is in fact not salvageable,
and I respectfully request that it be reverted.

I was able to get things to more or less work most of the time with
three or four additional ugly hacks in postgres.c, but I still don't
have any great confidence that there aren't windows where a terminate
request wouldn't be ignored (even without considering the uncooperative-
function scenario). Moreover it's entirely unclear that this approach
actually dodges any of the hypothetical bugs in SIGTERM handling.
Given the complexity that it adds, I think it's fair to say that this
approach is probably buggier than the other way.

I think if we want pg_terminate_backend, we have to do it the way that
was originally discussed: have it issue SIGTERM and fix whatever needs
to be fixed in the SIGTERM code path. As the TODO list used to say
before it got editorialized upon, this is mostly a matter of testing
(something that I can tell this patch was sadly lacking :-().
We do need also to go around and fix any places that think a PG_TRY
block is a sufficient way to clean up state in shared memory --- cf
this AFAIK-still-unfixed bug:
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00218.php

regards, tom lane

#11 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#10)
pg_terminate_backend() issues

Tom Lane wrote:

I wrote:

All in all, this patch wasn't ready to apply without review. I'm
currently looking to see whether it's salvageable or not.

After further study, I've concluded that it is in fact not salvageable,
and I respectfully request that it be reverted.

OK, reverted.

I was able to get things to more or less work most of the time with
three or four additional ugly hacks in postgres.c, but I still don't
have any great confidence that there aren't windows where a terminate
request wouldn't be ignored (even without considering the uncooperative-
function scenario). Moreover it's entirely unclear that this approach
actually dodges any of the hypothetical bugs in SIGTERM handling.

I don't understand. If we call proc_exit(0) instead, it is the same as
someone exiting the backend via PQfinish().

Given the complexity that it adds, I think it's fair to say that this
approach is probably buggier than the other way.

I think if we want pg_terminate_backend, we have to do it the way that
was originally discussed: have it issue SIGTERM and fix whatever needs
to be fixed in the SIGTERM code path. As the TODO list used to say
before it got editorialized upon, this is mostly a matter of testing
(something that I can tell this patch was sadly lacking :-().
We do need also to go around and fix any places that think a PG_TRY
block is a sufficient way to clean up state in shared memory --- cf
this AFAIK-still-unfixed bug:
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00218.php

Well, with no movement on this TODO item since it was removed in 8.0, I
am willing to give users something that works most of the time. As you
said already, cancel isn't going to be 100% reliable anyway because of
places we can't check for cancel.

I will work on a new version of the patch that people can see.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#11)
Re: pg_terminate_backend() issues

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

I was able to get things to more or less work most of the time with
three or four additional ugly hacks in postgres.c, but I still don't
have any great confidence that there aren't windows where a terminate
request wouldn't be ignored (even without considering the uncooperative-
function scenario). Moreover it's entirely unclear that this approach
actually dodges any of the hypothetical bugs in SIGTERM handling.

I don't understand. If we call proc_exit(0) instead, it is the same as
someone exiting the backend via PQfinish().

Well, using proc_exit() instead of die() might have alleviated that
particular objection, but there are still the others.

I think if we want pg_terminate_backend, we have to do it the way that
was originally discussed: have it issue SIGTERM and fix whatever needs
to be fixed in the SIGTERM code path.

Well, with no movement on this TODO item since it was removed in 8.0, I
am willing to give users something that works most of the time.

If the users want it so bad, why has no one stepped up to do the
testing?

regards, tom lane

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#12)
Re: pg_terminate_backend() issues

I did a quick grep for PG_CATCH uses to see what we have along the lines
of this bug:
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00218.php

In current sources there are three places at risk:

btbulkdelete, as noted in the above message
pg_start_backup needs to reset forcePageWrites = false
createdb wants to UnlockSharedObject and delete any already-copied files

In createdb, the Unlock actually doesn't need to get cleaned up, since
transaction abort would release the lock anyway. But leaving a possibly
large mess of orphaned files doesn't seem nice.

ISTM that there will be more cases like this in future, so we need a
general solution anyway. I propose the following sort of code structure
for these situations:

on_shmem_exit(cleanup_routine, arg);
PG_TRY();
{
    ... do something ...
    cancel_shmem_exit(cleanup_routine, arg);
}
PG_CATCH();
{
    cancel_shmem_exit(cleanup_routine, arg);
    cleanup_routine(arg);
    PG_RE_THROW();
}
PG_END_TRY();

where cancel_shmem_exit is defined to pop the latest shmem_exit_list
entry if it matches the passed arguments (I don't see any particular
need to search further than that in the list). This structure
guarantees that cleanup_routine will be called on the way out of
either a plain ERROR or a FATAL exit.

Thoughts? We clearly must do something about this before we can even
think of calling retail SIGTERM a supported feature ...

regards, tom lane

#14 Gregory Stark
stark@enterprisedb.com
In reply to: Bruce Momjian (#5)
Re: pg_terminate_backend() idea

"Bruce Momjian" <bruce@momjian.us> writes:

Gregory Stark wrote:

It seems to me we could replace all of the above with either SIGINT or USR1
and have a bunch of boolean flags in MyProc. I'm not sure of the implication
for sinval processing of having to get a whole bunch of LWLocks though.

Keep in mind PGPROC->terminate was easier because once it was set it was
never cleared. If you need to clear the flag and keep going the code is
going to need to be a little more sophisticated to avoid lost interrupts.

That's kind of the point. I don't think we care about lost interrupts for any
of these things. As long as we know a sinval flush happened I think we don't
care how many more happened. Or a cancel request or terminate request.

For sinval of course it would be important to clear the flags (with MyProc
locked) prior to processing the queue in case one arrives as soon as we're
finished. But I think that's the only detail.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

#15 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#13)
Re: pg_terminate_backend() issues

Tom Lane wrote:

I did a quick grep for PG_CATCH uses to see what we have along the lines
of this bug:
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00218.php

In current sources there are three places at risk:

btbulkdelete, as noted in the above message
pg_start_backup needs to reset forcePageWrites = false
createdb wants to UnlockSharedObject and delete any already-copied files

In createdb, the Unlock actually doesn't need to get cleaned up, since
transaction abort would release the lock anyway. But leaving a possibly
large mess of orphaned files doesn't seem nice.

ISTM that there will be more cases like this in future, so we need a
general solution anyway. I propose the following sort of code structure
for these situations:

on_shmem_exit(cleanup_routine, arg);
PG_TRY();
{
    ... do something ...
    cancel_shmem_exit(cleanup_routine, arg);
}
PG_CATCH();
{
    cancel_shmem_exit(cleanup_routine, arg);
    cleanup_routine(arg);
    PG_RE_THROW();
}
PG_END_TRY();

where cancel_shmem_exit is defined to pop the latest shmem_exit_list
entry if it matches the passed arguments (I don't see any particular
need to search further than that in the list). This structure
guarantees that cleanup_routine will be called on the way out of
either a plain ERROR or a FATAL exit.

Thoughts? We clearly must do something about this before we can even
think of calling retail SIGTERM a supported feature ...

Here is a little different idea. Seems we want something more like:

PG_ON_EXIT();
{
    cleanup_routine(arg);
}

PG_TRY();
{
    ... do something ...
}
PG_CATCH();
{
    PG_RE_THROW();
}
PG_END_TRY();

PG_ON_EXIT() would place cleanup_routine() into the shmem_exit_list and
it would be called at the end of PG_CATCH and removed from
shmem_exit_list by PG_END_TRY(). Another idea would be to define the
PG_CATCH() first so it would be put in the shmem_exit_list before the
TRY was tried, but I assume you don't want all those functions to run on
exit.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#15)
Re: pg_terminate_backend() issues

Bruce Momjian <bruce@momjian.us> writes:

Here is a little different idea. Seems we want something more like:

Yeah, I thought about building a set of macros that would have this
ability folded in, but it didn't seem like much notational advantage
given that you have to write the cleanup routine out-of-line anyway.

Also, until we have a lot more than three use-sites, more macros
don't seem to be worth the trouble.

regards, tom lane

#17 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
Re: pg_terminate_backend() issues

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Here is a little different idea. Seems we want something more like:

Yeah, I thought about building a set of macros that would have this
ability folded in, but it didn't seem like much notational advantage
given that you have to write the cleanup routine out-of-line anyway.

Yea, agreed. I am a little worried about the out-of-line issue because
some cases might become more complex because you have to pass lots of
parameters, perhaps.

Also, until we have a lot more than three use-sites, more macros
don't seem to be worth the trouble.

Oh, only three use sites. I thought we had three known-wrong sites but
that more would have to be changed. Yea, go for it!

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#18 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#12)
Re: pg_terminate_backend() issues

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

I think if we want pg_terminate_backend, we have to do it the way that
was originally discussed: have it issue SIGTERM and fix whatever needs
to be fixed in the SIGTERM code path.

Well, with no movement on this TODO item since it was removed in 8.0, I
am willing to give users something that works most of the time.

If the users want it so bad, why has no one stepped up to do the
testing?

Good question. Tom and I talked about this on the phone today.

I think the problem is testing to try to prove the lack of a bug. How
long does someone test to know they have reasonably proven a bug doesn't
exist?

I think the other problem is what to test. One SIGTERM problem that was
reported was related to schema changes. Certainly that has to be
tested, but what else: prepared statements, queries, HOT updates,
connect/disconnect, notify? There are lots of places to test and it is
hard to know which ones, and for how long.

I am starting to think we need to analyze the source code rather than
testing, because of what we are testing for.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#19 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#12)
Re: pg_terminate_backend() issues

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Tom Lane wrote:

I was able to get things to more or less work most of the time with
three or four additional ugly hacks in postgres.c, but I still don't
have any great confidence that there aren't windows where a terminate
request wouldn't be ignored (even without considering the uncooperative-
function scenario). Moreover it's entirely unclear that this approach
actually dodges any of the hypothetical bugs in SIGTERM handling.

I don't understand. If we call proc_exit(0) instead, it is the same as
someone exiting the backend via PQfinish().

Well, using proc_exit() instead of die() might have alleviated that
particular objection, but there are still the others.

Tom, thanks for your feedback. It was very helpful.

I have adjusted the patch to return a locked PGPROC entry, initialize
the PGPROC->terminate boolean, called proc_exit(), and moved the
PGPROC->terminate test out of the cancel loop entirely so lost cancel
signals are not as big a problem anymore.

I agree it would be best for pg_terminate_backend() and for the
postmaster use of SIGTERM if SIGTERM was more reliable about cleanup, so
hopefully that will work for 8.4. I have attached my patch in hopes we
can use it as a backup in case we can't get SIGTERM to work for
individual backends for 8.4.

[ I have posted about testing in a separate email.]

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/pgpatches/terminate (text/plain, +97 -66)
#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)
Re: pg_terminate_backend() issues

Bruce Momjian <bruce@momjian.us> writes:

I am starting to think we need to analyze the source code rather than
testing, because of what we are testing for.

I was thinking about that a bit more, in connection with trying to
verbalize why I don't like your patch ;-)

The fundamental assumption here is that we think that proc_exit() from
the main loop of PostgresMain() should be safe, because that's exercised
anytime a client quits or dies unexpectedly, and so it should be a
well-tested path. What is lacking such test coverage is the many code
paths that start proc_exit() from any random CHECK_FOR_INTERRUPTS()
point in the backend. Some of those points might be (in fact are known
to be, as of today) holding locks or otherwise in a state that prevents
a clean proc_exit().

Your patch proposes to replace that code path by one that fakes a
QueryCancel error and then does proc_exit() once control reaches the
outer level. While that certainly *should* work (ignoring the question
of whether control gets to the outer level in any reasonable amount of
time), it's a bit of a stretch to claim that it's a well-tested case.
At the moment it would only arise if a QueryCancel interrupt arrived at
approximately the same time that the client died or sent a disconnect
message, which is surely far less probable than client death by itself.
So I don't buy the argument that this is a better-tested path, and
I definitely don't want to introduce new complexity in order to make it
happen like that.

Now, as to whether a SIGTERM-style exit is safe. With the exception
of the PG_CATCH issues that we already know about, any other case seems
like a clear coding error: you'd have to be doing something that you
know could throw an error, without having made provision to clean up
whatever unclean state you are in. It'd be nice to think that we
haven't been that silly anywhere. Experience however teaches that such
an assumption is hubris.

I think that the way forward is to fix the PG_CATCH issues (which I will
try to get done later this week) and then get someone to mount a serious
effort to break it, as outlined in previous discussions:
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00174.php

I'm willing to enable a SIGTERM-based pg_terminate_backend for 8.4
if there is some reasonable amount of testing done during this
development cycle to try to expose any problems. If no one is willing
to do any such testing, then as far as I'm concerned there is no market
for the feature and we should drop it from TODO.

regards, tom lane

#21 Magnus Hagander
magnus@hagander.net
In reply to: Bruce Momjian (#18)
#22 Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#20)
#23 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#13)
#24 Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#23)
#25 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#23)
#26 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#25)
#27 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#26)
#28 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#27)
#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#28)
#30 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#20)
#31 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#22)
#32 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#31)
#33 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#32)
#34 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#33)
#35 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#33)