RE: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea
I think we'd be lots better off to abandon the notion that we can exit
directly from the SIGTERM interrupt handler, and instead treat SIGTERM
the same way we treat QueryCancel: set a flag that is inspected at
specific places where we know we are in a good state.Comments?
This will be much cleaner.
Vadim
-----Original Message-----
From: Mikheev, VadimI think we'd be lots better off to abandon the notion that we can exit
directly from the SIGTERM interrupt handler, and instead treat SIGTERM
the same way we treat QueryCancel: set a flag that is inspected at
specific places where we know we are in a good state.Comments?
This will be much cleaner.
Hmm, CancelQuery isn't so urgent an operation currently.
For example, VACUUM checks QueryCancel flag only
once per table.
Regards.
Hiroshi Inoue
I think we'd be lots better off to abandon the notion that we can exit
directly from the SIGTERM interrupt handler, and instead treat SIGTERM
the same way we treat QueryCancel: set a flag that is inspected at
specific places where we know we are in a good state.Comments?
This will be much cleaner.
OK, here's a more detailed proposal.
We'll eliminate elog() directly from the signal handlers, except in the
extremely limited case where the main line is waiting for a lock (the
existing "cancel wait for lock" mechanism for QueryCancel can be used
for SIGTERM as well, I think). Otherwise, both die() and
QueryCancelHandler() will just set flags and return. handle_warn()
(the SIGQUIT handler) should probably go away entirely; it does nothing
that's not done better by QueryCancel. I'm inclined to make SIGQUIT
invoke die() the same as SIGTERM does, unless someone has a better idea
what to do with that signal.
I believe we should keep the "critical section count" mechanism that
Vadim already created, even though it is no longer needed to discourage
the signal handlers themselves from aborting processing. With the
critical section counter, it is OK to have flag checks inside subroutines
that are safe to abort from in some contexts and not others --- the
contexts that don't want an abort just have to do START/END_CRIT_SECTION
to ensure that QueryCancel/ProcDie won't happen in routines they call.
So the basic flag checks are like "if (InterruptFlagSet &&
CritSectionCounter == 0) elog(...);".
Having done that, the $64 question is where to test the flags.
Obviously we can check for interrupts at all the places where
QueryCancel is tested now, which is primarily the outer loops.
I suggest we also check for interrupts in s_lock's wait loop (ie, we can
cancel/die if we haven't yet got the lock AND we are not in a crit
section), as well as in END_CRIT_SECTION.
I intend to devise a macro CHECK_FOR_INTERRUPTS() that embodies
all the test code, rather than duplicating these if-tests in
many places.
Note I am assuming that it's always reasonable to check for QueryCancel
and ProcDie at the same places. I do not see any hole in that theory,
but if someone finds one, we could introduce additional flags/counters
to distinguish safe-to-cancel from safe-to-die states.
Comments?
regards, tom lane
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
Hmm, CancelQuery isn't so urgent an operation currently.
For example, VACUUM checks QueryCancel flag only
once per table.
Right, we'll need to check in more places. See my just-posted proposal.
Checking at any spinlock grab should ensure that we check reasonably
often.
I just realized I forgot to mention the case of SIGTERM while the main
line is waiting for input from the frontend --- obviously we want to
quit immediately in that case, too.
regards, tom lane
Note that elog(ERROR/FATAL) is changed to elog(STOP) if Critical
SectionCount > 0.Not in current sources ;-).
Perhaps Vadim will say that I broke his error scheme, but if so it's
his own fault for not documenting such delicate code at all.
Ok, it's my fault (though I placed NO ELOG(ERROR) comments everywhere
around critical xlog-related sections of code).
I believe he's out of town this weekend, so let's wait till he gets
back and then discuss it some more. Perhaps there is a need to
distinguish xlog-related critical sections from other ones, or
perhaps not.
Perhaps. For xlog-related code rule is simple - backend must not be
interrupted till changes are logged.
Vadim
Import Notes
Resolved by subject fallback
Because I think turning an elog(ERROR) into a system-wide crash is
not a good idea ;-). If you are correct that this behavior
is necessary for WAL-related critical sections, then indeed we need
two kinds of critical sections, one that just holds off cancel/die
response and one that turns elog(ERROR) into a dangerous weapon.
I'm going to wait and see Vadim's response before I do anything ...
I've tried to move "dangerous" ops with non-zero probability of
elog(ERROR) (eg new file block allocation) out of crit sections.
Anyway we need in ERROR-->STOP for safety when changes aren't logged.
Vadim
Import Notes
Resolved by subject fallback
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
Because I think turning an elog(ERROR) into a system-wide crash is
not a good idea ;-). If you are correct that this behavior
is necessary for WAL-related critical sections, then indeed we need
two kinds of critical sections, one that just holds off cancel/die
response and one that turns elog(ERROR) into a dangerous weapon.
I'm going to wait and see Vadim's response before I do anything ...
I've tried to move "dangerous" ops with non-zero probability of
elog(ERROR) (eg new file block allocation) out of crit sections.
Anyway we need in ERROR-->STOP for safety when changes aren't logged.
Why is that safer than just treating an ERROR as an ERROR? It seems to
me there's a real risk of a crash/restart loop if we force a restart
whenever we see an xlog-related problem.
regards, tom lane
I've tried to move "dangerous" ops with non-zero probability of
elog(ERROR) (eg new file block allocation) out of crit sections.
Anyway we need in ERROR-->STOP for safety when changes
aren't logged.Why is that safer than just treating an ERROR as an ERROR?
It seems to me there's a real risk of a crash/restart loop if we
force a restart whenever we see an xlog-related problem.
Why don't we elog(ERROR) in assert checking but abort?
Consider elog(STOP) on any errors inside critical sections
as assert checking. Rule is simple - validate operation before
applying it to permanent storage - and it's better to force
any future development to follow this rule by any means.
It's very easy to don't notice ERROR - it's just transaction
abort and transaction abort is normal thing, - but errors inside
critical sections are *unexpected* things which mean that something
totally wrong in code.
As for crash/restart loop, Hiroshi rised this issue ~month ago and I
was going to avoid elog(STOP) in AM-specific redo functions and do
elog(LOG) instead, wherever possible, but was busy with CRC/backup stuff
- ok, I'll look there soon.
Vadim
Import Notes
Resolved by subject fallback
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
It's very easy to don't notice ERROR - it's just transaction
abort and transaction abort is normal thing, - but errors inside
critical sections are *unexpected* things which mean that something
totally wrong in code.
Okay. That means we do need two kinds of critical sections, then,
because the crit sections I've just sprinkled everywhere are not that
critical ;-). They just want to hold off cancel/die interrupts.
I'll take care of fixing what I broke, but does anyone have suggestions
for good names for the two concepts? The best I could come up with
offhand is BEGIN/END_CRIT_SECTION and BEGIN/END_SUPER_CRIT_SECTION,
but I'm not pleased with that... Ideas?
regards, tom lane
I'll take care of fixing what I broke, but does anyone have
suggestions for good names for the two concepts? The best I could
come up with offhand is BEGIN/END_CRIT_SECTION
HOLD_INTERRUPTS
RESUME_INTERRUPTS
and BEGIN/END_SUPER_CRIT_SECTION, but I'm not pleased with that...
BEGIN_CRIT_SECTION
END_CRIT_SECTION
- as it was -:)
Vadim
Import Notes
Resolved by subject fallback
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
HOLD_INTERRUPTS
RESUME_INTERRUPTS
BEGIN_CRIT_SECTION
END_CRIT_SECTION
Works for me...
regards, tom lane
I'll take care of fixing what I broke, but does anyone have suggestions
for good names for the two concepts? The best I could come up with
offhand is BEGIN/END_CRIT_SECTION and BEGIN/END_SUPER_CRIT_SECTION,
but I'm not pleased with that... Ideas?
Let CRITICAL be critical. If the other section are there just to be
cautious. Then the name should represent that. While I like the
BEGIN/END_OH_MY_GOD_IF_THIS_GETS_INTERRUPTED_YOU_DONT_WANT_TO_KNOW
markers.. They are a little hard to work with.
Possibly try demoting the NON_CRITICAL_SECTIONS to something like the
following.
BEGIN/END_CAUTION_SECTION,
BEGIN/END_WATCH_SECTION