How to solve the problem of one backend process crashing and causing other processes to restart?

Started by yuansong, about 2 years ago, 9 messages
#1 yuansong
yyuansong@126.com

In PostgreSQL, when a backend process crashes, it forces the other backend processes to restart as well, primarily to ensure data consistency. I understand that the correct approach is to analyze the crash, identify its cause, and fix it. However, it is also important to be able to handle a backend process crash without affecting the operation of the other processes, thus minimizing the scope of the negative impact and improving availability. To achieve this goal, could we mimic Oracle's design by introducing a "pmon"-style process dedicated to rolling back the crashed process's transaction and cleaning up its resources? I wonder if anyone has attempted such a strategy, or whether there have been previous discussions on this topic.

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: yuansong (#1)
Re: How to solve the problem of one backend process crashing and causing other processes to restart?

yuansong <yyuansong@126.com> writes:

In PostgreSQL, when a backend process crashes, it forces the other backend processes to restart as well, primarily to ensure data consistency. I understand that the correct approach is to analyze the crash, identify its cause, and fix it. However, it is also important to be able to handle a backend process crash without affecting the operation of the other processes, thus minimizing the scope of the negative impact and improving availability. To achieve this goal, could we mimic Oracle's design by introducing a "pmon"-style process dedicated to rolling back the crashed process's transaction and cleaning up its resources? I wonder if anyone has attempted such a strategy, or whether there have been previous discussions on this topic.

The reason we force a database-wide restart is that there's no way to
be certain that the crashed process didn't corrupt anything in shared
memory. (Even with the forced restart, there's a window where bad
data could reach disk before we kill off the other processes that
might write it. But at least it's a short window.) "Corruption"
here doesn't just involve bad data placed into disk buffers; more
often it's things like unreleased locks, which would block other
processes indefinitely.
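
As a rough illustration of what the surviving sessions experience, here is a
minimal Python/psycopg2 sketch; the DSN is hypothetical, and treating 57P02
(crash_shutdown) as the SQLSTATE surfaced to those sessions is an assumption,
not something stated above.

# Minimal sketch (assumed DSN, psycopg2 installed): what a healthy session
# sees when some other backend crashes and the postmaster reinitializes.
import psycopg2

DSN = "dbname=postgres user=postgres host=localhost"   # hypothetical

def query_once(sql):
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    except psycopg2.OperationalError as e:
        # Even though this backend did nothing wrong, the postmaster tells it
        # to exit so shared memory can be reinitialized; the client gets a
        # fatal error, not a rollback.  pgcode may also be None for purely
        # connection-level failures.
        if e.pgcode == "57P02":          # crash_shutdown (assumed)
            print("another backend crashed; the cluster restarted, reconnect needed")
        raise
    finally:
        conn.close()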

I seriously doubt that anything like what you're describing
could be made reliable enough to be acceptable. "Oracle does
it like this" isn't a counter-argument: they have a much different
(and non-extensible) architecture, and they also have an army of
programmers to deal with minutiae like undoing resource acquisition.
Even with that, you'd have to wonder about the number of bugs
existing in such necessarily-poorly-tested code paths.

regards, tom lane

#3 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#2)
Re: How to solve the problem of one backend process crashing and causing other processes to restart?

On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:

The reason we force a database-wide restart is that there's no way to
be certain that the crashed process didn't corrupt anything in shared
memory. (Even with the forced restart, there's a window where bad
data could reach disk before we kill off the other processes that
might write it. But at least it's a short window.) "Corruption"
here doesn't just involve bad data placed into disk buffers; more
often it's things like unreleased locks, which would block other
processes indefinitely.

I seriously doubt that anything like what you're describing
could be made reliable enough to be acceptable. "Oracle does
it like this" isn't a counter-argument: they have a much different
(and non-extensible) architecture, and they also have an army of
programmers to deal with minutiae like undoing resource acquisition.
Even with that, you'd have to wonder about the number of bugs
existing in such necessarily-poorly-tested code paths.

Yes.
I think that PostgreSQL's approach is superior: rather than investing in
code to mitigate the impact of data corruption caused by a crash, invest
in quality code that doesn't crash in the first place.

Euphemistically naming a crash "ORA-600 error" seems to be part of
their strategy.

Yours,
Laurenz Albe

#4 yuansong
yyuansong@126.com
In reply to: Laurenz Albe (#3)
Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?

Enhancing the overall fault tolerance of the system in this respect is quite important. No one can avoid bugs, and I don't believe that the current approach is the more advanced one. It might be worth considering adding this to the roadmap so that interested parties can conduct the relevant research.

The current major issue is that when one process crashes, resetting all connections has a significant impact on other connections. Is it possible to only disconnect the crashed connection and have the other connections simply roll back the current transaction without reconnecting? Perhaps this problem can be mitigated through the use of a connection pool.

If we want to implement this feature, would it be sufficient to clean up or restore the shared memory and disk changes caused by the crashed backend? Besides clearing various known locks, what else needs to be changed? Does anyone have any insights? Please help me. Thank you.
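
To make the connection-pool mitigation mentioned above concrete, here is a
rough client-side sketch using psycopg2's built-in pool; the DSN, pool size,
and retry parameters are illustrative assumptions, and the pool can only hand
out fresh connections after the restart, it cannot preserve the transactions
the other sessions had open.

# Sketch of client-side pooling around a crash-induced restart.  The pool
# hands out a fresh connection once the server is back, but any transaction
# that was open at crash time is simply gone.
import time
import psycopg2
from psycopg2 import pool

DSN = "dbname=postgres user=postgres host=localhost"   # hypothetical
pg_pool = pool.SimpleConnectionPool(1, 5, dsn=DSN)

def execute_with_pool(sql, retries=3, delay=1.0):
    last_error = None
    for _ in range(retries):
        conn = pg_pool.getconn()
        try:
            with conn, conn.cursor() as cur:   # 'with conn' commits or rolls back
                cur.execute(sql)
                rows = cur.fetchall()
            pg_pool.putconn(conn)
            return rows
        except (psycopg2.OperationalError, psycopg2.InterfaceError) as e:
            # The connection died (for example because some other backend
            # crashed and forced a restart): discard it and retry on a
            # fresh connection after a short pause.
            last_error = e
            pg_pool.putconn(conn, close=True)
            time.sleep(delay)
    raise last_error

Either way, a session whose transaction was in flight when the crash happened
still has to redo that work from the start.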

#5 Junwang Zhao
zhjwpku@gmail.com
In reply to: yuansong (#4)
Re: Re: How to solve the problem of one backend process crashing and causing other processes to restart?

On Mon, Nov 13, 2023 at 5:14 PM yuansong <yyuansong@126.com> wrote:

Enhancing the overall fault tolerance of the system in this respect is quite important. No one can avoid bugs, and I don't believe that the current approach is the more advanced one. It might be worth considering adding this to the roadmap so that interested parties can conduct the relevant research.

The current major issue is that when one process crashes, resetting all connections has a significant impact on other connections. Is it possible to only disconnect the crashed connection and have the other connections simply roll back the current transaction without reconnecting? Perhaps this problem can be mitigated through the use of a connection pool.

It's not about the other connections; it's that the crashed connection
has no way to roll back.

--
Regards
Junwang Zhao

#6 Joe Conway
mail@joeconway.com
In reply to: Laurenz Albe (#3)
Re: How to solve the problem of one backend process crashing and causing other processes to restart?

On 11/13/23 00:53, Laurenz Albe wrote:

I think that PostgreSQL's approach is superior: rather than investing in
code to mitigate the impact of data corruption caused by a crash, invest
in quality code that doesn't crash in the first place.

While true, this does nothing to prevent OOM kills, which are becoming
more prevalent as, for example, running Postgres in a container (or
otherwise) with a cgroup memory limit becomes more popular.
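
As one hedged illustration of that failure mode, a sketch like the following
can watch the container's memory limit from inside the cgroup; the paths
assume the cgroup v2 unified hierarchy, and the 80% threshold is arbitrary.

# Sketch: warn before the cgroup limit is hit, since an OOM kill of any one
# backend triggers the same cluster-wide restart discussed above.
from pathlib import Path

def cgroup_memory_pressure(cgroup_dir="/sys/fs/cgroup"):
    # Fraction of the cgroup v2 memory limit currently in use,
    # or None when no limit is configured.
    limit_raw = Path(cgroup_dir, "memory.max").read_text().strip()
    if limit_raw == "max":
        return None
    current = int(Path(cgroup_dir, "memory.current").read_text().strip())
    return current / int(limit_raw)

if __name__ == "__main__":
    pressure = cgroup_memory_pressure()
    if pressure is not None and pressure > 0.8:
        print(f"warning: {pressure:.0%} of the cgroup memory limit in use; "
              "an OOM kill of any backend would force a full restart")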

And in any case, there are enterprise use cases that necessarily avoid
Postgres due to this behavior, which is a shame.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#7 Thomas wen
Thomas_valentine_365@outlook.com
In reply to: yuansong (#4)
Re: Re: Re: How to solve the problem of one backend process crashing and causing other processes to restart?

Hi yuansong
There is a connection pool patch (https://commitfest.postgresql.org/34/3043/), but it has been dormant for a few years. You can check that patch for what you need.

#8 Merlin Moncure
mmoncure@gmail.com
In reply to: yuansong (#4)
Re: Re: How to solve the problem of one backend process crashing and causing other processes to restart?

One thing that's really key to understand about postgres is that there is
a different set of rules regarding what is the database's job to solve vs
what belongs to supporting libraries and frameworks. It isn't that hard to
wait and retry a query in most applications, and it is up to you to do
that. There are also various connection poolers that might implement retry
logic, and not having to work through those concerns keeps the code lean
and has other benefits. While postgres might implement things like a
built-in connection pooler, O_DIRECT-style memory management, and things
like that, there are long-term costs to doing them.
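
A minimal sketch of that wait-and-retry pattern, assuming psycopg2 and an
illustrative DSN; retrying is only safe here because the example query is
idempotent.

# Sketch: rerun an idempotent query after the connection is lost, e.g.
# because the server went through a crash-induced restart.
import functools
import time
import psycopg2

DSN = "dbname=postgres user=postgres host=localhost"   # hypothetical

def retry_on_lost_connection(attempts=3, backoff=0.5):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except (psycopg2.OperationalError, psycopg2.InterfaceError):
                    if attempt == attempts:
                        raise
                    time.sleep(backoff * attempt)   # crude backoff while the server recovers
        return wrapper
    return decorate

@retry_on_lost_connection()
def active_backends():
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            return cur.fetchone()[0]
    finally:
        conn.close()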

There's another side to this. Suppose I had to choose between a
hypothetical postgres that had some kind of process local crash recovery
and the current implementation. I might still choose the current
implementation because, in general, crashes are good, and the full reset
has a much better chance of clearing the underlying issue that caused the
problem vs managing the symptoms of it.

merlin

#9 yuansong
yyuansong@126.com
In reply to: Thomas wen (#7)
Re:Re:Re: How to solve the problem of one backend process crashing and causing other processes to restart?

Thanks. After reconsideration, I realized that what I really want is for other connections to remain unaffected when a process crashes. This is something that a connection pool cannot solve.
