[Bug Fix] Standby may crash when switching over in certain special cases
Hi, hackers,
I've recently encountered an issue where a standby crashes when
reconnecting to a new primary after a switchover under certain conditions.
Here is the crash scenario, step by step:
1) We have three instances: one primary and two standbys (s1 and s2, both
using streaming replication).
2) The primary crashed when the standby’s `flushed_lsn` was at the
beginning of a WAL segment (e.g., `0/22000000`). Both s1 and s2 logged the
following:
```
FATAL: could not connect to the primary server...
LOG: waiting for WAL to become available at 0/220000ED
```
3) s1 was promoted to be the new primary and logged the following:
```
LOG: received promote request
LOG: redo done at 0/21FFFEE8
LOG: selected new timeline ID: 2
```
4) s2's `primary_conninfo` was updated to point to s1; s2 logged the
following:
```
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to ...
```
5) s2 began replication with s1 and attempted to fetch `0/22000000` on
timeline 2; s2 logged the following:
```
LOG: fetching timeline history file for timeline 2 from primary server
FATAL: could not start WAL streaming: ERROR: requested starting point
0/22000000 on timeline 1 is not this server's history
DETAIL: This server's history forked from timeline 1 at 0/21FFFFE0.
LOG: started streaming WAL from primary at 0/22000000 on timeline 2
```
6) A WAL record mismatch caused the walreceiver process to terminate; s2
logged the following:
```
LOG: invalid contrecord length 10 (expected 213) at 0/21FFFFE0
FATAL: terminating walreceiver process due to administrator command
```
7) s2 then attempted to fetch `0/21000000` on timeline 2. However, the
startup process tried to open a WAL file that had not been created yet,
leading to a crash:
```
PANIC: could not open file "pg_wal/000000020000000000000021": No such
file or directory
LOG: startup process was terminated by signal 6: Aborted
```
In this scenario, s2 attempts replication twice. The first attempt starts
from `0/22000000` on timeline 2 and sets `walrcv->flushedUpto` to
`0/22000000`; when the mismatch occurs, the walreceiver process is
terminated. The second attempt starts from `0/21000000` on timeline 2. The
startup process expects the corresponding WAL file to exist because
WalRcvFlushRecPtr() returns `0/22000000`, but the file is not found, causing
s2's startup process to crash.
I think recptr should be checked against walrcv->flushedUpto in
RequestXLogStreaming(), so I have created a patch for it.
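To make the failure mode concrete, here is a minimal, self-contained C
sketch. It is not PostgreSQL code: the struct, the helper names, the
simplified guard, and the 16 MB segment size are all assumptions used only
to illustrate how a stale flushedUpto can point the startup process at a
segment that was never written locally.
```
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)	/* assumed 16 MB segments */

typedef uint64_t XLogRecPtr;

/* hypothetical, stripped-down stand-in for the walreceiver's shared state */
typedef struct
{
	XLogRecPtr	receiveStart;	/* start point of the last streaming request */
	XLogRecPtr	flushedUpto;	/* what the startup process trusts as flushed */
} FakeWalRcv;

/*
 * Mirrors only the relevant part of the unpatched guard in
 * RequestXLogStreaming(): flushedUpto is initialized on the first request
 * and kept (possibly too high) on later requests.
 */
static void
request_streaming(FakeWalRcv *walrcv, XLogRecPtr recptr)
{
	if (walrcv->receiveStart == 0)
		walrcv->flushedUpto = recptr;
	walrcv->receiveStart = recptr;
}

/* in the reported scenario, only segment ...22 was ever created on s2 */
static bool
segment_exists(uint64_t segno)
{
	return segno == 0x22;
}

int
main(void)
{
	FakeWalRcv	walrcv = {0, 0};
	XLogRecPtr	wanted = 0x21000000UL;	/* 0/21000000 */

	/* first attempt: stream from 0/22000000; it fails against the new primary */
	request_streaming(&walrcv, 0x22000000UL);

	/* second attempt: restart from 0/21000000 on timeline 2 */
	request_streaming(&walrcv, wanted);

	/*
	 * The startup process assumes everything below flushedUpto is already
	 * in pg_wal, so it opens the segment holding 'wanted', which does not
	 * exist.
	 */
	if (wanted < walrcv.flushedUpto && !segment_exists(wanted / WAL_SEGMENT_SIZE))
		printf("PANIC: segment %" PRIX64 " missing although flushedUpto is %" PRIX64 "\n",
			   wanted / WAL_SEGMENT_SIZE, walrcv.flushedUpto);
	return 0;
}
```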
Best regards,
pixian shi
Attachments:
0001-Fix-walreceiver-set-incorrect-flushedUpto-when-switc.patch
From 26ea49eac8f4f16b896928dfb5be96504c057649 Mon Sep 17 00:00:00 2001
From: shipixian <shipixian_yewu@cmss.chinamobile.com>
Date: Mon, 19 Aug 2024 12:08:08 +0800
Subject: [PATCH] Fix walreceiver set incorrect flushedUpto when switching
primary
Signed-off-by: shipixian <shipixian_yewu@cmss.chinamobile.com>
---
src/backend/replication/walreceiverfuncs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 90798b9d53..bc2ae2ae61 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -298,10 +298,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
walrcv->startTime = now;
/*
- * If this is the first startup of walreceiver (on this timeline),
+ * If this is the first startup of walreceiver (on this timeline), or recptr precedes the last flushed position,
* initialize flushedUpto and latestChunkStart to the starting point.
*/
- if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+ if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli || recptr < walrcv->flushedUpto)
{
walrcv->flushedUpto = recptr;
walrcv->receivedTLI = tli;
--
2.39.3 (Apple Git-145)
Is s1 a cascading standby of s2? If, instead, s1 and s2 are each standbys of
the primary server, it is not surprising that s2 has progressed further than
s1 when the primary fails. I believe this is a case where you should use
pg_rewind. Even if flushedUpto is reset as proposed in your patch, s2 might
already have applied a WAL record that s1 has not processed yet, and there
would be no guarantee that subsequent applies succeed.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Thank you for your response. In my scenario, s1 and s2 are both standbys of
the primary server; s1 is a synchronous standby and s2 is an asynchronous
standby. You mentioned that if s2's replay progress is ahead of s1's,
pg_rewind should be used. However, what I'm trying to address is an issue
where s2 crashes during replay after s1 has been promoted to primary, even
though s2's progress hasn't surpassed s1's.
Regards,
Pixian Shi
I understand your point. It is odd that the standby server crashes when
replication fails, because the standby would keep retrying to get the next
record even in such a case.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Thanks for responding.
> It is odd that the standby server crashes when replication fails, because
> the standby would keep retrying to get the next record even in such a case.
As I mentioned earlier, when replication fails, the standby retries to
establish streaming replication. At this point, the value of
`walrcv->flushedUpto` is not necessarily the position actually flushed to
disk. However, the startup process mistakenly believes that the latest
flushed LSN is `walrcv->flushedUpto` and attempts to open the corresponding
WAL file, which doesn't exist, leading to a file open failure and causing
the startup process to PANIC.
Regards,
Pixian Shi
Although I've not tested it because I don't have a good way to reproduce the
problem, the patch seems to fix the issue by rewinding walrcv->flushedUpto
to the requested WAL location whenever streaming is requested to start.
However, this causes the primary to be asked to stream data from the
requested location again, even if the standby already has valid data. It
could be wasteful, for example, when applying a WAL record is delayed due to
a conflict with recovery or recovery_min_apply_delay.
It might be better if we could notice that the requested record is not in
the primary's timeline before requesting streaming, but I don't have a good
solution for now.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
> Although I've not tested it because I don't have a good way to reproduce
> the problem

I used GDB to reproduce the issue, killing the primary node with kill -9
when the standby's flushed_lsn was at the beginning of a WAL segment.

> However, this causes the primary to be asked to stream data from the
> requested location again, even if the standby already has valid data. It
> could be wasteful, for example, when applying a WAL record is delayed due
> to a conflict with recovery or recovery_min_apply_delay.
I had overlooked this problem; thanks for pointing that out.
I think that comparing walrcv->receiveStart with recptr would be a better
approach: if walrcv->receiveStart is greater than recptr, then reset
walrcv->flushedUpto. This way, we can prevent fetching WAL data from the
primary when the standby already has valid data. A new patch has been
submitted.
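For illustration, here is a rough, standalone comparison of the two guard
conditions (the v1 check against flushedUpto and the v2 check against
receiveStart). It is not PostgreSQL code, the LSN values are made up, and it
isolates only the newly added part of the condition, ignoring the existing
first-startup and timeline checks:
```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* v1: reset whenever the requested start point is behind the last flushed LSN */
static bool
reset_v1(XLogRecPtr recptr, XLogRecPtr receiveStart, XLogRecPtr flushedUpto)
{
	(void) receiveStart;
	return recptr < flushedUpto;
}

/* v2: reset only when the requested start point is behind the previous start */
static bool
reset_v2(XLogRecPtr recptr, XLogRecPtr receiveStart, XLogRecPtr flushedUpto)
{
	(void) flushedUpto;
	return recptr < receiveStart;
}

int
main(void)
{
	/* case A: the crash scenario (retry from the previous segment) */
	XLogRecPtr	a_start = 0x22000000UL, a_flushed = 0x22000000UL, a_req = 0x21000000UL;

	/* case B: request behind flushedUpto but not behind receiveStart,
	 * i.e. WAL the standby already has on disk */
	XLogRecPtr	b_start = 0x10000000UL, b_flushed = 0x30000000UL, b_req = 0x18000000UL;

	printf("case A: v1 resets=%d, v2 resets=%d\n",
		   reset_v1(a_req, a_start, a_flushed), reset_v2(a_req, a_start, a_flushed));
	printf("case B: v1 resets=%d, v2 resets=%d\n",
		   reset_v1(b_req, b_start, b_flushed), reset_v2(b_req, b_start, b_flushed));
	return 0;
}
```
Both conditions reset flushedUpto in case A, which avoids the crash, but
only the v1 condition also resets it in case B, which would make the standby
re-request WAL it already holds.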
Regards,
Pixian Shi
Attachments:
v2-0001-Fix-walreceiver-set-incorrect-flushedUpto-when-sw.patch
From fa95bc5f6ded9e155ce00d8e7630c58d7d4fbde9 Mon Sep 17 00:00:00 2001
From: shipixian <shipixian_yewu@cmss.chinamobile.com>
Date: Mon, 19 Aug 2024 12:08:08 +0800
Subject: [PATCH v2] Fix walreceiver set incorrect flushedUpto when switching
primary
Signed-off-by: shipixian <shipixian_yewu@cmss.chinamobile.com>
---
src/backend/replication/walreceiverfuncs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 90798b9d53..a8d3e14f55 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -298,10 +298,10 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
walrcv->startTime = now;
/*
- * If this is the first startup of walreceiver (on this timeline),
+ * If this is the first startup of walreceiver (on this timeline), or recptr precedes the previous receive start point,
* initialize flushedUpto and latestChunkStart to the starting point.
*/
- if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
+ if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli || recptr < walrcv->receiveStart)
{
walrcv->flushedUpto = recptr;
walrcv->receivedTLI = tli;
--
2.39.3 (Apple Git-145)
Hi,
I inserted the following code at line 2509 of walsender.c (v15.8) to
reproduce the issue:
```
	{
		/*
		 * Test-only injection: simulate a primary crash (PANIC) when this
		 * walsender and another active walsender both have their flush
		 * position exactly on a WAL segment boundary.
		 */
		WalSnd	   *walsnd = MyWalSnd;

		SpinLockAcquire(&walsnd->mutex);
		if (walsnd->flush % wal_segment_size == 0 &&
			walsnd->sentPtr == walsnd->flush &&
			walsnd->flush > wal_segment_size)
		{
			for (int i = 0; i < max_wal_senders; i++)
			{
				WalSnd	   *walsnder = &WalSndCtl->walsnds[i];

				if (walsnder->pid == walsnd->pid)
					continue;
				SpinLockAcquire(&walsnder->mutex);
				if (walsnder->pid == 0)
				{
					SpinLockRelease(&walsnder->mutex);
					continue;
				}
				if (walsnder->flush % wal_segment_size == 0 &&
					walsnder->flush > wal_segment_size)
					elog(PANIC, "simulate primary crashed, flushedlsn %X/%X",
						 LSN_FORMAT_ARGS(walsnder->flush));
				SpinLockRelease(&walsnder->mutex);
			}
		}
		SpinLockRelease(&walsnd->mutex);
	}
```
Regards,
Pixian Shi