Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

Started by Roman Eskin · 12 months ago · 4 messages
#1 Roman Eskin
r.eskin@arenadata.io
2 attachment(s)

Hello pgsql-hackers,

There appears to be an issue with timeline switching in archive
recovery mode if the startup process is interrupted for some external
reason (e.g. system power loss) in a specific tight time window
(after the new history file has been created but before the
end-of-recovery record is written). If this happens and the Postgres
instance is launched again, it performs ordinary crash recovery,
successfully reaches the 'in production' state, but continues on the
previous timeline. We thus end up with an inconsistent WAL state: WAL
segment files are being written on the old timeline, while a new,
orphaned timeline is present in the history file.

This issue can reveal itself in a streaming replication scenario, for
example:
1. Assume we have instance_1 (Master) and instance_2 (Replica);
2. instance_1 (Master) fails for any reason, and instance_2 (Replica)
starts to promote. While instance_2 is still in `StartupXLOG()`,
somewhere between `writeTimeLineHistory()` and
`CreateEndOfRecoveryRecord()`, instance_2's server suffers, for
example, a power loss.
3. Once instance_2's server is up again, instance_2 is restarted. The
timeline problem already exists at this point, but it is very likely
to go unnoticed, since instance_2 (now Master) is accepting
connections and working.
4. Bring back instance_1 as Replica.
5. Switch instance_1 and instance_2 back to the original
configuration. Here, when we try to start instance_2 as Replica,
we'll get a FATAL:
"FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
history of the requested timeline, the server forked off from that
timeline at 0/3023538."

I agree that one has to be very unlucky to get into such a situation,
with two consecutive failures and hitting that specific time window,
but if it happens, it looks quite severe and is hard to debug. We
faced this situation in our synthetic tests.
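
To make the ordering concrete, here is a tiny standalone C model of
the window (this is NOT PostgreSQL code; the file names and commands,
e.g. 'pg_control.toy', '00000002.history.toy' and 'promote-crash', are
invented purely for illustration):

/*
 * Toy model of the window described above.  The new timeline's history
 * file becomes durable first, and only the later end-of-recovery step
 * moves the "current timeline" forward.  A crash in between leaves an
 * orphaned history file, while plain crash recovery keeps running on
 * the old timeline.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
write_file(const char *name, const char *content)
{
	FILE	   *f = fopen(name, "w");

	if (f == NULL || fputs(content, f) == EOF || fclose(f) != 0)
	{
		perror(name);
		exit(1);
	}
}

int
main(int argc, char **argv)
{
	if (argc < 2)
	{
		fprintf(stderr, "usage: %s promote-crash | promote-ok | restart\n", argv[0]);
		return 1;
	}

	if (strncmp(argv[1], "promote", 7) == 0)
	{
		/* Step 1: create (and, in Postgres, fsync) the new history file. */
		write_file("00000002.history.toy", "1\t0/3023538\tpromotion\n");

		/* The window: history file exists, end-of-recovery record does not. */
		if (strcmp(argv[1], "promote-crash") == 0)
		{
			fprintf(stderr, "simulated power loss inside the window\n");
			abort();
		}

		/* Step 2: end-of-recovery record; only now we switch timelines. */
		write_file("pg_control.toy", "timeline 2\n");
		printf("promotion complete, now on timeline 2\n");
	}
	else if (strcmp(argv[1], "restart") == 0)
	{
		FILE	   *ctl = fopen("pg_control.toy", "r");
		FILE	   *hist = fopen("00000002.history.toy", "r");
		char		line[64] = "timeline 1\n";

		if (ctl != NULL)
		{
			if (fgets(line, sizeof(line), ctl) == NULL)
				strcpy(line, "timeline 1\n");
			fclose(ctl);
		}
		/* Plain crash recovery just continues on whatever timeline we had. */
		printf("crash recovery done, continuing on %s", line);
		printf("00000002.history.toy is %s\n",
			   hist != NULL ? "present (orphaned)" : "absent");
		if (hist != NULL)
			fclose(hist);
	}
	return 0;
}

Running 'promote-crash' and then 'restart' reports that we continue on
timeline 1 while 00000002.history.toy is left behind, which is exactly
the inconsistent state described above.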

To reproduce the issue reliably on the latest code from the Postgres
master branch, I used the attached
'test-code-for-timeline-issue.patch'. It postpones execution in
`StartupXLOG()` so that the required time window can be hit. The
detailed reproduction steps I used with the patch are in the attached
file 'steps - NG scenario.txt'.

A possible fix that we are now considering is to postpone the removal
of the signal files ("recovery.signal" and "standby.signal") until
after the end-of-recovery record is written, so that an interrupted
promotion is retried if the previous attempt failed. The only minor
negative outcome of this approach seems to be the possible assignment
of one extra timeline ID. But we would like to get more feedback from
the community. Do you foresee any drawbacks to this approach? Please
advise.
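
In terms of the same toy model as above (again, not the actual xlog.c
change; 'recovery.signal.toy' is an invented name), the suggested
ordering would look roughly like this:

/*
 * Sketch of the suggested ordering, again as a standalone toy and NOT
 * the real xlog.c change: the signal file is the last thing to go away,
 * so a crash at any earlier point leaves it behind and the next startup
 * runs archive recovery / promotion again instead of silently
 * continuing on the old timeline.
 */
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	FILE	   *sig = fopen("recovery.signal.toy", "r");

	if (sig == NULL)
	{
		printf("no signal file: plain crash recovery, keep current timeline\n");
		return 0;
	}
	fclose(sig);

	printf("signal file present: pick a new timeline\n");
	printf("write and fsync the new timeline history file\n");
	printf("write the end-of-recovery record (or checkpoint)\n");

	/*
	 * Only now remove the signal file.  A crash before this point means
	 * the branch above is taken again on the next start, at the cost of
	 * possibly allocating one more timeline ID.
	 */
	if (unlink("recovery.signal.toy") != 0)
		perror("unlink recovery.signal.toy");
	printf("promotion durable: signal file removed\n");
	return 0;
}

The point is only the order: the signal file outlives both the history
file and the end-of-recovery record, so any crash inside the old window
leads to a retried promotion rather than a silent continuation of the
old timeline.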

Many thanks in advance!

P.S. The topic above is a bit similar to this old one -
/messages/by-id/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com.

Best regards,
Roman Eskin

Attachments:

steps - NG scenario.txt (text/plain)
test-code-for-timeline-issue.patch (text/plain)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901..bdfc92cd6e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6126,6 +6126,19 @@ StartupXLOG(void)
 	Insert->fullPageWrites = lastFullPageWrites;
 	UpdateFullPageWrites();
 
+	/*
+	 * Code below is for test purposes only to reveal possible timeline issue.
+	 */
+	{
+		struct stat st;
+
+		elog(LOG, "Start delay in StartupXLOG()");
+
+		while (stat("delay_recovery.signal", &st) == 0)
+			pg_usleep(10000L); /* sleep for 10 ms*/
+
+		elog(LOG, "Stop delay in StartupXLOG()");
+	}
 	/*
 	 * Emit checkpoint or end-of-recovery record in XLOG, if required.
 	 */
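
For clarity, the delay loop in the patch simply parks StartupXLOG()
while a file named delay_recovery.signal exists in the server's working
directory (normally the data directory); creating that file before
triggering promotion and then killing or power-cycling the server while
it is parked hits the window deterministically. The exact sequence I
used is in the attached steps file.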
#2 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Roman Eskin (#1)
Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

Hi Roman!
Thanks for raising the issue. I think the root cause is that many systems assume that a higher timeline ID means a more recent timeline write. That invariant does not hold; it does not even mean a more recent timeline start.
"latest timeline" effectively means "random timeline".

On 17 Jan 2025, at 06:05, Roman Eskin <r.eskin@arenadata.io> wrote:

5. Switch instance_1 and instance_2 back to the original
configuration. Here, when we try to start instance_2 as Replica,
we'll get a FATAL:
"FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
history of the requested timeline, the server forked off from that
timeline at 0/3023538."

I think here you can just specify a target timeline for the standby instance_1, and it will continue recovery from instance_2.
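
For example, something along these lines on instance_1 (illustrative
only; whether 'latest' or an explicit timeline ID is appropriate
depends on the actual state of the cluster):

# postgresql.conf on instance_1 (the standby being re-attached)
recovery_target_timeline = 'latest'	# or an explicit timeline ID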

Having said that, I must admit that we observe something similar approximately twice a week; we have tried several fixes but still have to live with it.
In our case we have a "resetup" cron job which automatically rebuilds the replica from a backup if Postgres cannot start recovery for some hours.
So in our case this looks like an extra 3 hours of standby downtime.

I'm not sure whether this is a result of pgconsul not setting a target timeline or some other error...

Persisting the recovery signal file for some _timeout_ seems super dangerous to me. In distributed systems, every extra _timeout_ is a source of complexity, uncertainty and despair.

Thanks!

Best regards, Andrey Borodin.

#3 Roman Eskin
r.eskin@arenadata.io
In reply to: Andrey M. Borodin (#2)
Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

Hi Andrey,

Thank you for your feedback!

I think here you can just specify a target timeline for the standby instance_1, and it will continue recovery from instance_2.

Most likely yes, but nevertheless it looks more like a workaround.

Persisting the recovery signal file for some _timeout_ seems super dangerous to me. In distributed systems, every extra _timeout_ is a source of complexity, uncertainty and despair.

The approach is not about persisting the signal files for some timeout.
Currently the files are removed in StartupXLOG() before
writeTimeLineHistory() and PerformRecoveryXLogAction() are called. The
suggestion is to move the file removal after PerformRecoveryXLogAction()
inside StartupXLOG().

Best regards,
Roman Eskin

#4 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Roman Eskin (#3)
Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

On 21 Jan 2025, at 16:47, Roman Eskin <r.eskin@arenadata.io> wrote:

Persisting the recovery signal file for some _timeout_ seems super dangerous to me. In distributed systems, every extra _timeout_ is a source of complexity, uncertainty and despair.

The approach is not about persisting the signal files for some timeout. Currently the files are removed in StartupXLOG() before writeTimeLineHistory() and PerformRecoveryXLogAction() are called. The suggestion is to move the file removal after PerformRecoveryXLogAction() inside StartupXLOG().

Sending a node into a repeated promote-fail cycle without resolving the root cause seems like an even less appealing idea.
If something prevented promotion, why should we retry it with this particular method?

Even in the case of the transient failure you described, a power loss, it does not sound like a very good idea to retry promotion after coming back online. The user will get an unexpected split-brain.

Best regards, Andrey Borodin.