Race condition in recovery?

Started by Dilip Kumarover 5 years ago131 messageshackers

dilipbalaut@gmail.com

over 5 years ago

While analyzing one of the customer issues, based on the log it
appeared that there is a race condition in the recovery process.

The summary of the issue is, that one of the standby is promoted as
the new primary (Node2) and another standby (Node3) is restarted and
set the primary_info and the restore_command so that it can
stream/restore from Node2 (new primary). The problem is that during
the promotion the timeline switch happened in the middle of the
segment in Node2 and the Node3 is able to restore the newTLI.history
file from the archive but the WAL file with the new TLI is not yet
archived. Now, we will try to stream the wal file from the primary
but if we are fetching the checkpoint that time we will not use the
latest timeline instead we will try with the checkpoint timeline and
walsender will send the WAL from the new timeline file because
requested WAL recptr is before the TLI switch. And once that
happened the expectedTLEs will be set based on the oldTLI.history
file. Now, whenever we try to restore the required WAL file and
recoveryTargetTimeLineGoal is set to the latest we again try to rescan
the latest timeline (rescanLatestTimeLine) but the problem is
recoveryTargetTLI was already set to the latest timeline. So now the
problem is expectedTLEs is set to oldTLI and recoveryTargetTLI is set
to newTLI and rescanLatestTimeLine will never update the expectedTLEs.
Now, walsender will fail to stream new WAL using the old TLI and the
archiver process will also fail because that file doesn't not exists
anymore (converted to .partial). Basically, now we will never try
with the newTLI.

I have given the sequence of the events based on my analysis.

Refer to the sequence of event
-----------------------------------------
Node1 primary, Node2 standby1, Node3 standby2

1. Node2 got promoted to new primary, and node 2 picked new TL 13 in
the middle of the segment.
2. Node3, restarted with new primary info of Node2 and restore command
3. Node3, found the newest TL13 in validateRecoveryParameters()
Because the latest TL was requested in recovery.conf (history file
restored from TL13) and set recoveryTargetTLI to 13
So point to note is recoveryTargetTLI is set to 13 but expectedTLEs is
not yet set.
4. Node3, entered into the standby mode.
5. Node3, tries to read the checkpoint Record, on Node3 still the
checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 12.
6. Node3, tries to get the checkpoint record file using new TL13 from
the archive which it should get ideally but it may not if the Node2
haven't yet archived it.
7. Node3, tries to stream from primary but using TL12 because
ControlFile->checkPointCopy.ThisTimeLineID is 12.
8. Node3, get it because walsender of Node2 read it from TL13 and send
it and Node2 write in the new WAL file but with TL12.

WalSndSegmentOpen()
{
/*-------
* When reading from a historic timeline, and there is a timeline switch
* within this segment, read from the WAL segment belonging to the new
* timeline.
}

9. Node3, now set the expectedTLEs to 12 because that is what
walreceiver has streamed the WAL using.

10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12. So
whenever it tries to find the latest TLE (rescanLatestTimeLine ) it
finds it is 13 which is the same as recoveryTargetTLI so nothing to
change but expectedTLEs is 12 using which it will try to
restore/stream further WAL and fail every time.

rescanLatestTimeLine(void)
{
....
newtarget = findNewestTimeLine(recoveryTargetTLI);
if (newtarget == recoveryTargetTLI)
{
/* No new timelines found */
return false;
}

11. So now the situation is that ideally in rescanLatestTimeLine() we
should get the latest TLI but it assumes that it is already on the
latest TLI because recoveryTargetTLI is on the latest TLI.

Other points to be noted:
- The actual issue happened on 9.6.11 but based on the code comparison
it appeared that same can occur on the latest code as well.
- After Node3 is shutdown wal from its pg_wal/ directory were removed
so that it can follow the new primary.

Based on the sequence of events It is quite clear that something is
wrong in rescanLatestTimeLine, maybe after selecting the latest TLI we
should compare it with the head of the expectedTLEs as well instead of
just comparing it to the recoveryTargetTLI?

Log from Node2:
2020-12-22 04:49:02 UTC [1019]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG: received promote request
2020-12-22 04:49:02 UTC [1019]: [10-1] user=; db=; app=; client=;
SQLcode=00000 LOG: redo done at 0/F8000028
2020-12-22 04:49:02 UTC [1019]: [11-1] user=; db=; app=; client=;
SQLcode=00000 LOG: last completed transaction was at log time
2020-12-22 04:48:01.833476+00
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:02 UTC [1019]: [12-1] user=; db=; app=; client=;
SQLcode=00000 LOG: selected new timeline ID: 13
2020-12-22 04:49:02 UTC [1019]: [13-1] user=; db=; app=; client=;
SQLcode=00000 LOG: archive recovery complete

Log From Node3 (with pointwise analysis):

1. Node3 restarted, restored 0000000D.history from archive and
recoveryTargetTLI is set to 13
2020-12-22 04:49:40 UTC [2896]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG: database system is shut down
2020-12-22 04:49:40 UTC [2872]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG: database system is shut down
2020-12-22 04:49:41 UTC [9082]: [1-1] user=; db=; app=; client=;
SQLcode=00000 LOG: database system was shut down in recovery at
2020-12-22 04:49:40 UTC
2020-12-22 04:49:41 UTC [9082]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG: creating missing WAL directory
"pg_xlog/archive_status"
2020-12-22 04:49:41 UTC [9082]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG: restored log file "0000000D.history" from archive
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:41 UTC [9082]: [4-1] user=; db=; app=; client=;
SQLcode=00000 LOG: entering standby mode
2020-12-22 04:49:41 UTC [9082]: [5-1] user=; db=; app=; client=;
SQLcode=00000 LOG: restored log file "0000000D.history" from archive

2. Tries to get the WAL file with different timelines from the archive
but did not get so expectedTLEs is not yet set

rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000B00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000100000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]

3. Since fetching the checkpoint record so use the checkpoint TLI
which is 12, although primary doesn’t have 0000000C00000000000000F8
file as it renamed it to 0000000C00000000000000F8.partial

But there is the logic in walsender that if requested wal is there in
the current TLI then send from their so it will stream from
0000000D00000000000000F8 file

2020-12-22 04:49:42 UTC [9105]: [1-1] user=; db=; app=; client=;
SQLcode=00000 LOG: fetching timeline history file for timeline 12
from primary server
2020-12-22 04:49:42 UTC [9105]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG: started streaming WAL from primary at 0/F8000000
on timeline 12
2020-12-22 04:49:42 UTC [9105]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG: replication terminated by primary server
2020-12-22 04:49:42 UTC [9105]: [4-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at
0/F8000098.

4. Now since walreciver assumes that it has restore the WAL from the
TL 12 the recieveTLI is 12 and the expectedTLEs is set base on the
0000000C.history.
See below Logic in WaitForWalToBecomeAvailable
if (readFile < 0)
{
if (!expectedTLEs)
expectedTLEs = readTimeLineHistory(receiveTLI);

2020-12-22 04:49:42 UTC [9082]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG: restored log file "0000000C.history" from archive
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:42 UTC [9082]: [7-1] user=; db=; app=; client=;
SQLcode=00000 LOG: restored log file "0000000C.history" from archive
2020-12-22 04:49:42 UTC [9082]: [8-1] user=; db=; app=; client=;
SQLcode=00000 LOG: consistent recovery state reached at 0/F8000098
2020-12-22 04:49:42 UTC [9082]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG: invalid record length at 0/F8000098: wanted 24,
got 0
2020-12-22 04:49:42 UTC [9074]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG: database system is ready to accept read only
connections
2020-12-22 04:49:42 UTC [9105]: [5-1] user=; db=; app=; client=;
SQLcode=00000 LOG: restarted WAL streaming at 0/F8000000 on timeline
12
2020-12-22 04:49:42 UTC [9105]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG: replication terminated by primary server
2020-12-22 04:49:42 UTC [9105]: [7-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at
0/F8000098.

4. Now, expectedTLEs head is as 12 and recoveryTargetTLI is 13 so in
rescanLatestTimeLine we always assume we are at the latest Ali but we
try to archive from expectedTLEs which is old TLI.

rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:47 UTC [9105]: [8-1] user=; db=; app=; client=;
SQLcode=00000 LOG: restarted WAL streaming at 0/F8000000 on timeline
12
2020-12-22 04:49:47 UTC [9105]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG: replication terminated by primary server
2020-12-22 04:49:47 UTC [9105]: [10-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at
0/F8000098.
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com