Possible missing segments in archiving on standby

Started by Kyotaro Horiguchiover 5 years ago15 messages
#1Kyotaro Horiguchi
horikyota.ntt@gmail.com
5 attachment(s)

Hello.

While looking a patch, I found that a standby with archive_mode=always
fails to archive segments under certain conditions.

A. Walreceiver is gracefully terminated just after a segment is
finished.

B. Walreceiver is gracefully terminated while receiving filling chunks
for a segment switch.

The two above are reprodusible (without distinction between the two)
using a helper patch. See below.

There's one more issue here.

C. Standby doesn't archive a segment until walreceiver receives any
data for the next segment.

I'm not sure wehther we assume C as an issue.

The first attached patch fixes A and B. A side-effect of that is that
standby archives the previous segment of the streaming start
location. Concretely 00..0100..2 gets to be archived in the above case
(recovery starts at 0/3000000). That behavior doesn't seem to be a
proble since the segment is a part of the standby's data anyway.

The second attached patch fixes all of A to C, but seems somewhat
redundant.

Any opnions and/or suggestions are welcome.

The attached files are:

1. v1-0001-Make-sure-standby-archives-all-segments.patch:
Fix for A and B.

2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch:
Fix for A, B and C.

3. repro.sh
The reproducer shell script used below.

4. repro_helper.patch
Helper patch for repro.sh for master and patch 1 above.

5. repro_helper2.patch
Helper patch for repro.sh for patch 2 above.

=====
** REPRODUCER

The failure is reproducible with some code tweak.

1. Create a primary server with archive_mode=always then start it.
2. Create and start a standby.
3. touch /tmp/hoge

4. psql -c "create table t(); drop table t; select pg_switch_wal(); select pg_sleep(1); create table t(); drop table t; select pg_switch_wal();"

5. look into the archive directory of the standby.
If no missing segments found in archive, repeat from 3.

The third attached shell script is a reproducer for the problem,
needing the aid of the fourth patch attached.

$ mkdir testdir
$ cd testdir
$ bash ..../repro.sh
....
After test 2:
Primary location: 0/8000310
Standby location: 0/8000310
# primary archive
000000010000000000000003
000000010000000000000004
000000010000000000000005
000000010000000000000006
000000010000000000000007
000000010000000000000008
# standby archive
000000010000000000000003
000000010000000000000005
000000010000000000000006
000000010000000000000008

The segment 4 is skipped by the issue A and 7 is skipped by the issue
B.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v1-0001-Make-sure-standby-archives-all-segments.patchtext/x-patch; charset=us-asciiDownload
From afa907bca7db8ea6335d47bd02761f567591d553 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 30 Jun 2020 14:21:30 +0900
Subject: [PATCH v1] Make sure standby archives all segments

Standby fails to archive a segment if standby is stopped just after a
segment is finished or stopped just after a segment swtich. Make sure
that walreceiver archives all segments by rechecking at start.
---
 src/backend/replication/walreceiver.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..680154365d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -938,6 +938,23 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				else
 					XLogArchiveNotify(xlogfname);
 			}
+			else if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+			{
+				/*
+				 * If we are starting streaming at the beginning of a segment,
+				 * there may be casees where the previous segment have not been
+				 * archived yet.  Make sure it is archived.
+				 */
+				char		xlogfname[MAXFNAMELEN];
+				XLogSegNo	prevseg;
+
+				XLByteToPrevSeg(recptr, prevseg, wal_segment_size);
+				XLogFileName(xlogfname, ThisTimeLineID, prevseg,
+							 wal_segment_size);
+
+				/* Mark as ".ready" of not yet */
+				XLogArchiveCheckDone(xlogfname);
+			}
 			recvFile = -1;
 
 			/* Create/use new log file */
-- 
2.18.4

v1-0001-Make-sure-standby-archives-all-segments-immediate.patchtext/x-patch; charset=us-asciiDownload
From 7af716134dceb3bafce421dfeaffebf1e1e3e17d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 29 Jun 2020 16:12:01 +0900
Subject: [PATCH v1] Make sure standby archives all segments immediately

Standby may get a bit being late in archive, since walsender doesn't
archive a segment until it receives any data for the next segment, Fix
that by archiving just after a segment is finished.

Also, standby fails to archive a segment if standby is stopped just
after a segment is finished or stopped just after a segment
swtich. Make sure that walreceiver archives all segments by rechecking
at start.
---
 src/backend/replication/walreceiver.c | 82 ++++++++++++++++++---------
 1 file changed, 54 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..831718c859 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -902,49 +902,34 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		/* Open the segment if not yet */
+		if (recvFile < 0)
 		{
 			bool		use_existent;
 
+			recvFileTLI = ThisTimeLineID;
+
 			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
+			 * If we are starting streaming at the beginning of a segment,
+			 * there may be the case where the previous segment have not been
+			 * archived yet.  Make sure it is archived.
 			 */
-			if (recvFile >= 0)
+			if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && recvSegNo == 0)
 			{
 				char		xlogfname[MAXFNAMELEN];
+				XLogSegNo	prevseg;
 
-				XLogWalRcvFlush(false);
+				XLByteToPrevSeg(recptr, prevseg, wal_segment_size);
+				XLogFileName(xlogfname, recvFileTLI, prevseg, wal_segment_size);
 
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
+				/* Mark as ".ready" of not yet */
+				XLogArchiveCheckDone(xlogfname);
 			}
-			recvFile = -1;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
 			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
-			recvFileTLI = ThisTimeLineID;
 		}
 
 		/* Calculate the start offset of the received logs */
@@ -985,6 +970,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		/*
+		 * Close the current WAL segment if it is completed then let the file
+		 * be archived if needed.
+		 */
+		if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		{
+			char		xlogfname[MAXFNAMELEN];
+
+			Assert (recvFile >= 0);
+
+			/*
+			 * fsync() and close current file before we switch to next one. We
+			 * would otherwise have to reopen this file to fsync it later
+			 */
+			XLogWalRcvFlush(false);
+
+			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+			/*
+			 * XLOG segment files will be re-read by recovery in startup
+			 * process soon, so we don't advise the OS to release cache
+			 * pages associated with the file like XLogFileClose() does.
+			 */
+			if (close(recvFile) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not close log segment %s: %m",
+								xlogfname)));
+
+			/*
+			 * Create .done file forcibly to prevent the streamed segment
+			 * from being archived later.
+			 */
+			if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+				XLogArchiveForceDone(xlogfname);
+			else
+				XLogArchiveNotify(xlogfname);
+
+			recvFile = -1;
+		}
 	}
 
 	/* Update shared-memory status */
-- 
2.18.4

repro.shtext/plain; charset=us-asciiDownload
repro_helper.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..dcccef151d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -985,6 +985,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		{
+			/* fake oneshot SIGTERM just at segment end */
+			struct stat b;
+			char *sigfile1 = "/tmp/hoge1";
+			char *sigfile2 = "/tmp/hoge2";
+
+			if (LogstreamResult.Write % wal_segment_size == 0 &&
+				stat(sigfile1, &b) == 0)
+			{
+				unlink(sigfile1);
+				got_SIGTERM = true;
+				ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write)));
+			}
+
+			if (LogstreamResult.Write % wal_segment_size == 0x500000 &&
+				stat(sigfile2, &b) == 0)
+			{
+				unlink(sigfile2);
+				got_SIGTERM = true;
+				ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write)));
+			}
+		}
 	}
 
 	/* Update shared-memory status */
repro_helper2.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 831718c859..b9a7c73ed7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1011,6 +1011,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 			recvFile = -1;
 		}
+
+		{
+			/* fake oneshot SIGTERM just at segment end */
+			struct stat b;
+			char *sigfile1 = "/tmp/hoge1";
+			char *sigfile2 = "/tmp/hoge2";
+
+			if (LogstreamResult.Write % wal_segment_size == 0 &&
+				stat(sigfile1, &b) == 0)
+			{
+				unlink(sigfile1);
+				got_SIGTERM = true;
+				ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write)));
+			}
+
+			if (LogstreamResult.Write % wal_segment_size == 0x500000 &&
+				stat(sigfile2, &b) == 0)
+			{
+				unlink(sigfile2);
+				got_SIGTERM = true;
+				ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write)));
+			}
+		}
 	}
 
 	/* Update shared-memory status */
#2Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#1)
1 attachment(s)
Re: Possible missing segments in archiving on standby

On 2020/06/30 16:55, Kyotaro Horiguchi wrote:

Hello.

While looking a patch, I found that a standby with archive_mode=always
fails to archive segments under certain conditions.

I encountered this issue, too.

1. v1-0001-Make-sure-standby-archives-all-segments.patch:
Fix for A and B.

2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch:
Fix for A, B and C.

You proposed two patches, but this patch should be reviewed preferentially
because this addresses all the issues (i.e., A, B and C) that you reported?

+			 * If we are starting streaming at the beginning of a segment,
+			 * there may be the case where the previous segment have not been
+			 * archived yet.  Make sure it is archived.

Could you clarify why the archive notification file of the previous
WAL segment needs to be checked?

As far as I read the code, the cause of the issue seems to be that
XLogWalRcvWrite() exits without creating an archive notification file
even if the current WAL segment is fully written up in the last cycle of
XLogWalRcvWrite()'s loop. That is, creation of the notification file
and WAL archiving of that completed segment will be delayed
until any data in the next segment is received and written (by next call
to XLogWalRcvWrite()). Furthermore, in that case, if walreceiver exits
before receiving such next segment, the completed current segment
fails to be archived as Horiguchi-san reported.

Therefore, IMO that the simple approach to fix the issue is to create
an archive notification file if possible at the end of XLogWalRcvWrite().
I implemented this idea. Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

walreceiver_notify_archive_soon.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..5b07eef3aa 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,42 +884,11 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
-		{
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname, true);
-			}
-			recvFile = -1;
+		/* Close the current segment if it's completed */
+		XLogWalRcvClose(recptr);
 
+		if (recvFile < 0)
+		{
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			recvFile = XLogFileInit(recvSegNo);
@@ -967,6 +937,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1020,6 +998,53 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment if it's completed.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+	{
+		char		xlogfname[MAXFNAMELEN];
+
+		/*
+		 * fsync() and close current file before we switch to next one. We
+		 * would otherwise have to reopen this file to fsync it later
+		 */
+		XLogWalRcvFlush(false);
+
+		XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+		/*
+		 * XLOG segment files will be re-read by recovery in startup process
+		 * soon, so we don't advise the OS to release cache pages associated
+		 * with the file like XLogFileClose() does.
+		 */
+		if (close(recvFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close log segment %s: %m",
+							xlogfname)));
+
+		/*
+		 * Create .done file forcibly to prevent the streamed segment from
+		 * being archived later.
+		 */
+		if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+			XLogArchiveForceDone(xlogfname);
+		else
+			XLogArchiveNotify(xlogfname, true);
+
+		recvFile = -1;
+	}
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#2)
3 attachment(s)
Re: Possible missing segments in archiving on standby

At Tue, 31 Aug 2021 01:54:36 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/06/30 16:55, Kyotaro Horiguchi wrote:

Hello.
While looking a patch, I found that a standby with archive_mode=always
fails to archive segments under certain conditions.

I encountered this issue, too.

1. v1-0001-Make-sure-standby-archives-all-segments.patch:
Fix for A and B.
2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch:
Fix for A, B and C.

You proposed two patches, but this patch should be reviewed
preferentially
because this addresses all the issues (i.e., A, B and C) that you
reported?

Maybe. The point here was whether we regard C as an issue, but now I
think it is an issue.

+ * If we are starting streaming at the beginning of a segment,
+ * there may be the case where the previous segment have not been
+			 * archived yet.  Make sure it is archived.

Could you clarify why the archive notification file of the previous
WAL segment needs to be checked?

As far as I read the code, the cause of the issue seems to be that
XLogWalRcvWrite() exits without creating an archive notification file
even if the current WAL segment is fully written up in the last cycle
of
XLogWalRcvWrite()'s loop. That is, creation of the notification file
and WAL archiving of that completed segment will be delayed
until any data in the next segment is received and written (by next
call
to XLogWalRcvWrite()). Furthermore, in that case, if walreceiver exits
before receiving such next segment, the completed current segment
fails to be archived as Horiguchi-san reported.

Right. Finally such segments are archived when a future checkpoint
removes them. In that sense the patch works to just let archiving
happens faster, but on the other hand I came to think we are supposed
to archive a segment as soon as it is completed. (That is, I think C
is a problem.)

Therefore, IMO that the simple approach to fix the issue is to create
an archive notification file if possible at the end of
XLogWalRcvWrite().
I implemented this idea. Patch attached.

I'm not sure which is simpler, but it works except for B, the case of
a long-jump by a segment switch. When a segment switch happens,
walsender sends filling zero-pages but even if walreceiver is
terminated before the segment is completed, walsender restarts from
the next segment at the next startup. Concretely like the following.

- pg_switch_wal() invoked at 6003228 (for example)
- walreceiver terminates at 6500000 (or a bit later).
- walrecever rstarts from 7000000

In this case the segment 6 is not notified even with the patch, and my
old patches works the same way. (In other words, the call to
XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the
case as you might expect.) If we think it ok that we don't notify the
segment earlier than a future checkpoint removes it, yours or only the
last half of my one is sufficient, but do we really think so?
Furthermore, your patch or only the last half of my second patch
doesn't save the case of a crash unlike the case of a graceful
termination.

The attached files are:

v2wip-0001-Make-sure... : a rebased patch of the old second patch
repro_helper.diff : reproducer helper patch, used by the script below.
repro.sh : reproducer script.

(The second diff conflicts with the first patch. Since the second just
inserts a single code block, it is easily applicable manually:p)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2wip-0001-Make-sure-standby-archives-all-segments-immediate.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..a44d84f0a7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -883,46 +883,32 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		/* Open the segment if not yet */
+		if (recvFile < 0)
 		{
+			recvFileTLI = ThisTimeLineID;
+
 			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
+			 * If we are starting streaming at the beginning of a segment,
+			 * there may be the case where the previous segment have not been
+			 * archived yet.  Make sure it is archived.
 			 */
-			if (recvFile >= 0)
+			if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && recvSegNo == 0)
 			{
 				char		xlogfname[MAXFNAMELEN];
+				XLogSegNo	prevseg;
 
 				XLogWalRcvFlush(false);
+				XLByteToPrevSeg(recptr, prevseg, wal_segment_size);
+				XLogFileName(xlogfname, recvFileTLI, prevseg, wal_segment_size);
 
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname, true);
+				/* Mark as ".ready" if not yet */
+				XLogArchiveCheckDone(xlogfname);
 			}
-			recvFile = -1;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			recvFile = XLogFileInit(recvSegNo);
-			recvFileTLI = ThisTimeLineID;
 		}
 
 		/* Calculate the start offset of the received logs */
@@ -963,6 +949,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		/*
+		 * Close the current WAL segment if it is completed then let the file
+		 * be archived if needed.
+		 */
+		if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		{
+			char		xlogfname[MAXFNAMELEN];
+
+			Assert (recvFile >= 0);
+
+			/*
+			 * fsync() and close current file before we switch to next one. We
+			 * would otherwise have to reopen this file to fsync it later
+			 */
+			XLogWalRcvFlush(false);
+
+			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+			/*
+			 * XLOG segment files will be re-read by recovery in startup
+			 * process soon, so we don't advise the OS to release cache
+			 * pages associated with the file like XLogFileClose() does.
+			 */
+			if (close(recvFile) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not close log segment %s: %m",
+								xlogfname)));
+
+			/*
+			 * Create .done file forcibly to prevent the streamed segment
+			 * from being archived later.
+			 */
+			if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+				XLogArchiveForceDone(xlogfname);
+			else
+				XLogArchiveNotify(xlogfname, true);
+
+			recvFile = -1;
+		}
 	}
 
 	/* Update shared-memory status */
repro_helper.difftext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..396510a81d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -963,6 +963,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		{
+			/* fake oneshot SIGTERM just at segment end */
+			struct stat b;
+			char *sigfile1 = "/tmp/hoge1";
+			char *sigfile2 = "/tmp/hoge2";
+
+			if (LogstreamResult.Write % wal_segment_size == 0 &&
+				stat(sigfile1, &b) == 0)
+			{
+				unlink(sigfile1);
+				ShutdownRequestPending = true;
+				ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write)));
+			}
+
+			if (LogstreamResult.Write % wal_segment_size == 0x500000 &&
+				stat(sigfile2, &b) == 0)
+			{
+				unlink(sigfile2);
+				ShutdownRequestPending = true;
+				ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write)));
+			}
+		}
 	}
 
 	/* Update shared-memory status */
repro.shtext/plain; charset=us-asciiDownload
#4Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#3)
Re: Possible missing segments in archiving on standby

On 2021/08/31 16:35, Kyotaro Horiguchi wrote:

I'm not sure which is simpler, but it works except for B, the case of
a long-jump by a segment switch. When a segment switch happens,
walsender sends filling zero-pages but even if walreceiver is
terminated before the segment is completed, walsender restarts from
the next segment at the next startup. Concretely like the following.

- pg_switch_wal() invoked at 6003228 (for example)
- walreceiver terminates at 6500000 (or a bit later).
- walrecever rstarts from 7000000

In this case the segment 6 is not notified even with the patch, and my
old patches works the same way. (In other words, the call to
XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the
case as you might expect.) If we think it ok that we don't notify the
segment earlier than a future checkpoint removes it, yours or only the
last half of my one is sufficient, but do we really think so?
Furthermore, your patch or only the last half of my second patch
doesn't save the case of a crash unlike the case of a graceful
termination.

Thanks for the clarification!
Please let me check my understanding about the issue.

The issue happens when walreceiver exits after it receives XLOG_SWITCH record
but before receives the remaining bytes of the segment including that
XLOG_SWITCH record. In this case, the startup process tries to replay that
"half-received" segment, finds XLOG_SWITCH record in it, moves to the next
segment and then starts new walreceiver from that next segment. Therefore,
even with my patch, the segment including that XLOG_SWITCH record is not
archived soon. Is my understanding right? I agree that we should address also
this issue.

ISTM, to address the issue, it's simpler and less fragile to make the startup
process call XLogArchiveCheckDone() or something whenever it moves
the next segment, rather than make walreceiver do that. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#4)
1 attachment(s)
Re: Possible missing segments in archiving on standby

At Tue, 31 Aug 2021 23:23:27 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/08/31 16:35, Kyotaro Horiguchi wrote:

I'm not sure which is simpler, but it works except for B, the case of
a long-jump by a segment switch. When a segment switch happens,
walsender sends filling zero-pages but even if walreceiver is
terminated before the segment is completed, walsender restarts from
the next segment at the next startup. Concretely like the following.
- pg_switch_wal() invoked at 6003228 (for example)
- walreceiver terminates at 6500000 (or a bit later).
- walrecever rstarts from 7000000
In this case the segment 6 is not notified even with the patch, and my
old patches works the same way. (In other words, the call to
XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the
case as you might expect.) If we think it ok that we don't notify the
segment earlier than a future checkpoint removes it, yours or only the
last half of my one is sufficient, but do we really think so?
Furthermore, your patch or only the last half of my second patch
doesn't save the case of a crash unlike the case of a graceful
termination.

Thanks for the clarification!
Please let me check my understanding about the issue.

The issue happens when walreceiver exits after it receives XLOG_SWITCH
record
but before receives the remaining bytes of the segment including that
XLOG_SWITCH record. In this case, the startup process tries to replay
that
"half-received" segment, finds XLOG_SWITCH record in it, moves to the
next
segment and then starts new walreceiver from that next
segment. Therefore,
even with my patch, the segment including that XLOG_SWITCH record is
not
archived soon. Is my understanding right? I agree that we should
address also
this issue.

Right.

ISTM, to address the issue, it's simpler and less fragile to make the
startup
process call XLogArchiveCheckDone() or something whenever it moves
the next segment, rather than make walreceiver do that. Thought?

Putting aside the issue C, it would work as far as recovery is not
paused or delayed. Although simply doing that means we run additional
and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to
imagine moving the responsibility to notify a finished segment from
walsender (writer side) to startup (reader side).

In the first place A and B happens only at termination or crash of
walsender so there's no fragility in checking only the previous
segment at start of walsender. After a bit thought I noticed that we
don't need to do that in the wal-writing loop. And I noticed that we
need to consider timeline transitions while calculating the previous
segment. Even though missing-notification at a timeline-switch
doesn't happen unless walsender is killed hard for example by a
sigkill or a power cut, though.

So the attached is a new version of the patch to fix only A and B.

- Moved the check code out of the replication loop.

- Track timeline transition while calculating the previous segment.
If we don't do that, we would need another means to avoid notifying
non-existent segment instead of the correct one.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v3wip-0001-Make-sure-standby-archives-all-segments-immediate.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..81dde27372 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -173,6 +173,7 @@ WalReceiverMain(void)
 	XLogRecPtr	startpoint;
 	TimeLineID	startpointTLI;
 	TimeLineID	primaryTLI;
+	XLogSegNo	startsegno;
 	bool		first_stream;
 	WalRcvData *walrcv = WalRcv;
 	TimestampTz last_recv_timestamp;
@@ -313,6 +314,32 @@ WalReceiverMain(void)
 	if (sender_host)
 		pfree(sender_host);
 
+	/*
+	 * There's a case walreceiver terminated before notifying the last
+	 * finished segment. Make sure the last finished segment is archived
+	 * immediately.
+	 */
+	XLByteToSeg(startpoint, startsegno, wal_segment_size);
+	if (startsegno > 1)
+	{
+		char 		xlogfname[MAXFNAMELEN];
+		TimeLineID	prevsegTLI;
+		XLogRecPtr	prevsegEndRecPtr;
+		List	   *tles;
+
+		/*
+		 * The previous segment may be in the previous timeline.  Track
+		 * timelines to find the segment on the correct timeline.
+		 */
+		tles = readTimeLineHistory(startpointTLI);
+		prevsegEndRecPtr =
+			startpoint - XLogSegmentOffset(startpoint, wal_segment_size) - 1;
+		prevsegTLI = tliOfPointInHistory(prevsegEndRecPtr, tles);
+		XLogFileName(xlogfname, prevsegTLI, startsegno - 1,
+					 wal_segment_size);
+		XLogArchiveCheckDone(xlogfname);
+	}
+		
 	first_stream = true;
 	for (;;)
 	{
#6Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#5)
Re: Possible missing segments in archiving on standby

On 2021/09/01 12:12, Kyotaro Horiguchi wrote:

Putting aside the issue C, it would work as far as recovery is not
paused or delayed. Although simply doing that means we run additional
and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to
imagine moving the responsibility to notify a finished segment from
walsender (writer side) to startup (reader side).

You mean walreceiver, not walsender?

I was thinking to apply my latest patch, to address the issue A and C.
So walreceiver is still basically responsible to create .ready file.
Also regarding the issue B, I was thinking to make the startup process
call XLogArchiveCheckDone() or something only when it finds
XLOG_SWITCH record. Thought?

In the first place A and B happens only at termination or crash of
walsender so there's no fragility in checking only the previous
segment at start of walsender. After a bit thought I noticed that we
don't need to do that in the wal-writing loop. And I noticed that we
need to consider timeline transitions while calculating the previous
segment. Even though missing-notification at a timeline-switch
doesn't happen unless walsender is killed hard for example by a
sigkill or a power cut, though.

What happens if the server is promoted before that walreceiver is invoked?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#6)
Re: Possible missing segments in archiving on standby

At Wed, 1 Sep 2021 14:37:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/09/01 12:12, Kyotaro Horiguchi wrote:

Putting aside the issue C, it would work as far as recovery is not
paused or delayed. Although simply doing that means we run additional
and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to
imagine moving the responsibility to notify a finished segment from
walsender (writer side) to startup (reader side).

You mean walreceiver, not walsender?

Sorry, it's walreceiver.

I was thinking to apply my latest patch, to address the issue A and C.
So walreceiver is still basically responsible to create .ready file.

Considering the following discussion, I don't object to the patch.

Also regarding the issue B, I was thinking to make the startup process
call XLogArchiveCheckDone() or something only when it finds
XLOG_SWITCH record. Thought?

Sounds workable. I came to agree to the reader-side amendment as
below. But I might prefer to do that at every segment-switch in case
of a crash.

What happens if the server is promoted before that walreceiver is
invoked?

Hmmmmm. A partial segment is not created if a server promotes just at
a segment boundary, then the previous segment won't get archived until
the next checkpoint runs.

Ok, I agree that the reader-side needs an amendment.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#7)
1 attachment(s)
Re: Possible missing segments in archiving on standby

On 2021/09/02 10:16, Kyotaro Horiguchi wrote:

Ok, I agree that the reader-side needs an amendment.

Thanks for the review! Attached is the updated version of the patch.
Based on my latest patch, I changed the startup process so that
it creates an archive notification file of the streamed WAL segment
including XLOG_SWITCH record if the notification file has not been created yet.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

walreceiver_notify_archive_soon_v2.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v2.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..6c407045dd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7576,6 +7576,34 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * Create an archive notification file of a streamed WAL
+				 * segment if it includes an XLOG_SWITCH record and its
+				 * notification file has not been created yet. This is
+				 * necessary to handle the corner case that walreceiver may
+				 * fail to create such notification file if it exits after it
+				 * receives XLOG_SWITCH record but while it's receiving the
+				 * remaining bytes in the segment. Without this handling, WAL
+				 * archiving of the segment will be delayed until subsequent
+				 * checkpoint creates its notification file when removing it
+				 * even though it can be archived soon.
+				 */
+				if (readSource == XLOG_FROM_STREAM &&
+					record->xl_rmid == RM_XLOG_ID &&
+					(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+				{
+					char		xlogfilename[MAXFNAMELEN];
+
+					XLogFileName(xlogfilename, curFileTLI, readSegNo, wal_segment_size);
+					if (!XLogArchiveIsReadyOrDone(xlogfilename))
+					{
+						if (XLogArchivingAlways())
+							XLogArchiveNotify(xlogfilename, true);
+						else
+							XLogArchiveForceDone(xlogfilename);
+					}
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..5b07eef3aa 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,42 +884,11 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		/* Close the current segment if it's completed */
+		XLogWalRcvClose(recptr);
+
+		if (recvFile < 0)
 		{
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname, true);
-			}
-			recvFile = -1;
-
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			recvFile = XLogFileInit(recvSegNo);
@@ -967,6 +937,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1020,6 +998,53 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment if it's completed.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+	{
+		char		xlogfname[MAXFNAMELEN];
+
+		/*
+		 * fsync() and close current file before we switch to next one. We
+		 * would otherwise have to reopen this file to fsync it later
+		 */
+		XLogWalRcvFlush(false);
+
+		XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+		/*
+		 * XLOG segment files will be re-read by recovery in startup process
+		 * soon, so we don't advise the OS to release cache pages associated
+		 * with the file like XLogFileClose() does.
+		 */
+		if (close(recvFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close log segment %s: %m",
+							xlogfname)));
+
+		/*
+		 * Create .done file forcibly to prevent the streamed segment from
+		 * being archived later.
+		 */
+		if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+			XLogArchiveForceDone(xlogfname);
+		else
+			XLogArchiveNotify(xlogfname, true);
+
+		recvFile = -1;
+	}
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#8)
Re: Possible missing segments in archiving on standby

At Fri, 3 Sep 2021 02:06:45 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/09/02 10:16, Kyotaro Horiguchi wrote:

Ok, I agree that the reader-side needs an amendment.

Thanks for the review! Attached is the updated version of the patch.
Based on my latest patch, I changed the startup process so that
it creates an archive notification file of the streamed WAL segment
including XLOG_SWITCH record if the notification file has not been
created yet.

+				if (readSource == XLOG_FROM_STREAM &&
+					record->xl_rmid == RM_XLOG_ID &&
+					(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)

readSource is the source at the time startup reads it and it could be
different from the source at the time the record was written. We
cannot know where the record came from there, so does the readSource
condition work as expected? If we had some trouble streaming just
before, readSource at the time is likely to be XLOG_FROM_PG_WAL.

+						if (XLogArchivingAlways())
+							XLogArchiveNotify(xlogfilename, true);
+						else
+							XLogArchiveForceDone(xlogfilename);

The path is used both for crash and archive recovery. If we pass there
while crash recovery on a primary with archive_mode=on, the file could
be marked .done before actually archived. On the other hand when
archive_mode=always, the file could be re-marked .ready even after it
has been already archived. Why isn't it XLogArchiveCheckDone?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#10Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#9)
1 attachment(s)
Re: Possible missing segments in archiving on standby

On 2021/09/03 14:56, Kyotaro Horiguchi wrote:

+				if (readSource == XLOG_FROM_STREAM &&
+					record->xl_rmid == RM_XLOG_ID &&
+					(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)

readSource is the source at the time startup reads it and it could be
different from the source at the time the record was written. We
cannot know where the record came from there, so does the readSource
condition work as expected? If we had some trouble streaming just
before, readSource at the time is likely to be XLOG_FROM_PG_WAL.

Yes.

+						if (XLogArchivingAlways())
+							XLogArchiveNotify(xlogfilename, true);
+						else
+							XLogArchiveForceDone(xlogfilename);

The path is used both for crash and archive recovery. If we pass there
while crash recovery on a primary with archive_mode=on, the file could
be marked .done before actually archived. On the other hand when
archive_mode=always, the file could be re-marked .ready even after it
has been already archived. Why isn't it XLogArchiveCheckDone?

Yeah, you're right. ISTM what we should do is to just call
XLogArchiveCheckDone() for the segment including XLOG_SWITCH record,
i.e., to create .ready file if the segment has no archive notification file yet
and archive_mode is set to 'always'. Even if we don't do this when we reach
XLOG_SWITCH record, subsequent restartpoints eventually will call
XLogArchiveCheckDone() for such segments.

One issue of this approach is that the WAL segment including XLOG_SWITCH
record may be archived before its previous segments are. That is,
the notification file of current segment is created when it's replayed
because it includes XLOG_SWIATCH, but the notification files of
its previous segments will be created by subsequent restartpoints
because they don't have XLOG_SWITCH. Probably we should avoid this?

If yes, one approach for this issue is to call XLogArchiveCheckDone() for
not only the segment including XLOG_SWITCH but also all the segments
older than that. Thought?

Anyway, I extracted the changes in walreceiver from the patch,
because it's self-contained and can be applied separately.
Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

walreceiver_notify_archive_soon_v3.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v3.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a2bc37fd7..eb9d12adc1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,42 +884,11 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		/* Close the current segment if it's completed */
+		XLogWalRcvClose(recptr);
+
+		if (recvFile < 0)
 		{
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
-
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			recvFile = XLogFileInit(recvSegNo);
@@ -967,6 +937,14 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1020,6 +998,53 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment if it's completed.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+	{
+		char		xlogfname[MAXFNAMELEN];
+
+		/*
+		 * fsync() and close current file before we switch to next one. We
+		 * would otherwise have to reopen this file to fsync it later
+		 */
+		XLogWalRcvFlush(false);
+
+		XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+		/*
+		 * XLOG segment files will be re-read by recovery in startup process
+		 * soon, so we don't advise the OS to release cache pages associated
+		 * with the file like XLogFileClose() does.
+		 */
+		if (close(recvFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close log segment %s: %m",
+							xlogfname)));
+
+		/*
+		 * Create .done file forcibly to prevent the streamed segment from
+		 * being archived later.
+		 */
+		if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+			XLogArchiveForceDone(xlogfname);
+		else
+			XLogArchiveNotify(xlogfname);
+
+		recvFile = -1;
+	}
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#10)
Re: Possible missing segments in archiving on standby

At Tue, 7 Sep 2021 17:03:06 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

+ if (XLogArchivingAlways())
+ XLogArchiveNotify(xlogfilename, true);
+ else
+ XLogArchiveForceDone(xlogfilename);
The path is used both for crash and archive recovery. If we pass there
while crash recovery on a primary with archive_mode=on, the file could
be marked .done before actually archived. On the other hand when
archive_mode=always, the file could be re-marked .ready even after it
has been already archived. Why isn't it XLogArchiveCheckDone?

Yeah, you're right. ISTM what we should do is to just call
XLogArchiveCheckDone() for the segment including XLOG_SWITCH record,
i.e., to create .ready file if the segment has no archive notification
file yet
and archive_mode is set to 'always'. Even if we don't do this when we
reach
XLOG_SWITCH record, subsequent restartpoints eventually will call
XLogArchiveCheckDone() for such segments.

One issue of this approach is that the WAL segment including
XLOG_SWITCH
record may be archived before its previous segments are. That is,
the notification file of current segment is created when it's replayed
because it includes XLOG_SWIATCH, but the notification files of
its previous segments will be created by subsequent restartpoints
because they don't have XLOG_SWITCH. Probably we should avoid this?

Anyway there's no guarantee on the archive ordering. As discussed in
the nearby thread [1]/messages/by-id/20210504042755.ehuaoz5blcnjw5yk@alap3.anarazel.de, newer segment is often archived earlier. I
agree that that happens mainly on busy servers, though. The archiver
is designed to handle such "gaps" and/or out-of-order segment
notifications. We could impose a strict ordering on archiving but I
think we would take total performance than such strictness.

In short, no.

If yes, one approach for this issue is to call XLogArchiveCheckDone()
for
not only the segment including XLOG_SWITCH but also all the segments
older than that. Thought?

At least currently, recovery fetches segments by a single process and
every file is marked immediately after being filled-up, so all files
other than the latest one in pg_wal including history files should
have been marked for sure unless file system gets into a trouble. So I
think we don't need to do that even if we want the strictness.

Addition to that that takes too long time when many segments reside in
pg_wal so I think we never want to run such a scan at every segment
end that recovery passes. If I remember correctly, the reason we
don't fix archive status at start up but at checkpoint is we avoided
extra startup time.

Anyway, I extracted the changes in walreceiver from the patch,
because it's self-contained and can be applied separately.
Patch attached.

I'm not sure I like that XLogWalRcvClose hides the segment-switch
condition. If we do that check in the function, I'd like to name the
function XLogWalRcvCloseIfSwitched or something indicates the
condition. I'd like to invert the condition to reduce indentation,
too.

Why don't we call it just after writing data, as my first proposal
did? There's no difference in functionality between doing that and the
patch. If we do so, recvFile>=0 is always true and that condition can
be removed but that would be optional. Anyway, by doing that, no
longer need to call the function twice or we can even live without the
new function.

[1]: /messages/by-id/20210504042755.ehuaoz5blcnjw5yk@alap3.anarazel.de

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#11)
1 attachment(s)
Re: Possible missing segments in archiving on standby

On 2021/09/08 10:45, Kyotaro Horiguchi wrote:

Anyway there's no guarantee on the archive ordering. As discussed in
the nearby thread [1], newer segment is often archived earlier. I
agree that that happens mainly on busy servers, though. The archiver
is designed to handle such "gaps" and/or out-of-order segment
notifications. We could impose a strict ordering on archiving but I
think we would take total performance than such strictness.

Yes, there are other cases causing newer WAL file to be archived eariler.
The issue can happen if XLogArchiveNotify() fails to create .ready file,
for example. Fixing only the case that we're discussing here is not enough.
If *general* fix is discussed at the thread you told, it's better to
do nothing here for the issue and to just make the startup process call
XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH record.

At least currently, recovery fetches segments by a single process and
every file is marked immediately after being filled-up, so all files
other than the latest one in pg_wal including history files should
have been marked for sure unless file system gets into a trouble.

You can reproduce that situation easily by starting the server with
archive_mode=off, generating WAL files, sometimes running pg_switch_wal(),
causing the server to crash, and then restarting the server with
archive_mode=on. At the beginning of recovery, all the WAL files in pg_wal
don't have their archive notification files at all. Then, with the patch,
only WAL files including XLOG_SWITCH are notified for WAL archiving
during recovery. The other WAL files will be notified at the subsequent
checkpoint.

I'm not sure I like that XLogWalRcvClose hides the segment-switch
condition. If we do that check in the function, I'd like to name the
function XLogWalRcvCloseIfSwitched or something indicates the
condition. I'd like to invert the condition to reduce indentation,
too.

We can move the condition-check out of the function like the attached patch.

Why don't we call it just after writing data, as my first proposal
did? There's no difference in functionality between doing that and the
patch. If we do so, recvFile>=0 is always true and that condition can
be removed but that would be optional. Anyway, by doing that, no
longer need to call the function twice or we can even live without the
new function.

I think that it's better and more robust to confirm that the currently-opened
WAL file is valid target one to write WAL *before* actually writing any data
into it.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

walreceiver_notify_archive_soon_v4.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v4.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a2bc37fd7..8d7f52352d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,42 +884,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		/* Close the current segment if it's completed */
+		if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+			XLogWalRcvClose(recptr);
+
+		if (recvFile < 0)
 		{
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
-
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			recvFile = XLogFileInit(recvSegNo);
@@ -967,6 +938,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1020,6 +1000,52 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	char		xlogfname[MAXFNAMELEN];
+
+	Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size));
+
+	/*
+	 * fsync() and close current file before we switch to next one. We
+	 * would otherwise have to reopen this file to fsync it later
+	 */
+	XLogWalRcvFlush(false);
+
+	XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process
+	 * soon, so we don't advise the OS to release cache pages associated
+	 * with the file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						xlogfname)));
+
+	/*
+	 * Create .done file forcibly to prevent the streamed segment from
+	 * being archived later.
+	 */
+	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+		XLogArchiveForceDone(xlogfname);
+	else
+		XLogArchiveNotify(xlogfname);
+
+	recvFile = -1;
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
#13Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#12)
Re: Possible missing segments in archiving on standby

At Wed, 8 Sep 2021 16:01:22 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/09/08 10:45, Kyotaro Horiguchi wrote:

Anyway there's no guarantee on the archive ordering. As discussed in
the nearby thread [1], newer segment is often archived earlier. I
agree that that happens mainly on busy servers, though. The archiver
is designed to handle such "gaps" and/or out-of-order segment
notifications. We could impose a strict ordering on archiving but I
think we would take total performance than such strictness.

Yes, there are other cases causing newer WAL file to be archived
eariler.
The issue can happen if XLogArchiveNotify() fails to create .ready
file,
for example. Fixing only the case that we're discussing here is not
enough.
If *general* fix is discussed at the thread you told, it's better to
do nothing here for the issue and to just make the startup process
call
XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH
record.

No. The discussion taken there is not about permanently missing .ready
files, but about .ready files created out-of-order. So I don't think
the outcome from the thread does *fix* this issue.

At least currently, recovery fetches segments by a single process and
every file is marked immediately after being filled-up, so all files
other than the latest one in pg_wal including history files should
have been marked for sure unless file system gets into a trouble.

You can reproduce that situation easily by starting the server with
archive_mode=off, generating WAL files, sometimes running
pg_switch_wal(),
causing the server to crash, and then restarting the server with
archive_mode=on. At the beginning of recovery, all the WAL files in
pg_wal
don't have their archive notification files at all. Then, with the
patch,
only WAL files including XLOG_SWITCH are notified for WAL archiving
during recovery. The other WAL files will be notified at the
subsequent
checkpoint.

I don't think we want such extent of perfectness at all for the case
where some archive-related parameters are changed after a
crash. Anyway we need to take a backup after that and at least all
segments required for the backup will be properly archived. The
segments up to the XLOG_SWITCH segment are harmless garbage, or a bit
of food for disk.

I'm not sure I like that XLogWalRcvClose hides the segment-switch
condition. If we do that check in the function, I'd like to name the
function XLogWalRcvCloseIfSwitched or something indicates the
condition. I'd like to invert the condition to reduce indentation,
too.

We can move the condition-check out of the function like the attached
patch.

Thanks!

Why don't we call it just after writing data, as my first proposal
did? There's no difference in functionality between doing that and the
patch. If we do so, recvFile>=0 is always true and that condition can
be removed but that would be optional. Anyway, by doing that, no
longer need to call the function twice or we can even live without the
new function.

I think that it's better and more robust to confirm that the
currently-opened
WAL file is valid target one to write WAL *before* actually writing
any data
into it.

Sounds convincing. Ok, I agree to that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#13)
6 attachment(s)
Re: Possible missing segments in archiving on standby

On 2021/09/08 16:40, Kyotaro Horiguchi wrote:

No. The discussion taken there is not about permanently missing .ready
files, but about .ready files created out-of-order. So I don't think
the outcome from the thread does *fix* this issue.

Hmm...

I don't think we want such extent of perfectness at all for the case
where some archive-related parameters are changed after a
crash. Anyway we need to take a backup after that and at least all
segments required for the backup will be properly archived. The
segments up to the XLOG_SWITCH segment are harmless garbage, or a bit
of food for disk.

So probably we reached the consensus that something like the attached patch
(XLogArchiveCheckDone_walfile_xlog_switch.patch) is enough for the corner
case where walreceiver fails to create .ready file of WAL file including
XLOG_SWITCH?

Sounds convincing. Ok, I agree to that.

So barring any objection, I will commit the patch
and back-patch it to all supported version.

walreceiver_notify_archive_soon_v5.patch
walreceiver_notify_archive_soon_v5_pg14-13.patch
walreceiver_notify_archive_soon_v5_pg12-11.patch
walreceiver_notify_archive_soon_v5_pg10.patch
walreceiver_notify_archive_soon_v5_pg96.patch

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

XLogArchiveCheckDone_walfile_xlog_switch.patchtext/plain; charset=UTF-8; name=XLogArchiveCheckDone_walfile_xlog_switch.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..6046e24f0f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7392,6 +7392,27 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * In standby mode, create an archive notification file of a
+				 * WAL segment if it includes an XLOG_SWITCH record and its
+				 * notification file has not been created yet. This is
+				 * necessary to handle the corner case that walreceiver may
+				 * fail to create such notification file if it exits after it
+				 * receives XLOG_SWITCH record but while it's receiving the
+				 * remaining bytes in the segment. Without this handling, WAL
+				 * archiving of the segment will be delayed until subsequent
+				 * checkpoint creates its notification file when removing it
+				 * even though it can be archived soon.
+				 */
+				if (StandbyMode && record->xl_rmid == RM_XLOG_ID &&
+					(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+				{
+					char		xlogfilename[MAXFNAMELEN];
+
+					XLogFileName(xlogfilename, curFileTLI, readSegNo, wal_segment_size);
+					XLogArchiveCheckDone(xlogfilename);
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().
walreceiver_notify_archive_soon_v5_pg10.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v5_pg10.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 75ec985953..2818bf5e25 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -118,6 +118,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -920,46 +921,16 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo))
-		{
-			bool		use_existent;
-
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
+		/* Close the current segment if it's completed */
+		if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo))
+			XLogWalRcvClose(recptr);
 
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
+		if (recvFile < 0)
+		{
+			bool		use_existent = true;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
-			use_existent = true;
 			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
 			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
@@ -1011,6 +982,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo))
+		XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1065,6 +1045,52 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	char		xlogfname[MAXFNAMELEN];
+
+	Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo));
+
+	/*
+	 * fsync() and close current file before we switch to next one. We would
+	 * otherwise have to reopen this file to fsync it later
+	 */
+	XLogWalRcvFlush(false);
+
+	XLogFileName(xlogfname, recvFileTLI, recvSegNo);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						xlogfname)));
+
+	/*
+	 * Create .done file forcibly to prevent the streamed segment from being
+	 * archived later.
+	 */
+	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+		XLogArchiveForceDone(xlogfname);
+	else
+		XLogArchiveNotify(xlogfname);
+
+	recvFile = -1;
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
walreceiver_notify_archive_soon_v5_pg12-11.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v5_pg12-11.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c6cb90da81..441af76c0f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -119,6 +119,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -906,46 +907,16 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
-		{
-			bool		use_existent;
-
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
+		/* Close the current segment if it's completed */
+		if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+			XLogWalRcvClose(recptr);
 
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
+		if (recvFile < 0)
+		{
+			bool		use_existent = true;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			use_existent = true;
 			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
 			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
@@ -997,6 +968,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1051,6 +1031,52 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	char		xlogfname[MAXFNAMELEN];
+
+	Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size));
+
+	/*
+	 * fsync() and close current file before we switch to next one. We would
+	 * otherwise have to reopen this file to fsync it later
+	 */
+	XLogWalRcvFlush(false);
+
+	XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						xlogfname)));
+
+	/*
+	 * Create .done file forcibly to prevent the streamed segment from being
+	 * archived later.
+	 */
+	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+		XLogArchiveForceDone(xlogfname);
+	else
+		XLogArchiveNotify(xlogfname);
+
+	recvFile = -1;
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
walreceiver_notify_archive_soon_v5_pg14-13.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v5_pg14-13.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index faeea9f0cc..4831a259c4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,47 +884,16 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
-		{
-			bool		use_existent;
-
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
+		/* Close the current segment if it's completed */
+		if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+			XLogWalRcvClose(recptr);
 
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
+		if (recvFile < 0)
+		{
+			bool		use_existent = true;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
-			use_existent = true;
 			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
 			recvFileTLI = ThisTimeLineID;
 		}
@@ -970,6 +940,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1023,6 +1002,52 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	char		xlogfname[MAXFNAMELEN];
+
+	Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size));
+
+	/*
+	 * fsync() and close current file before we switch to next one. We would
+	 * otherwise have to reopen this file to fsync it later
+	 */
+	XLogWalRcvFlush(false);
+
+	XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						xlogfname)));
+
+	/*
+	 * Create .done file forcibly to prevent the streamed segment from being
+	 * archived later.
+	 */
+	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+		XLogArchiveForceDone(xlogfname);
+	else
+		XLogArchiveNotify(xlogfname);
+
+	recvFile = -1;
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
walreceiver_notify_archive_soon_v5_pg96.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v5_pg96.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 5c3a94cc3d..85b3f6def3 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -144,6 +144,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -940,46 +941,16 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo))
-		{
-			bool		use_existent;
-
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
+		/* Close the current segment if it's completed */
+		if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo))
+			XLogWalRcvClose(recptr);
 
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
+		if (recvFile < 0)
+		{
+			bool		use_existent = true;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
-			use_existent = true;
 			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
 			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
@@ -1031,6 +1002,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo))
+		XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1085,6 +1065,52 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	char		xlogfname[MAXFNAMELEN];
+
+	Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo));
+
+	/*
+	 * fsync() and close current file before we switch to next one. We would
+	 * otherwise have to reopen this file to fsync it later
+	 */
+	XLogWalRcvFlush(false);
+
+	XLogFileName(xlogfname, recvFileTLI, recvSegNo);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						xlogfname)));
+
+	/*
+	 * Create .done file forcibly to prevent the streamed segment from being
+	 * archived later.
+	 */
+	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+		XLogArchiveForceDone(xlogfname);
+	else
+		XLogArchiveNotify(xlogfname);
+
+	recvFile = -1;
+}
+
 /*
  * Send reply message to primary, indicating our current XLOG positions, oldest
  * xmin and the current time.
walreceiver_notify_archive_soon_v5.patchtext/plain; charset=UTF-8; name=walreceiver_notify_archive_soon_v5.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9a2bc37fd7..b90e5ca98e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -125,6 +125,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvClose(XLogRecPtr recptr);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -883,42 +884,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		/* Close the current segment if it's completed */
+		if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+			XLogWalRcvClose(recptr);
+
+		if (recvFile < 0)
 		{
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
-
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			recvFile = XLogFileInit(recvSegNo);
@@ -967,6 +938,15 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 	/* Update shared-memory status */
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+
+	/*
+	 * Close the current segment if it's fully written up in the last cycle of
+	 * the loop, to create its archive notification file soon. Otherwise WAL
+	 * archiving of the segment will be delayed until any data in the next
+	 * segment is received and written.
+	 */
+	if (recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		XLogWalRcvClose(recptr);
 }
 
 /*
@@ -1020,6 +1000,52 @@ XLogWalRcvFlush(bool dying)
 	}
 }
 
+/*
+ * Close the current segment.
+ *
+ * Flush the segment to disk before closing it. Otherwise we have to
+ * reopen and fsync it later.
+ *
+ * Create an archive notification file since the segment is known completed.
+ */
+static void
+XLogWalRcvClose(XLogRecPtr recptr)
+{
+	char		xlogfname[MAXFNAMELEN];
+
+	Assert(recvFile >= 0 && !XLByteInSeg(recptr, recvSegNo, wal_segment_size));
+
+	/*
+	 * fsync() and close current file before we switch to next one. We would
+	 * otherwise have to reopen this file to fsync it later
+	 */
+	XLogWalRcvFlush(false);
+
+	XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						xlogfname)));
+
+	/*
+	 * Create .done file forcibly to prevent the streamed segment from being
+	 * archived later.
+	 */
+	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+		XLogArchiveForceDone(xlogfname);
+	else
+		XLogArchiveNotify(xlogfname);
+
+	recvFile = -1;
+}
+
 /*
  * Send reply message to primary, indicating our current WAL locations, oldest
  * xmin and the current time.
#15Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Fujii Masao (#14)
1 attachment(s)
Re: Possible missing segments in archiving on standby

On 2021/09/08 22:41, Fujii Masao wrote:

So probably we reached the consensus that something like the attached patch
(XLogArchiveCheckDone_walfile_xlog_switch.patch) is enough for the corner
case where walreceiver fails to create .ready file of WAL file including
XLOG_SWITCH?

I attached the patch again, just in the case.

Sounds convincing.  Ok, I agree to that.

So barring any objection, I will commit the patch
and back-patch it to all supported version.

Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachments:

XLogArchiveCheckDone_walfile_xlog_switch.patchtext/plain; charset=UTF-8; name=XLogArchiveCheckDone_walfile_xlog_switch.patch; x-mac-creator=0; x-mac-type=0Download
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..6046e24f0f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7392,6 +7392,27 @@ StartupXLOG(void)
 				/* Handle interrupt signals of startup process */
 				HandleStartupProcInterrupts();
 
+				/*
+				 * In standby mode, create an archive notification file of a
+				 * WAL segment if it includes an XLOG_SWITCH record and its
+				 * notification file has not been created yet. This is
+				 * necessary to handle the corner case that walreceiver may
+				 * fail to create such notification file if it exits after it
+				 * receives XLOG_SWITCH record but while it's receiving the
+				 * remaining bytes in the segment. Without this handling, WAL
+				 * archiving of the segment will be delayed until subsequent
+				 * checkpoint creates its notification file when removing it
+				 * even though it can be archived soon.
+				 */
+				if (StandbyMode && record->xl_rmid == RM_XLOG_ID &&
+					(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+				{
+					char		xlogfilename[MAXFNAMELEN];
+
+					XLogFileName(xlogfilename, curFileTLI, readSegNo, wal_segment_size);
+					XLogArchiveCheckDone(xlogfilename);
+				}
+
 				/*
 				 * Pause WAL replay, if requested by a hot-standby session via
 				 * SetRecoveryPause().