Walreceiver fsyncs excessively

Started by Heikki Linnakangasalmost 15 years ago3 messages
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
1 attachment(s)

While testing Fujii-san's patch to flush any already-written WAL on
error in walreceiver, I added a debugging elog to XLogWalRcvFlush() to
print out how far it has written and flushed.

I saw an unexpected pattern while the standby catches up with the master:

LOG: streaming replication successfully connected to primary
LOG: flushing flush=0/0 write=0/1E020000
LOG: flushing flush=0/1E020000 write=0/1E040000
LOG: flushing flush=0/1E040000 write=0/1E060000
LOG: flushing flush=0/1E060000 write=0/1E080000
LOG: flushing flush=0/1E080000 write=0/1E0A0000
LOG: flushing flush=0/1E0A0000 write=0/1E0C0000
LOG: flushing flush=0/1E0C0000 write=0/1E0E0000
LOG: flushing flush=0/1E0E0000 write=0/1E100000
LOG: flushing flush=0/1E100000 write=0/1E120000
LOG: flushing flush=0/1E120000 write=0/1E140000

The master sends the WAL at full-speed, but walreceiver always fsyncs it
in 128 kB chunks. That's excessive, it should be able to read and write
to disk the whole WAL segment before flushing.

There's a little flaw in the walreceiver logic that tries to read all
the available WAL before flushing and sleeping. The way libpqrcv_receive
is written, when it's called with timeout==0 it will not call
PQconsumeInput. So what happens is that when walreceiver main loop calls
libpqrcv_receive() in a loop to fetch all WAL that's available without
blocking, it actually only reads the WAL that's in the libpq receive
buffer - it will not read the WAL that's in the TCP read buffer.

Attached patch fixes libpqrcv_receive() so that it calls
PQconsumeInput() before concluding that there's no data available. The
excessive fsyncing can lead to very bad performance, so this needs to be
appled to 9.0 too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fix-excessive-walreceiver-fsyncing.patchtext/x-diff; name=fix-excessive-walreceiver-fsyncing.patchDownload
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 5aac85d..9e8504b 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -351,28 +351,33 @@ libpqrcv_receive(int timeout, unsigned char *type, char **buffer, int *len)
 		PQfreemem(recvBuf);
 	recvBuf = NULL;
 
-	/*
-	 * If the caller requested to block, wait for data to arrive. But if this
-	 * is the first call after connecting, don't wait, because there might
-	 * already be some data in libpq buffer that we haven't returned to
-	 * caller.
-	 */
-	if (timeout > 0 && !justconnected)
+	/* Try to receive a CopyData message */
+	rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
+	if (rawlen == 0)
 	{
-		if (!libpq_select(timeout))
-			return false;
+		/*
+		 * No data available yet. If the caller requested to block, wait for
+		 * more data to arrive. But if this is the first call after connecting,
+		 * don't wait, because there might already be some data in libpq buffer
+		 * that we haven't returned to caller.
+		 */
+		if (timeout > 0 && !justconnected)
+		{
+			if (!libpq_select(timeout))
+				return false;
+		}
+		justconnected = false;
 
 		if (PQconsumeInput(streamConn) == 0)
 			ereport(ERROR,
 					(errmsg("could not receive data from WAL stream: %s",
 							PQerrorMessage(streamConn))));
-	}
-	justconnected = false;
 
-	/* Receive CopyData message */
-	rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
-	if (rawlen == 0)			/* no data available yet, then return */
-		return false;
+		/* Now that we've consumed some input, try again */
+		rawlen = PQgetCopyData(streamConn, &recvBuf, 1);
+		if (rawlen == 0)
+			return false;
+	}
 	if (rawlen == -1)			/* end-of-streaming or error */
 	{
 		PGresult   *res;
#2Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Walreceiver fsyncs excessively

On Thu, Jan 13, 2011 at 9:01 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

While testing Fujii-san's patch to flush any already-written WAL on error in
walreceiver, I added a debugging elog to XLogWalRcvFlush() to print out how
far it has written and flushed.

I saw an unexpected pattern while the standby catches up with the master:

LOG:  streaming replication successfully connected to primary
LOG:   flushing flush=0/0 write=0/1E020000
LOG:   flushing flush=0/1E020000 write=0/1E040000
LOG:   flushing flush=0/1E040000 write=0/1E060000
LOG:   flushing flush=0/1E060000 write=0/1E080000
LOG:   flushing flush=0/1E080000 write=0/1E0A0000
LOG:   flushing flush=0/1E0A0000 write=0/1E0C0000
LOG:   flushing flush=0/1E0C0000 write=0/1E0E0000
LOG:   flushing flush=0/1E0E0000 write=0/1E100000
LOG:   flushing flush=0/1E100000 write=0/1E120000
LOG:   flushing flush=0/1E120000 write=0/1E140000

The master sends the WAL at full-speed, but walreceiver always fsyncs it in
128 kB chunks. That's excessive, it should be able to read and write to disk
the whole WAL segment before flushing.

There's a little flaw in the walreceiver logic that tries to read all the
available WAL before flushing and sleeping. The way libpqrcv_receive is
written, when it's called with timeout==0 it will not call PQconsumeInput.
So what happens is that when walreceiver main loop calls libpqrcv_receive()
in a loop to fetch all WAL that's available without blocking, it actually
only reads the WAL that's in the libpq receive buffer - it will not read the
WAL that's in the TCP read buffer.

Attached patch fixes libpqrcv_receive() so that it calls PQconsumeInput()
before concluding that there's no data available. The excessive fsyncing can
lead to very bad performance, so this needs to be appled to 9.0 too.

Seems good.

Can we remove the "justconnected" flag, thanks to the patch?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#2)
Re: Walreceiver fsyncs excessively

On 13.01.2011 14:34, Fujii Masao wrote:

On Thu, Jan 13, 2011 at 9:01 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Attached patch fixes libpqrcv_receive() so that it calls PQconsumeInput()
before concluding that there's no data available. The excessive fsyncing can
lead to very bad performance, so this needs to be appled to 9.0 too.

Seems good.

Can we remove the "justconnected" flag, thanks to the patch?

Yes, good point. Committed with "justconnected" removed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com