WAL recycle retarding based on active sync rep.

Started by Kyotaro HORIGUCHI about 9 years ago, 5 messages
#1 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
2 attachment(s)

Hello.

We saw too-early WAL recycling during a test on a synchronous
replication setup. This is not a bug, and it is a somewhat extreme
case, but it is contrary to expectations of synchronous replication.

FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000088 has already been removed

This is because synchronous replication does not wait for non-commit
WAL records to be replicated. The situation can be caused artificially
with the first attached patch and the following steps.

- Configure a master with max_wal_size=80MB, min_wal_size=48MB, and
synchronous_standby_names='*', then start it.

- Configure a replica using pg_basebackup and start it.
Create the file /tmp/slow to make the walreceiver delay replication.

- On the master, run:
=# create table t (a int);
=# insert into t (select * from generate_series(0, 2000000));
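
For illustration only, here is a tiny standalone C toy (not PostgreSQL
source) modelling the point above: only the commit record makes a
backend wait for the synchronous standby, so the bulk of a large
transaction's WAL is generated without ever blocking on replication.
The stub names and sizes are invented and only loosely mirror the real
RecordTransactionCommit()/SyncRepWaitForLSN() behavior.

/*
 * Toy model, not PostgreSQL source: a large transaction emits many data
 * records without waiting; only the final commit record waits for the
 * synchronous standby.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr insert_lsn;	/* pretend WAL insert position */

static XLogRecPtr
toy_xlog_insert(const char *what, uint64_t len)
{
	insert_lsn += len;			/* data records never block on the standby */
	printf("wrote %-14s insert LSN now %09lX\n",
		   what, (unsigned long) insert_lsn);
	return insert_lsn;
}

static void
toy_syncrep_wait(XLogRecPtr lsn)
{
	/* Reached only at commit: block until the standby has written up to lsn. */
	printf("waiting for the sync standby to reach %09lX\n",
		   (unsigned long) lsn);
}

int
main(void)
{
	int			i;
	XLogRecPtr	commit_lsn;

	/* 5 x 16 MB of insert records: about the max_wal_size=80MB used above. */
	for (i = 0; i < 5; i++)
		toy_xlog_insert("insert batch", 16 * 1024 * 1024);

	/* The only point at which the backend waits for the synchronous standby. */
	commit_lsn = toy_xlog_insert("commit record", 100);
	toy_syncrep_wait(commit_lsn);
	return 0;
}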

I can think of the following two approaches:

A. Hold back WAL recycling to the point that sync replication has reached.

B. Block WAL insertion until sync replication reaches the first
surviving segment.

The second attached patch implements approach A. It makes
CreateCheckPoint take satisfied synchronous replication into account
during WAL recycling: if WAL segments about to be recycled are still
required by the currently satisfied sync replication, it keeps the
required segments and emits the following message.

WARNING: sync replication too retarded. 2 extra WAL segments are preserved (last segno to preserve is moved from 185 to 183)
HINT: If you see this message too frequently, consider increasing wal_keep_segments or max_wal_size.
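
To make the arithmetic behind that message concrete, here is a minimal
standalone sketch (not PostgreSQL code): XLByteToSeg() with the default
16 MB segment size is essentially a division, and the two LSN values
below are invented so that they reproduce the 185/183 example above.

/* Standalone sketch of the segment arithmetic; the LSN values are made up. */
#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE	(16 * 1024 * 1024)	/* default XLOG segment size */

static uint64_t
lsn_to_segno(uint64_t lsn)
{
	return lsn / WAL_SEG_SIZE;	/* what XLByteToSeg() boils down to */
}

int
main(void)
{
	uint64_t	checkpoint_cutoff = 0xB9000000;	/* segno 185: normal keep point */
	uint64_t	syncrep_write_lsn = 0xB7400000;	/* segno 183: oldest sync write */

	uint64_t	keep_from = lsn_to_segno(checkpoint_cutoff);
	uint64_t	sync_keep = lsn_to_segno(syncrep_write_lsn);

	if (sync_keep < keep_from)
		printf("preserve %lu extra segment(s); cutoff moves from %lu to %lu\n",
			   (unsigned long) (keep_from - sync_keep),
			   (unsigned long) keep_from, (unsigned long) sync_keep);
	return 0;
}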

This preservation behavior is somewhat similar to what a replication
slot does, but it does nothing when synchronous replication is not
satisfied. Perhaps a GUC such as max_temporary_preserve_segments is
needed to limit the number of extra segments.
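
Purely to illustrate that idea, here is a standalone sketch of how such
a cap could be applied to the cutoff segment; the GUC name
max_temporary_preserve_segments and the helper below are hypothetical
and are not part of the attached patch.

/* Hypothetical sketch only: capping how far back the cutoff may move. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogSegNo;

/* Stand-in for a hypothetical max_temporary_preserve_segments GUC. */
static uint64_t max_temporary_preserve_segments = 1;

static XLogSegNo
cap_preserved_segments(XLogSegNo cutoff_segno, XLogSegNo sync_segno)
{
	if (sync_segno >= cutoff_segno)
		return cutoff_segno;	/* standby has caught up; keep the cutoff */
	if (cutoff_segno - sync_segno > max_temporary_preserve_segments)
		return cutoff_segno - max_temporary_preserve_segments;	/* enforce cap */
	return sync_segno;			/* keep everything back to the standby */
}

int
main(void)
{
	/* With a cap of 1, the cutoff from the example moves to 184, not 183. */
	printf("capped cutoff: %lu\n",
		   (unsigned long) cap_preserved_segments(185, 183));
	return 0;
}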

- Does this situation need to be prevented? It happens when a large
transaction spans over two max_wal_size's worth of segments, or when
a replication stall lasts for a checkpoint period.

- Is the measure acceptable? In the worst case, the master crashes
from WAL space exhaustion. (But perhaps such a large transaction
won't, or shouldn't, exist?)

Or any other comments?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Slows-replication-processing.patch (text/x-patch; charset=us-ascii)
From 7234376b2d6fa6b86cc9e4ed95a52af7bd6225e6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 18 Nov 2016 12:44:25 +0900
Subject: [PATCH 1/2] Slows replication processing.

To cause sync-rep desync artificially, sleep 100ms for each
replication message if /tmp/slow exists.
---
 src/backend/replication/walreceiver.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2bb3dce..57f3ad2 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -439,6 +439,11 @@ WalReceiverMain(void)
 							 */
 							last_recv_timestamp = GetCurrentTimestamp();
 							ping_sent = false;
+							{
+								struct stat b;
+								if (stat("/tmp/slow", &b) >= 0)
+									usleep(100000);
+							}
 							XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
 						}
 						else if (len == 0)
-- 
2.9.2

0002-Preserve-WAL-segments-requred-by-synchronous-standby.patch (text/x-patch; charset=us-ascii)
From 3368c16e7a8f30216e7d9579f5d2ca3b923259d5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 18 Nov 2016 13:14:55 +0900
Subject: [PATCH 2/2] Preserve WAL segments required by synchronous standbys.

Since a synchronous standby doesn't sync non-commit records, a large
transaction may unexpectedly break sync replication. This patch
makes CreateCheckPoint preserve all WAL segments required by the
currently established synchronous replication.
---
 src/backend/access/transam/xlog.c | 26 ++++++++++++++++++++++++++
 src/backend/replication/syncrep.c | 23 ++++++++++-------------
 src/include/replication/syncrep.h |  4 ++++
 3 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cec027..195272e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -49,6 +49,7 @@
 #include "replication/slot.h"
 #include "replication/origin.h"
 #include "replication/snapbuild.h"
+#include "replication/syncrep.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/barrier.h"
@@ -8628,12 +8629,37 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 	{
 		XLogSegNo	_logSegNo;
+		bool		in_sync, am_sync;
+		XLogRecPtr	repwriteptr, repflushptr, repapplyptr;
 
 		/* Update the average distance between checkpoints. */
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
 		XLByteToSeg(PriorRedoPtr, _logSegNo);
 		KeepLogSeg(recptr, &_logSegNo);
+
+		/*
+		 * If I am under satisfied synchronous replication, refrain from
+		 * removing segments apparently required by them. Referring to the write
+		 * pointer is enough.
+		 */
+		in_sync = SyncRepGetOldestSyncRecPtr(&repwriteptr, &repflushptr,
+											 &repapplyptr, &am_sync, true);
+		if (in_sync && repwriteptr != InvalidXLogRecPtr)
+		{
+			XLogSegNo synckeep;
+
+			XLByteToSeg(repwriteptr, synckeep);
+			if (synckeep < _logSegNo)
+			{
+				ereport(WARNING,
+						(errmsg("sync replication too retarded. %lu extra WAL segments are preserved (last segno to preserve is moved from %lx to %lx)",
+								_logSegNo - synckeep, _logSegNo, synckeep),
+						 errhint("If you see this message too frequently, consider increasing wal_keep_segments or max_wal_size.")));
+				_logSegNo = synckeep;
+			}
+		}
+
 		_logSegNo--;
 		RemoveOldXlogFiles(_logSegNo, PriorRedoPtr, recptr);
 	}
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ac29f56..343217bc 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -86,10 +86,6 @@ static void SyncRepQueueInsert(int mode);
 static void SyncRepCancelWait(void);
 static int	SyncRepWakeQueue(bool all, int mode);
 
-static bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
-						   XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr,
-						   bool *am_sync);
 static int	SyncRepGetStandbyPriority(void);
 
 #ifdef USE_ASSERT_CHECKING
@@ -417,7 +413,7 @@ SyncRepReleaseWaiters(void)
 	 * positions among all sync standbys.
 	 */
 	got_oldest = SyncRepGetOldestSyncRecPtr(&writePtr, &flushPtr,
-											&applyPtr, &am_sync);
+											&applyPtr, &am_sync, false);
 
 	/*
 	 * If we are managing a sync standby, though we weren't prior to this,
@@ -473,16 +469,17 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the oldest Write, Flush and Apply positions among sync standbys.
  *
- * Return false if the number of sync standbys is less than
- * synchronous_standby_names specifies. Otherwise return true and
- * store the oldest positions into *writePtr, *flushPtr and *applyPtr.
+ * Return true if the oldest WAL locations are set. They are set only when the
+ * synchronous_standby_names is defined and satisfied, and I am connected from
+ * a synchronous standby. The locations are set even when not connected from a
+ * synchronous standby if force is true.
  *
- * On return, *am_sync is set to true if this walsender is connecting to
- * sync standby. Otherwise it's set to false.
+ * *am_sync is set to true on return if this walsender is connecting to a
+ * synchronous standby.
  */
-static bool
+bool
 SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
-						   XLogRecPtr *applyPtr, bool *am_sync)
+						   XLogRecPtr *applyPtr, bool *am_sync, bool force)
 {
 	List	   *sync_standbys;
 	ListCell   *cell;
@@ -499,7 +496,7 @@ SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	 * Quick exit if we are not managing a sync standby or there are not
 	 * enough synchronous standbys.
 	 */
-	if (!(*am_sync) ||
+	if ((!force && !(*am_sync)) ||
 		SyncRepConfig == NULL ||
 		list_length(sync_standbys) < SyncRepConfig->num_sync)
 	{
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index e4e0e27..0c42523 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -71,6 +71,10 @@ extern List *SyncRepGetSyncStandbys(bool *am_sync);
 
 /* called by checkpointer */
 extern void SyncRepUpdateSyncStandbysDefined(void);
+extern bool SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
+									   XLogRecPtr *flushPtr,
+									   XLogRecPtr *applyPtr,
+									   bool *am_sync, bool force);
 
 /* GUC infrastructure */
 extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
-- 
2.9.2

#2 Craig Ringer
craig.ringer@2ndquadrant.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: WAL recycle retarding based on active sync rep.

On 18 Nov. 2016 13:14, "Kyotaro HORIGUCHI" <horiguchi.kyotaro@lab.ntt.co.jp>
wrote:

Hello.

We saw too-early WAL recycling during a test on a synchronous
replication setup. This is not a bug, and it is a somewhat extreme
case, but it is contrary to expectations of synchronous replication.

Isn't this prevented by using a physical replication slot?

You hint that you looked at slots but they didn't meet your needs in some
way. I'm not sure I understood the last part.

#3 Andres Freund
andres@anarazel.de
In reply to: Kyotaro HORIGUCHI (#1)
Re: WAL recycle retarding based on active sync rep.

Hi,

On 2016-11-18 14:12:42 +0900, Kyotaro HORIGUCHI wrote:

We saw too-early WAL recycling during a test on a synchronous
replication setup. This is not a bug, and it is a somewhat extreme
case, but it is contrary to expectations of synchronous replication.

I don't think you can expect anything else.

This is because synchronous replication does not wait for non-commit
WAL records to be replicated. The situation can be caused artificially
with the first attached patch and the following steps.

You could get that situation even if we waited for syncrep. The
SyncRepWaitForLSN happens after delayChkpt is unset.

Additionally, a syncrep connection could break for a short while, and
you'd lose all guarantees anyway.

- Does this situation need to be prevented? It happens when a large
transaction spans over two max_wal_size's worth of segments, or when
a replication stall lasts for a checkpoint period.

I very strongly think not.

- Is the measure acceptable? In the worst case, the master crashes
from WAL space exhaustion. (But perhaps such a large transaction
won't, or shouldn't, exist?)

No, imo not.

Greetings,

Andres Freund


#4 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#1)
Re: WAL recycle retarding based on active sync rep.

Thanks for the comment.

At Fri, 18 Nov 2016 17:06:55 +0800, Craig Ringer <craig.ringer@2ndquadrant.com> wrote in <CAMsr+YGkmJ2aweanT4JF9_i_xS_bGTZkdKW-_=2A88yEGansPA@mail.gmail.com>

We saw too-early WAL recycling during a test on a synchronous
replication setup. This is not a bug, and it is a somewhat extreme
case, but it is contrary to expectations of synchronous replication.

Isn't this prevented by using a physical replication slot?

You hint that you looked at slots but they didn't meet your needs in some
way. I'm not sure I understood the last part.

Yes, a replication slot does something similar. The point was the
question "Do we expect that removal of necessary WAL never occurs on
an active sync replication?", about which I had strong doubts.

At Fri, 18 Nov 2016 10:16:22 -0800, Andres Freund <andres@anarazel.de> wrote in <20161118181622.hklschaizwaxocl7@alap3.anarazel.de>

On 2016-11-18 14:12:42 +0900, Kyotaro HORIGUCHI wrote:

We saw too-early WAL recycling during a test on a synchronous
replication setup. This is not a bug, and it is a somewhat extreme
case, but it is contrary to expectations of synchronous replication.

I don't think you can expect anything else.

I think this is the answer to it.

regards,

--
Kyotaro Horiguchi

Nippon Telegraph and Telephone Corporation, NTT Open Source Software Center
Phone: 03-5860-5115 / Fax: 03-5463-5490


#5 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Andres Freund (#3)
Re: WAL recycle retarding based on active sync rep.

Hello,

At Fri, 18 Nov 2016 10:16:22 -0800, Andres Freund <andres@anarazel.de> wrote in <20161118181622.hklschaizwaxocl7@alap3.anarazel.de>

Hi,

On 2016-11-18 14:12:42 +0900, Kyotaro HORIGUCHI wrote:

We saw too-early WAL recycling during a test on a synchronous
replication setup. This is not a bug, and it is a somewhat extreme
case, but it is contrary to expectations of synchronous replication.

I don't think you can expect anything else.

My sentence was inaccurate; it should have said "is contrary to
*naive* expectations of synchronous replication." But I agree with you.

This is because synchronous replication does not wait for non-commit
WAL records to be replicated. The situation can be caused artificially
with the first attached patch and the following steps.

You could get that situation even if we waited for syncrep. The
SyncRepWaitForLSN happens after delayChkpt is unset.

Additionally, a syncrep connection could break for a short while, and
you'd lose all guarantees anyway.

I know. Replication slots are for such cases.

- Does this situation need to be prevented? It happens when a large
transaction spans over two max_wal_size's worth of segments, or when
a replication stall lasts for a checkpoint period.

I very strongly think not.

- Is the measure acceptable? In the worst case, the master crashes
from WAL space exhaustion. (But perhaps such a large transaction
won't, or shouldn't, exist?)

No, imo not.

Thanks for clarifying that.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
