Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

Started by Michael Paquier · almost 13 years ago · 25 messages
#1 Michael Paquier
michael.paquier@gmail.com

Hi all,

There is a strange bug with the latest master head (commit 7fcbf6a).
When the WAL stream with the master is cut on a slave, the slave returns a FATAL
(well, normal...), but then enters the recovery process and automatically
promotes.
Here are more details about the logs on the slave (I simply killed the master
manually):
FATAL: could not receive data from WAL stream:
cp: cannot stat
‘/home/michael/bin/pgsql/archive/master/000000010000000000000004’: No such
file or directory
LOG: record with zero length at 0/401E1B8
LOG: redo done at 0/401E178
LOG: last completed transaction was at log time 2013-01-17
20:27:53.180971+09
cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000002.history’:
No such file or directory
LOG: selected new timeline ID: 2
cp: cannot stat ‘/home/michael/bin/pgsql/archive/master/00000001.history’:
No such file or directory
LOG: archive recovery complete
DEBUG: resetting unlogged relations: cleanup 0 init 1
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
DEBUG: archived transaction log file "000000010000000000000004"
DEBUG: archived transaction log file "00000002.history"
LOG: statement: create table bn (a int);
DEBUG: autovacuum: processing database "postgres"

The slave no longer tries to reconnect to the master with messages of the type:
FATAL: could not connect to the primary server

I also noticed that there is some delay until modifications on the master are
visible on the slave.
For example, run a simple CREATE TABLE and the new table is not
immediately visible on the slave.

[some bisecting later...]

I think this bug was introduced by commit 7fcbf6a.
Before xlog reading was split out as a separate facility, things worked
correctly.
There were also no delay problems before this commit.

Has anyone else noticed this?
--
Michael Paquier
http://michael.otacoo.com

#2 Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#1)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

Hi all,

There is a strange bug with the latest master head (commit 7fcbf6a).
When the WAL stream with a master is cut on a slave, slave returns a FATAL
(well normal...), but then enters in recovery process and automatically
promotes.
[...]

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Has anyone else noticed this?

No, at least I haven't, but it got committed only a rather short time
ago ;)
Looking into it, I have an inkling where the problem could be.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#1)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

Hi all,

There is a strange bug with the latest master head (commit 7fcbf6a).
When the WAL stream with a master is cut on a slave, slave returns a FATAL
(well normal...), but then enters in recovery process and automatically
promotes.
[...]

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Ok, my inkling proved to be correct; it's a few related issues:

a) The error handling in XLogReadRecord is inconsistent; it doesn't
always reset the necessary things.

b) ReadRecord: We cannot break out of the retry loop in ReadRecord
just like that; simply removing the break seems correct.

c) ReadRecord (minor): We should log an error even if errormsg is not
set, otherwise we won't jump out far enough.

I think at least a) and b) are the result of merges between development
branches of different people, sorry for that.
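To make point a) concrete, here is a toy sketch (all names hypothetical, loosely modeled on xlogreader's errormsg_buf/*errormsg contract) of the pattern the fix aims for: every failure path funnels through one error exit, so the caller-visible *errormsg is set consistently instead of each return site repeating, and sometimes forgetting, the publish step:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ERRBUF_LEN 128

/* Toy stand-in for XLogReaderState: just the error message buffer. */
typedef struct ToyReader
{
	char		errormsg_buf[ERRBUF_LEN];
} ToyReader;

static void
toy_report_invalid(ToyReader *state, const char *msg)
{
	snprintf(state->errormsg_buf, ERRBUF_LEN, "%s", msg);
}

/*
 * Returns a "record" (a non-NULL pointer here) on success.  On any failure
 * returns NULL, and *errormsg points at the buffer iff a message was
 * recorded - exactly one exit path handles that, never the call sites.
 */
static const char *
toy_read_record(ToyReader *state, int fail_step, char **errormsg)
{
	static const char ok_record[] = "record";

	*errormsg = NULL;
	state->errormsg_buf[0] = '\0';

	if (fail_step == 1)
	{
		toy_report_invalid(state, "invalid record offset");
		goto err;
	}
	if (fail_step == 2)			/* e.g. a page read that already logged */
		goto err;

	return ok_record;

err:
	/* single, consistent error exit */
	if (state->errormsg_buf[0] != '\0')
		*errormsg = state->errormsg_buf;
	return NULL;
}
```

The attached patch below does exactly this inside the real XLogReadRecord by replacing the per-site `*errormsg = state->errormsg_buf; return NULL;` blocks with `goto err`.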

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#3)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 17.01.2013 15:05, Andres Freund wrote:

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Ok, my inkling proved to be correct, its two related issues:

a ) The error handling in XLogReadRecord is inconsistent, it doesn't
always reset the necessary things.

b) ReadRecord: We cannot not break out of the retry loop in readRecord
just so, just removing the break seems correct.

c) ReadRecord: (minor): We should log an error even if errormsg is not
set, otherwise we wont jump out far enough.

I think at least a) and b) is the result of merges between development
of different people, sorry for that.

Got a patch?

- Heikki


#5 Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#4)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:

On 17.01.2013 15:05, Andres Freund wrote:

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Ok, my inkling proved to be correct, its two related issues:

a ) The error handling in XLogReadRecord is inconsistent, it doesn't
always reset the necessary things.

b) ReadRecord: We cannot not break out of the retry loop in readRecord
just so, just removing the break seems correct.

c) ReadRecord: (minor): We should log an error even if errormsg is not
set, otherwise we wont jump out far enough.

I think at least a) and b) is the result of merges between development
of different people, sorry for that.

Got a patch?

Yes, I am just testing some scenarios with it, will send it after that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#5)
1 attachment(s)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 16:23:44 +0100, Andres Freund wrote:

On 2013-01-17 17:18:14 +0200, Heikki Linnakangas wrote:

On 17.01.2013 15:05, Andres Freund wrote:

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Ok, my inkling proved to be correct, its two related issues:

a ) The error handling in XLogReadRecord is inconsistent, it doesn't
always reset the necessary things.

b) ReadRecord: We cannot not break out of the retry loop in readRecord
just so, just removing the break seems correct.

c) ReadRecord: (minor): We should log an error even if errormsg is not
set, otherwise we wont jump out far enough.

I think at least a) and b) is the result of merges between development
of different people, sorry for that.

Got a patch?

Yes, I am just testing some scenarios with it, will send it after that.

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch. I also noticed that the TLI check
didn't mark the last source as failed.

Not a real issue, and it's independent from this patch, but I found that
when promoting from streaming replication the code first fails over to archive
recovery and only then to recovering from pg_xlog. Is that intended?
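The failover order in question can be sketched as a toy state transition (enum names hypothetical, loosely modeled on the real XLogSource values; the actual logic lives in the retry machinery and is more involved):

```c
#include <assert.h>

/*
 * Toy sketch of the recovery-source failover order being asked about:
 * after the streaming source fails, recovery retries the archive first
 * and only falls back to pg_xlog after the archive also fails.
 */
typedef enum
{
	TOY_FROM_ARCHIVE,
	TOY_FROM_PG_XLOG,
	TOY_FROM_STREAM
} ToyXLogSource;

static ToyXLogSource
toy_next_source(ToyXLogSource failed_source)
{
	switch (failed_source)
	{
		case TOY_FROM_STREAM:
			return TOY_FROM_ARCHIVE;	/* the ordering in question */
		case TOY_FROM_ARCHIVE:
			return TOY_FROM_PG_XLOG;
		case TOY_FROM_PG_XLOG:
			return TOY_FROM_STREAM;		/* loop back to streaming */
	}
	return TOY_FROM_ARCHIVE;			/* not reached */
}
```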

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

fix-error-handling-xlogreader.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70cfabc..e33bd7a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3209,12 +3209,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			/* not all failures fill errormsg; report those that do */
-			if (errormsg && errormsg[0] != '\0')
-				ereport(emode_for_corrupt_record(emode,
-												 RecPtr ? RecPtr : EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */));
-
 			lastSourceFailed = true;
 
 			if (readFile >= 0)
@@ -3222,7 +3216,23 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				close(readFile);
 				readFile = -1;
 			}
-			break;
+
+			/*
+			 * We only end up here without a message when XLogPageRead() failed
+			 * - in that case we already logged something.
+			 * In StandbyMode that only happens if we have been triggered, so
+			 * we shouldn't loop anymore in that case.
+			 */
+			if (errormsg == NULL)
+				break;
+
+
+			ereport(emode_for_corrupt_record(emode,
+											 RecPtr ? RecPtr : EndRecPtr),
+					(errmsg_internal("%s", errormsg) /* already translated */));
+
+			/* don't make a timeline check */
+			continue;
 		}
 
 		/*
@@ -3234,6 +3244,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLogSegNo segno;
 			int32 offset;
 
+			lastSourceFailed = true;
+
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3243,7 +3255,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 							xlogreader->latestPageTLI,
 							fname,
 							offset)));
-			return false;
+			continue;
 		}
 	} while (StandbyMode && record == NULL);
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ff871a3..75164f6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord);
 
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	/*
 	 * ReadPageInternal always returns at least the page header, so we can
@@ -246,8 +242,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		report_invalid_record(state, "invalid record offset at %X/%X",
 							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
@@ -255,8 +250,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		report_invalid_record(state, "contrecord is requested by %X/%X",
 							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	/* ReadPageInternal has verified the page header */
@@ -270,11 +264,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 							   targetPagePtr,
 						  Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	/*
 	 * Read the record length.
@@ -300,11 +290,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
 								   randAccess))
-		{
-			if (state->errormsg_buf[0] != '\0')
-				*errormsg = state->errormsg_buf;
-			return NULL;
-		}
+			goto err;
 		gotheader = true;
 	}
 	else
@@ -314,8 +300,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		{
 			report_invalid_record(state, "invalid record length at %X/%X",
 								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-			*errormsg = state->errormsg_buf;
-			return NULL;
+			goto err;
 		}
 		gotheader = false;
 	}
@@ -330,8 +315,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		report_invalid_record(state, "record length %u at %X/%X too long",
 							  total_len,
 							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
#7 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#6)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 17.01.2013 17:42, Andres Freund wrote:

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch. I also noticed that the TLI check
didn't mark the last source as failed.

This looks fragile:

/*
* We only end up here without a message when XLogPageRead() failed
* - in that case we already logged something.
* In StandbyMode that only happens if we have been triggered, so
* we shouldn't loop anymore in that case.
*/
if (errormsg == NULL)
break;

I don't like relying on the presence of an error message to control
logic like that. Should we throw in an explicit CheckForStandbyTrigger()
check in the condition of that loop?

- Heikki


#8 Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#7)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:

On 17.01.2013 17:42, Andres Freund wrote:

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch. I also noticed that the TLI check
didn't mark the last source as failed.

This looks fragile:

/*
* We only end up here without a message when XLogPageRead() failed
* - in that case we already logged something.
* In StandbyMode that only happens if we have been triggered, so
* we shouldn't loop anymore in that case.
*/
if (errormsg == NULL)
break;

I don't like relying on the presence of an error message to control logic
like that. Should we throw in an explicit CheckForStandbyTrigger() check in
the condition of that loop?

I agree, I wasn't too happy about that either. But in some sense it's
only propagating state from XLogPageRead, which has already dealt with
the error and decided it's ok.
It's the solution closest to what happened in the old implementation;
that's why I thought it would be halfway acceptable.

Adding the CheckForStandbyTrigger() to the condition would mean
promotion would happen before all the available records are processed,
and it would increase the number of stat()s tremendously.
So I don't really like that either.

Don't really have a good idea :(

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#8)
1 attachment(s)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 17.01.2013 18:42, Andres Freund wrote:

On 2013-01-17 18:33:42 +0200, Heikki Linnakangas wrote:

On 17.01.2013 17:42, Andres Freund wrote:

Ok, the attached patch seems to fix a) and b). c) above is bogus, as
explained in a comment in the patch. I also noticed that the TLI check
didn't mark the last source as failed.

This looks fragile:

/*
* We only end up here without a message when XLogPageRead() failed
* - in that case we already logged something.
* In StandbyMode that only happens if we have been triggered, so
* we shouldn't loop anymore in that case.
*/
if (errormsg == NULL)
break;

I don't like relying on the presence of an error message to control logic
like that. Should we throw in an explicit CheckForStandbyTrigger() check in
the condition of that loop?

I agree, I wasn't too happy about that either. But in some sense its
only propagating state from XLogReadPage which already has dealt with
the error and decided its ok.
Its the solution closest to what happened in the old implementation,
thats why I thought it would be halfway to acceptable.

Adding the CheckForStandbyTrigger() in the condition would mean
promotion would happen before all the available records are processed
and it would increase the amount of stat()s tremendously.
So I don't really like that either.

I was thinking of the attached. As long as we check for
CheckForStandbyTrigger() after the "record == NULL" check, we won't
perform extra stat() calls on successful reads, only when we're polling
after reaching the end of valid WAL. That seems acceptable. If we want
to avoid even that, we could move the static 'triggered' variable from
CheckForStandbyTrigger() out of that function, and check that in the loop.
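The cost argument can be made concrete with a toy model (all names hypothetical) of the loop shape proposed here, i.e. `} while (StandbyMode && record == NULL && !CheckForStandbyTrigger());` — because of short-circuit evaluation, the trigger is consulted only after a failed read, so successful reads never pay the stat() cost:

```c
#include <assert.h>

/* Toy stand-in for the standby-promotion trigger file check. */
typedef struct ToyStandby
{
	int			trigger_at;		/* check count at which promotion fires */
	int			trigger_checks; /* how often the trigger was consulted */
} ToyStandby;

static int
toy_check_for_trigger(ToyStandby *s)
{
	s->trigger_checks++;
	return s->trigger_checks >= s->trigger_at;
}

/*
 * reads[i] nonzero means read attempt i returns a record.  Returns 1 if a
 * record was obtained, 0 if the loop ended because the trigger fired.
 * The trigger check sits after the record == NULL test, as in the patch.
 */
static int
toy_retry_loop(const int *reads, int nreads, ToyStandby *s)
{
	int			pos = 0;
	int			record;

	do
	{
		record = (pos < nreads) ? reads[pos] : 0;
		pos++;
	} while (record == 0 && !toy_check_for_trigger(s));

	return record != 0;
}
```

A read that succeeds on the first attempt performs zero trigger checks; only the polling iterations after end-of-WAL pay for one check each.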

- Heikki

Attachments:

fix-error-handling-xlogreader-2.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70cfabc..ac2b26b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3180,7 +3180,8 @@ RestoreBackupBlock(XLogRecPtr lsn, XLogRecord *record, int block_index,
  * try to read a record just after the last one previously read.
  *
  * If no valid record is available, returns NULL, or fails if emode is PANIC.
- * (emode must be either PANIC, LOG)
+ * (emode must be either PANIC, LOG). In standby mode, retries until a valid
+ * record is available.
  *
  * The record is copied into readRecordBuf, so that on successful return,
  * the returned record pointer always points there.
@@ -3209,12 +3210,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			/* not all failures fill errormsg; report those that do */
-			if (errormsg && errormsg[0] != '\0')
-				ereport(emode_for_corrupt_record(emode,
-												 RecPtr ? RecPtr : EndRecPtr),
-						(errmsg_internal("%s", errormsg) /* already translated */));
-
 			lastSourceFailed = true;
 
 			if (readFile >= 0)
@@ -3222,7 +3217,20 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				close(readFile);
 				readFile = -1;
 			}
-			break;
+
+			/*
+			 * We only end up here without a message when XLogPageRead() failed
+			 * - in that case we already logged something.
+			 * In StandbyMode that only happens if we have been triggered, so
+			 * we shouldn't loop anymore in that case.
+			 */
+			if (errormsg)
+				ereport(emode_for_corrupt_record(emode,
+												 RecPtr ? RecPtr : EndRecPtr),
+						(errmsg_internal("%s", errormsg) /* already translated */));
+
+			/* Give up, or retry if we're in standby mode. */
+			continue;
 		}
 
 		/*
@@ -3234,6 +3242,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			XLogSegNo segno;
 			int32 offset;
 
+			lastSourceFailed = true;
+
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3243,9 +3253,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 							xlogreader->latestPageTLI,
 							fname,
 							offset)));
-			return false;
+			record = NULL;
+			continue;
 		}
-	} while (StandbyMode && record == NULL);
+	} while (StandbyMode && record == NULL && !CheckForStandbyTrigger());
 
 	return record;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ff871a3..75164f6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -222,11 +222,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogRecord);
 
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	/*
 	 * ReadPageInternal always returns at least the page header, so we can
@@ -246,8 +242,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		report_invalid_record(state, "invalid record offset at %X/%X",
 							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	if ((((XLogPageHeader) state->readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD) &&
@@ -255,8 +250,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		report_invalid_record(state, "contrecord is requested by %X/%X",
 							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	/* ReadPageInternal has verified the page header */
@@ -270,11 +264,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 							   targetPagePtr,
 						  Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
 	if (readOff < 0)
-	{
-		if (state->errormsg_buf[0] != '\0')
-			*errormsg = state->errormsg_buf;
-		return NULL;
-	}
+		goto err;
 
 	/*
 	 * Read the record length.
@@ -300,11 +290,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 	{
 		if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
 								   randAccess))
-		{
-			if (state->errormsg_buf[0] != '\0')
-				*errormsg = state->errormsg_buf;
-			return NULL;
-		}
+			goto err;
 		gotheader = true;
 	}
 	else
@@ -314,8 +300,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		{
 			report_invalid_record(state, "invalid record length at %X/%X",
 								  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-			*errormsg = state->errormsg_buf;
-			return NULL;
+			goto err;
 		}
 		gotheader = false;
 	}
@@ -330,8 +315,7 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		report_invalid_record(state, "record length %u at %X/%X too long",
 							  total_len,
 							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
-		*errormsg = state->errormsg_buf;
-		return NULL;
+		goto err;
 	}
 
 	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
#10 Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#9)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:

[...]

I was thinking of the attached. As long as we check for
CheckForStandbyTrigger() after the "record == NULL" check, we won't perform
extra stat() calls on successful reads, only when we're polling after
reaching the end of valid WAL. That seems acceptable.

Looks good to me.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#10)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 17.01.2013 18:55, Andres Freund wrote:

On 2013-01-17 18:50:35 +0200, Heikki Linnakangas wrote:

I was thinking of the attached. As long as we check for
CheckForStandbyTrigger() after the "record == NULL" check, we won't perform
extra stat() calls on successful reads, only when we're polling after
reaching the end of valid WAL. That seems acceptable.

Looks good to me.

Ok, committed.

- Heikki


#12 Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#1)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

Slave does not try anymore to reconnect to master with messages of the type:
FATAL: could not connect to the primary server

I also noticed that there is some delay until modifications on master are
visible on slave.

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Heikki committed a fix for at least the promotion issue. I didn't notice
any problem with an increased delay; could you check again whether you still
see it?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#12)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

Slave does not try anymore to reconnect to master with messages of the type:
FATAL: could not connect to the primary server

I also noticed that there is some delay until modifications on master are
visible on slave.

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Heikki committed a fix for at least the promotion issue, I didn't notice
any problem with an increased delay, could you check again if you still
see it?

I encountered a problem where the timeline switch is not performed as expected.
I set up one master, one standby and one cascaded standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
of those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascaded standby should switch to the new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages kept repeating.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
....
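For reference, a cascaded setup like the one described above would use recovery.conf entries roughly along these lines (hostnames and paths are examples, and whether recovery_target_timeline was set in this test is not stated; without 'latest', a 9.2-era standby will not follow an upstream timeline switch):

```ini
# recovery.conf on the cascaded standby (sby2); hosts and paths hypothetical
standby_mode = 'on'
primary_conninfo = 'host=sby1 port=5432'          # stream from the promoted standby
restore_command = 'cp /path/to/archive/%f "%p"'   # shared archive directory
recovery_target_timeline = 'latest'               # follow timeline switches
```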

Regards,

--
Fujii Masao


#14Andres Freund
andres@2ndquadrant.com
In reply to: Fujii Masao (#13)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

On Fri, Jan 18, 2013 at 2:35 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-01-17 13:47:41 +0900, Michael Paquier wrote:

Slave does not try anymore to reconnect to master with messages of the type:
FATAL: could not connect to the primary server

I also noticed that there is some delay until modifications on master are
visible on slave.

I think that bug has been introduced by commit 7fcbf6a.
Before splitting xlog reading as a separate facility things worked
correctly.
There are also no delay problems before this commit.

Heikki committed a fix for at least the promotion issue, I didn't notice
any problem with an increased delay, could you check again if you still
see it?

I encountered the problem that the timeline switch is not performed expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
....

That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#14)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 17.01.2013 20:08, Andres Freund wrote:

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

I encountered the problem that the timeline switch is not performed expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
....

That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...

I think I broke that with the "teach pg_receivexlog to cross timelines"
patch. Will take a look...

- Heikki


#16Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#15)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 17.01.2013 21:57, Heikki Linnakangas wrote:

On 17.01.2013 20:08, Andres Freund wrote:

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

I encountered the problem that the timeline switch is not performed
expectedly.
I set up one master, one standby and one cascade standby. All the
servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
....

That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...

I think I broke that with the "teach pg_receivexlog to cross timelines"
patch. Will take a look...

OK, there were a couple of issues. First, as I suspected above, I added a
new result set after the COPY has ended in the START_STREAMING command,
but forgot to teach walreceiver about it. Fixed that.

Secondly, there's an interesting interaction between the new xlogreader
code and streaming replication and timeline switches:

The first call to the page_read callback in xlogreader is always made
with size SizeOfXLogRecord, counting from the beginning of the page.
That's bogus; there can be no WAL record at the very beginning of a page,
because of the page header, so I think that was supposed to be
SizeOfXLogShortPHD. But that won't fix the issue.

The problem is that the XLogPageRead callback uses the page address and
requested length to decide which timeline to stream from. For example,
imagine that there's a timeline switch at offset 1000 in a page, and we
have already read all the WAL up to that point. When xlogreader first
asks to fetch the first 32 bytes of the page (the bogus
SizeOfXLogRecord), XLogPageRead deduces that that byte range is still on
the old timeline, and starts streaming from the old timeline. Next,
xlogreader needs the rest of the page, up to 1000 + SizeOfPageHeader, to
read the first record it's actually interested in; XLogPageRead will
return an error because that range is not on the timeline that's
currently being streamed. And we loop back to retry, and run into the
same problem again.

This interaction is a bit too subtle for my taste, but the
straightforward fix is to just modify xlogreader so that the first
read_page call requests all the bytes up to the record we're actually
interested in. That seems like a smart thing to do anyway.

I committed a patch for that second issue too, but please take a look in
case you come up with a better idea.

- Heikki


#17Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#16)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-17 23:49:22 +0200, Heikki Linnakangas wrote:

On 17.01.2013 21:57, Heikki Linnakangas wrote:

On 17.01.2013 20:08, Andres Freund wrote:

On 2013-01-18 03:05:47 +0900, Fujii Masao wrote:

I encountered the problem that the timeline switch is not performed
expectedly.
I set up one master, one standby and one cascade standby. All the
servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
....

That's after the commit or before? Because in passing I think I
noticed/fixed a bug that could cause exactly that problem...

I think I broke that with the "teach pg_receivexlog to cross timelines"
patch. Will take a look...

Ok, there was a couple of issues. First, as I suspected above, I added a new
result set after copy has ended in START_STREAMING command, but forgot to
teach walreceiver about it. Fixed that.

Secondly, there's an interesting interaction between the new xlogreader code
and streaming replication and timeline switches:

The first call to the page_read callback in xlogreader is always made with
size SizeOfXLogRecord, counting from the beginning of the page. That's
bogus; there can be no WAL record in the very beginning of page, because of
the page header, so I think that was supposed to be SizeXLogShortPHD. But
that won't fix the issue.

Yeah, that's clearly a typo.

The problem is that XLogPageRead callback uses the page address and
requested length to decide what timeline to stream from. For example,
imagine that there's a timeline switch at offset 1000 in a page, and we have
already read all the WAL up to that point. When xlogreader first asks to
fetch the first 32 bytes from the page (the bogus SizeOfXLogRecord),
XLogPageRead deduces that that byte range is still on the old timeline, and
starts streaming from the old timeline. Next, xlogreader needs the rest of
the page, up to 1000 + SizeOfPageHeader, to read the first record it's
actually interested in, XLogPageRead will return an error because that range
is not on the timeline that's currently streamed. And we loop back to retry,
and run into the same problem again.

This interaction is a bit too subtle for my taste, but the straightforward
fix is to just modify xlogreader so that the first read_page call requests
all the bytes up to record we're actually interested in. That seems like a
smart thing to do anyway.

Yuck. Reading from targetRecOff onwards seems like a good idea, yes. But
ISTM that the code isn't really safe now: assume targetRecOff is 0
(which is valid); then we read up to SizeOfXLogRecord but assume that we
can read both the xlog record and the page header.
You also need to factor in that the page header can vary in size.

And yes, its too subtle :(

Do you want to fix that or shall I?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#13)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I encountered the problem that the timeline switch is not performed
expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1

I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master and then
killing the master. Promote slave 1 and reconnect slave 2 to slave 1, and
you will notice that the timeline jump is not done.

I don't know if Masao tried to put the slave that reconnects to the
promoted slave in sync, but in this case slave 2 gets stuck in "potential"
state. That is due to the timeline not having changed on slave 2, but
better to let you know...

The replication delays are still here.
--
Michael Paquier
http://michael.otacoo.com

#19Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#18)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:

On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I encountered the problem that the timeline switch is not performed
expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1

I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master, then
kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then
you will notice that the timeline jump is not done.

Can you reproduce that one with 7fcbf6a^ (i.e. before xlogreader got
split off)?

The replication delays are still here.

That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:
    /* See if we need to retrieve more data */
    if (readFile < 0 ||
        (readSource == XLOG_FROM_STREAM &&
-        receivedUpto <= targetPagePtr + reqLen))
+        receivedUpto < targetPagePtr + reqLen))
    {
        if (StandbyMode)
        {

I didn't notice because I had a test script inserting stuff continuously,
and it caused at most a lag of one record...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#20Michael Paquier
michael.paquier@gmail.com
In reply to: Andres Freund (#19)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On Fri, Jan 18, 2013 at 8:48 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Can you reproduce that one with 7fcbf6a^ (i.e before xlogreader got
split off?).

Yes, it is reproducible before the xlogreader split.
Just an additional report: the master jumps correctly to the new timeline.

The replication delays are still here.

That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:
    /* See if we need to retrieve more data */
    if (readFile < 0 ||
        (readSource == XLOG_FROM_STREAM &&
-        receivedUpto <= targetPagePtr + reqLen))
+        receivedUpto < targetPagePtr + reqLen))
    {
        if (StandbyMode)
        {

I didn't notice because I had a testscript inserting stuff continuously
and it cause at most lagging by one record...

This fix is indeed working.
--
Michael Paquier
http://michael.otacoo.com

#21Andres Freund
andres@2ndquadrant.com
In reply to: Michael Paquier (#18)
Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:

On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I encountered the problem that the timeline switch is not performed
expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1

I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master, then
kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then
you will notice that the timeline jump is not done.

I don't know if Masao tried to put in sync the slave that reconnects to the
promoted slave, but in this case slave2 stucks in "potential" state". That
is due to timeline that has not changed on slave2 but better to let you
know...

OK, I know what's causing this now. Rather ugly.

Whenever we access a page in a segment we haven't accessed before, we
read the first page of the segment to do an extra bit of validation, as
the first page in a segment contains more information.

Suppose timeline 1 ends at 0/6087088. xlog.c notices that WAL ends
there, wants to read the new timeline, and requests record
0/6087088. xlogreader wants to do its validation and goes back to the
first page in the segment, which triggers xlog.c to re-request timeline 1
to be transferred...

Heikki, any ideas?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#22Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#21)
1 attachment(s)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On 18.01.2013 02:35, Andres Freund wrote:

On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:

On Fri, Jan 18, 2013 at 3:05 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I encountered the problem that the timeline switch is not performed
expectedly.
I set up one master, one standby and one cascade standby. All the servers
share the archive directory. restore_command is specified in the
recovery.conf
in those two standbys.

I shut down the master, and then promoted the standby. In this case, the
cascade standby should switch to new timeline and replication should be
successfully restarted. But the timeline was never changed, and the
following
log messages were kept outputting.

sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1
sby2 LOG: restarted WAL streaming at 0/3000000 on timeline 1
sby2 LOG: replication terminated by primary server
sby2 DETAIL: End of WAL reached on timeline 1

I am seeing similar issues with master at 88228e6.
This is easily reproducible by setting up 2 slaves under a master, then
kill the master. Promote slave 1 and reconnect slave 2 to slave 1, then
you will notice that the timeline jump is not done.

I don't know if Masao tried to put in sync the slave that reconnects to the
promoted slave, but in this case slave2 stucks in "potential" state". That
is due to timeline that has not changed on slave2 but better to let you
know...

Ok, I know whats causing this now. Rather ugly.

Whenever accessing a page in a segment we haven't accessed before we
read the first page to do an extra bit of validation as the first page
in a segment contains more information.

Suppose timeline 1 ends at 0/6087088, xlog.c notices that WAL ends
there, wants to read the new timeline, requests record
0/06087088. xlogreader wants to do its validation and goes back to the
first page in the segment which triggers xlog.c to rerequest timeline1
to be transferred..

Hmm, so it's the same issue I thought I fixed yesterday. My patch only
fixed it for the case where the timeline switch is in the first page of
the segment. When it's not, you still get two calls for a WAL record:
first one for the first page in the segment, to verify it, and then one
for the page that actually contains the record. The first call leads
XLogPageRead to think it needs to read from the old timeline.

We didn't have this problem before the xlogreader refactoring because
XLogPageRead() was always called with the RecPtr of the record, even
when we actually read the segment header from the file first. We'll have
to somehow get that same information, the RecPtr of the record we're
actually interested in, to XLogPageRead(). We could add a new argument
to the callback for that, or we could keep xlogreader.c as it is and
pass it through from ReadRecord to XLogPageRead() in the private struct.

An explicit argument to the callback is probably best. That's
straightforward, and it might be useful for the callback to know the
actual WAL position that xlogreader.c is interested in anyway. See attached.

- Heikki

Attachments:

choose-correct-timeline-in-xlogpageread.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ba32e..3ac3b76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -626,9 +626,10 @@ static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notexistOk);
 static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
 static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
-				 int reqLen, char *readBuf, TimeLineID *readTLI);
+			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
+			 TimeLineID *readTLI);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt);
+							bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -8832,7 +8833,7 @@ CancelBackup(void)
  */
 static int
 XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
-			 char *readBuf, TimeLineID *readTLI)
+			 XLogRecPtr targetRecPtr, char *readBuf, TimeLineID *readTLI)
 {
 	XLogPageReadPrivate *private =
 		(XLogPageReadPrivate *) xlogreader->private_data;
@@ -8880,7 +8881,8 @@ retry:
 		{
 			if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 											 private->randAccess,
-											 private->fetching_ckpt))
+											 private->fetching_ckpt,
+											 targetRecPtr))
 				goto triggered;
 		}
 		/* In archive or crash recovery. */
@@ -8980,11 +8982,19 @@ triggered:
 }
 
 /*
- * In standby mode, wait for the requested record to become available, either
+ * In standby mode, wait for WAL at position 'RecPtr' to become available, either
  * via restore_command succeeding to restore the segment, or via walreceiver
  * having streamed the record (or via someone copying the segment directly to
  * pg_xlog, but that is not documented or recommended).
  *
+ * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
+ * prepare to read WAL starting from RedoStartLSN after this.
+ *
+ * 'RecPtr' might not point to the beginning of the record we're interested
+ * in, it might also point to the page or segment header. In that case,
+ * 'tliRecPtr' is the position of the WAL record we're interested in. It is
+ * used to decide which timeline to stream the requested WAL from.
+ *
  * When the requested record becomes available, the function opens the file
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
@@ -8992,7 +9002,7 @@ triggered:
  */
 static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
-							bool fetching_ckpt)
+							bool fetching_ckpt, XLogRecPtr tliRecPtr)
 {
 	static pg_time_t last_fail_time = 0;
 	pg_time_t now;
@@ -9076,7 +9086,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						else
 						{
 							ptr = RecPtr;
-							tli = tliOfPointInHistory(ptr, expectedTLEs);
+							tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
 
 							if (curFileTLI > 0 && tli < curFileTLI)
 								elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9499f84..a358a3d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -216,6 +216,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 		randAccess = true;		/* allow readPageTLI to go backwards too */
 	}
 
+	state->currRecPtr = RecPtr;
+
 	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
 	targetRecOff = RecPtr % XLOG_BLCKSZ;
 
@@ -503,6 +505,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
 		readLen = state->read_page(state, targetSegmentPtr, XLOG_BLCKSZ,
+								   state->currRecPtr,
 								   state->readBuf, &state->readPageTLI);
 		if (readLen < 0)
 			goto err;
@@ -521,6 +524,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * so that we can validate it.
 	 */
 	readLen = state->read_page(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
+							   state->currRecPtr,
 							   state->readBuf, &state->readPageTLI);
 	if (readLen < 0)
 		goto err;
@@ -539,6 +543,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	if (readLen < XLogPageHeaderSize(hdr))
 	{
 		readLen = state->read_page(state, pageptr, XLogPageHeaderSize(hdr),
+								   state->currRecPtr,
 								   state->readBuf, &state->readPageTLI);
 		if (readLen < 0)
 			goto err;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 36907d6..3829ce2 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -27,6 +27,7 @@ typedef struct XLogReaderState XLogReaderState;
 typedef int (*XLogPageReadCB) (XLogReaderState *xlogreader,
 										   XLogRecPtr targetPagePtr,
 										   int reqLen,
+										   XLogRecPtr targetRecPtr,
 										   char *readBuf,
 										   TimeLineID *pageTLI);
 
@@ -46,11 +47,17 @@ struct XLogReaderState
 	 * -1 on failure.  The callback shall sleep, if necessary, to wait for the
 	 * requested bytes to become available.  The callback will not be invoked
 	 * again for the same page unless more than the returned number of bytes
-	 * are necessary.
+	 * are needed.
 	 *
-	 * *pageTLI should be set to the TLI of the file the page was read from.
-	 * It is currently used only for error reporting purposes, to reconstruct
-	 * the name of the WAL file where an error occurred.
+	 * targetRecPtr is the position of the WAL record we're reading.  Usually
+	 * it is equal to targetPagePtr + reqLen, but sometimes xlogreader needs
+	 * to read and verify the page or segment header, before it reads the
+	 * actual WAL record it's interested in.  In that case, targetRecPtr can
+	 * be used to determine which timeline to read the page from.
+	 *
+	 * The callback shall set *pageTLI to the TLI of the file the page was
+	 * read from.  It is currently used only for error reporting purposes, to
+	 * reconstruct the name of the WAL file where an error occurred.
 	 */
 	XLogPageReadCB read_page;
 
@@ -90,6 +97,9 @@ struct XLogReaderState
 	XLogRecPtr	latestPagePtr;
 	TimeLineID	latestPageTLI;
 
+	/* beginning of the WAL record being read. */
+	XLogRecPtr	currRecPtr;
+
 	/* Buffer for current ReadRecord result (expandable) */
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#19)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

Andres Freund escribió:

On 2013-01-18 08:24:31 +0900, Michael Paquier wrote:

The replication delays are still here.

That one is caused by this nice bug, courtesy of yours truly:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 90ba32e..1174493 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8874,7 +8874,7 @@ retry:
    /* See if we need to retrieve more data */
    if (readFile < 0 ||
        (readSource == XLOG_FROM_STREAM &&
-        receivedUpto <= targetPagePtr + reqLen))
+        receivedUpto < targetPagePtr + reqLen))
    {
        if (StandbyMode)
        {

Pushed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#24Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#22)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On Fri, Jan 18, 2013 at 6:20 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Hmm, so it's the same issue I thought I fixed yesterday. My patch only
fixed it for the case that the timeline switch is in the first page of the
segment. When it's not, you still get two calls for a WAL record, first one
for the first page in the segment, to verify that, and then the page that
actually contains the record. The first call leads XLogPageRead to think it
needs to read from the old timeline.

We didn't have this problem before the xlogreader refactoring because
XLogPageRead() was always called with the RecPtr of the record, even when
we actually read the segment header from the file first. We'll have to
somehow get that same information, the RecPtr of the record we're actually
interested in, to XLogPageRead(). We could add a new argument to the
callback for that, or we could keep xlogreader.c as it is and pass it
through from ReadRecord to XLogPageRead() in the private struct.

An explicit argument to the callback is probably best. That's
straightforward, and it might be useful for the callback to know the actual
WAL position that xlogreader.c is interested in anyway. See attached.

Just to let you know that I am still getting the error even after commit
2ff6555.
With the same scenario:
1) Start a master with 2 slaves
2) Kill/Stop slave
3) Promote slave 1, it switches to timeline 2
Log on slave 1
LOG: selected new timeline ID: 2
4) Reconnect slave 2 to slave 1, slave 2 remains stuck in timeline 1 even if
recovery_target_timeline is set to latest
Log on slave 1 at this moment:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: TIMELINE_HISTORY 2
DEBUG: received replication command: START_REPLICATION 0/5000000 TIMELINE 1
Slave 1 receives the command to start replication with timeline 1, while it
is in sync with timeline 2.
Log on slave 2 at this moment:
LOG: restarted WAL streaming at 0/5000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/5014200
DEBUG: walreceiver ended streaming and awaits new instructions

The timeline history file is the same for both nodes:
$ cat 00000002.history
1 0/5014200 no recovery target specified

I might be wrong, but shouldn't there also be an entry for timeline 2 in
this file?

Am I missing something?
--
Michael Paquier
http://michael.otacoo.com

#25 Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#24)
Re: Re: Slave enters in recovery and promotes when WAL stream with master is cut + delay master/slave

On Tue, Jan 22, 2013 at 9:06 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

[full quote of #24 trimmed]

Am I missing something?

Sorry, there are no problems...
I simply forgot to set recovery_target_timeline to 'latest' in
recovery.conf...
--
Michael Paquier
http://michael.otacoo.com