XMin Hot Standby Feedback patch

Started by Daniel Farinaalmost 15 years ago13 messages
#1Daniel Farina
daniel@heroku.com
1 attachment(s)

This is another bit of the syncrep patch split out.

I will revisit the replication timeout one Real Soon, I promise -- but
I have a couple things to do today that may delay that until the
evening.

https://github.com/fdr/postgres/commit/ad3ce9ac62f0e128d7d1fd20d47184f867056af1

Context diff supplied here.

Note that this information is not exposed via catalog in the original
syncrep patch, and is not here. Do we want that kind of reporting?

--
fdr

Attachments:

Hot-standby-feedback-patch.patchtext/x-patch; charset=US-ASCII; name=Hot-standby-feedback-patch.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2029,2034 **** SET ENABLE_SEQSCAN TO OFF;
--- 2029,2038 ----
          This parameter can only be set in the <filename>postgresql.conf</>
          file or on the server command line.
         </para>
+        <para>
+         You should also consider setting <varname>hot_standby_feedback</>
+         as an alternative to using this parameter.
+        </para>
        </listitem>
       </varlistentry>
       </variablelist>
***************
*** 2121,2126 **** SET ENABLE_SEQSCAN TO OFF;
--- 2125,2145 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
+       <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
+       <indexterm>
+        <primary><varname>hot_standby_feedback</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies whether or not a hot standby will send feedback to the primary
+         about queries currently executing on the standby. This parameter can
+         be used to eliminate query cancels caused by cleanup records, though
+         it can cause database bloat on the primary for some workloads.
+         The default value is <literal>off</literal>.
+         This parameter can only be set at server start. It only has effect
+         if <varname>hot_standby</> is enabled.
+ 
       <varlistentry id="guc-replication-timeout-server" xreflabel="replication_timeout_server">
        <term><varname>replication_timeout_server</varname> (<type>integer</type>)</term>
        <indexterm>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1393,1403 **** if (!triggered)
      These conflicts are <emphasis>hard conflicts</> in the sense that queries
      might need to be cancelled and, in some cases, sessions disconnected to resolve them.
      The user is provided with several ways to handle these
!     conflicts. Conflict cases include:
  
        <itemizedlist>
         <listitem>
          <para>
           Access Exclusive locks taken on the primary server, including both
           explicit <command>LOCK</> commands and various <acronym>DDL</>
           actions, conflict with table accesses in standby queries.
--- 1393,1410 ----
      These conflicts are <emphasis>hard conflicts</> in the sense that queries
      might need to be cancelled and, in some cases, sessions disconnected to resolve them.
      The user is provided with several ways to handle these
!     conflicts. Conflict cases in order of likely frequency are:
  
        <itemizedlist>
         <listitem>
          <para>
+          Application of a vacuum cleanup record from WAL conflicts with
+          standby transactions whose snapshots can still <quote>see</> any of
+          the rows to be removed.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
           Access Exclusive locks taken on the primary server, including both
           explicit <command>LOCK</> commands and various <acronym>DDL</>
           actions, conflict with table accesses in standby queries.
***************
*** 1417,1432 **** if (!triggered)
         </listitem>
         <listitem>
          <para>
!          Application of a vacuum cleanup record from WAL conflicts with
!          standby transactions whose snapshots can still <quote>see</> any of
!          the rows to be removed.
!         </para>
!        </listitem>
!        <listitem>
!         <para>
!          Application of a vacuum cleanup record from WAL conflicts with
!          queries accessing the target page on the standby, whether or not
!          the data to be removed is visible.
          </para>
         </listitem>
        </itemizedlist>
--- 1424,1433 ----
         </listitem>
         <listitem>
          <para>
! 	 Buffer pin deadlock caused by application of a vacuum cleanup
!          record from WAL conflicts with queries accessing the target
!          page on the standby, whether or not the data to be removed is
!          visible.
          </para>
         </listitem>
        </itemizedlist>
***************
*** 1539,1555 **** if (!triggered)
  
     <para>
      Remedial possibilities exist if the number of standby-query cancellations
!     is found to be unacceptable.  The first option is to connect to the
!     primary server and keep a query active for as long as needed to
!     run queries on the standby. This prevents <command>VACUUM</> from removing
!     recently-dead rows and so cleanup conflicts do not occur.
!     This could be done using <xref linkend="dblink"> and
!     <function>pg_sleep()</>, or via other mechanisms. If you do this, you
      should note that this will delay cleanup of dead rows on the primary,
      which may result in undesirable table bloat. However, the cleanup
      situation will be no worse than if the standby queries were running
!     directly on the primary server, and you are still getting the benefit of
!     off-loading execution onto the standby.
      <varname>max_standby_archive_delay</> must be kept large in this case,
      because delayed WAL files might already contain entries that conflict with
      the desired standby queries.
--- 1540,1555 ----
  
     <para>
      Remedial possibilities exist if the number of standby-query cancellations
!     is found to be unacceptable.  Typically the best option is to enable
!     <varname>hot_standby_feedback</>. This prevents <command>VACUUM</> from
!     removing recently-dead rows and so cleanup conflicts do not occur.
!     If you do this, you
      should note that this will delay cleanup of dead rows on the primary,
      which may result in undesirable table bloat. However, the cleanup
      situation will be no worse than if the standby queries were running
!     directly on the primary server. You are still getting the benefit
!     of off-loading execution onto the standby and the query may complete
!     faster than it would have done on the primary server.
      <varname>max_standby_archive_delay</> must be kept large in this case,
      because delayed WAL files might already contain entries that conflict with
      the desired standby queries.
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 76,81 **** bool		fullPageWrites = true;
--- 76,82 ----
  bool		log_checkpoints = false;
  int			sync_method = DEFAULT_SYNC_METHOD;
  int			wal_level = WAL_LEVEL_MINIMAL;
+ bool		hot_standby_feedback;
  
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
***************
*** 6159,6164 **** StartupXLOG(void)
--- 6160,6172 ----
  			if (XLByteLT(ControlFile->minRecoveryPoint, checkPoint.redo))
  				ControlFile->minRecoveryPoint = checkPoint.redo;
  		}
+ 		else
+ 		{
+ 			/*
+ 			 * No need to calculate feedback if we're not in Hot Standby.
+ 			 */
+ 			hot_standby_feedback = false;
+ 		}
  
  		/*
  		 * set backupStartupPoint if we're starting archive recovery from a
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
***************
*** 38,43 ****
--- 38,44 ----
  #include <signal.h>
  #include <unistd.h>
  
+ #include "access/transam.h"
  #include "access/xlog_internal.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
***************
*** 45,50 ****
--- 46,52 ----
  #include "replication/walreceiver.h"
  #include "storage/ipc.h"
  #include "storage/pmsignal.h"
+ #include "storage/procarray.h"
  #include "utils/builtins.h"
  #include "utils/guc.h"
  #include "utils/memutils.h"
***************
*** 609,614 **** XLogWalRcvSendReply(void)
--- 611,620 ----
  	reply_message.flush = LogstreamResult.Flush;
  	reply_message.apply = GetXLogReplayRecPtr();
  	reply_message.sendTime = now;
+ 	if (hot_standby_feedback)
+ 		reply_message.xmin = GetOldestXmin(true, false);
+ 	else
+ 		reply_message.xmin = InvalidTransactionId;
  
  	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
  				 reply_message.write.xlogid, reply_message.write.xrecoff,
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 551,556 **** ProcessStandbyReplyMessage(void)
--- 551,559 ----
  		walsnd->write = reply.write;
  		walsnd->flush = reply.flush;
  		walsnd->apply = reply.apply;
+ 		if (TransactionIdIsValid(reply.xmin) &&
+ 			TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+ 			MyProc->xmin = reply.xmin;
  		SpinLockRelease(&walsnd->mutex);
  	}
  }
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 326,329 **** extern void do_pg_abort_backup(void);
--- 326,332 ----
  #define BACKUP_LABEL_FILE		"backup_label"
  #define BACKUP_LABEL_OLD		"backup_label.old"
  
+ /* Indicates that the hot standby should send feedback */
+ extern bool hot_standby_feedback;
+ 
  #endif   /* XLOG_H */
*** a/src/include/replication/walprotocol.h
--- b/src/include/replication/walprotocol.h
***************
*** 56,61 **** typedef struct
--- 56,68 ----
  	XLogRecPtr	flush;
  	XLogRecPtr	apply;
  
+ 	/*
+ 	 * The current xmin from the standby, for Hot Standby feedback.
+ 	 * This may be invalid if the standby-side does not support feedback,
+ 	 * or Hot Standby is not yet available.
+ 	 */
+ 	TransactionId	xmin;
+ 
  	/* Sender's system clock at the time of transmission */
  	TimestampTz sendTime;
  } StandbyReplyMessage;
#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Daniel Farina (#1)
1 attachment(s)
Re: XMin Hot Standby Feedback patch

On Sat, 2011-02-12 at 14:11 -0800, Daniel Farina wrote:

This is another bit of the syncrep patch split out.

I will revisit the replication timeout one Real Soon, I promise -- but
I have a couple things to do today that may delay that until the
evening.

https://github.com/fdr/postgres/commit/ad3ce9ac62f0e128d7d1fd20d47184f867056af1

Context diff supplied here.

Greg just tipped me off to this thread a few hours ago. I saw your other
work on timeouts which looks good.

I've reworked this feature myself, and its roughly the same thing you
have posted, so I will just add on to this thread. The major change from
my earlier patch is that the logic around setting xmin on the master is
considerably tighter, and correctly uses locking.

Patch attached, no docs yet, but the patch is clear.

I'm looking to commit this in next 24 hours barring objections and/or
test failures.

Note that this information is not exposed via catalog in the original
syncrep patch, and is not here. Do we want that kind of reporting?

Probably, but its a small change and will conflict with other work for
now.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

Attachments:

hs_feedback.v11.patchtext/x-patch; charset=UTF-8; name=hs_feedback.v11.patchDownload
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 3277da8..c4ebf97 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -45,6 +45,7 @@
 #include "replication/walreceiver.h"
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
+#include "storage/procarray.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -56,6 +57,7 @@ bool		am_walreceiver;
 
 /* GUC variable */
 int			wal_receiver_status_interval;
+bool		hot_standby_feedback;
 
 /* libpqreceiver hooks to these when loaded */
 walrcv_connect_type walrcv_connect = NULL;
@@ -610,10 +612,14 @@ XLogWalRcvSendReply(void)
 	reply_message.apply = GetXLogReplayRecPtr();
 	reply_message.sendTime = now;
 
-	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
+	if (hot_standby_feedback)
+		reply_message.xmin = GetOldestXmin(true, false);
+
+	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X xmin %u",
 				 reply_message.write.xlogid, reply_message.write.xrecoff,
 				 reply_message.flush.xlogid, reply_message.flush.xrecoff,
-				 reply_message.apply.xlogid, reply_message.apply.xrecoff);
+				 reply_message.apply.xlogid, reply_message.apply.xrecoff,
+				 reply_message.xmin);
 
 	/* Prepend with the message type and send it. */
 	buf[0] = 'r';
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3ad95b4..36eb3a9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -53,6 +53,7 @@
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -498,6 +499,7 @@ ProcessStandbyReplyMessage(void)
 	static StringInfoData input_message;
 	StandbyReplyMessage	reply;
 	char msgtype;
+	TransactionId newxmin = InvalidTransactionId;
 
 	initStringInfo(&input_message);
 
@@ -524,10 +526,11 @@ ProcessStandbyReplyMessage(void)
 
 	pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
 
-	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X ",
+	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X xmin %u",
 		 reply.write.xlogid, reply.write.xrecoff,
 		 reply.flush.xlogid, reply.flush.xrecoff,
-		 reply.apply.xlogid, reply.apply.xrecoff);
+		 reply.apply.xlogid, reply.apply.xrecoff,
+		 reply.xmin);
 
 	/*
 	 * Update shared state for this WalSender process
@@ -543,6 +546,74 @@ ProcessStandbyReplyMessage(void)
 		walsnd->apply = reply.apply;
 		SpinLockRelease(&walsnd->mutex);
 	}
+
+	/*
+	 * Update the WalSender's proc xmin to allow it to be visible
+	 * to snapshots. This will hold back the removal of dead rows
+	 * and thereby prevent the generation of cleanup conflicts
+	 * on the standby server.
+	 */
+	if (TransactionIdIsValid(reply.xmin))
+	{
+		TransactionId safeLimit;
+#define HOT_STANDBY_FEEDBACK_HORIZON		 1000000000
+
+		if (!TransactionIdIsValid(MyProc->xmin))
+		{
+			/*
+			 * Initialise MyProc->xmin from standby feedback.
+			 * Don't allow use oldestXmin if it is too far back.
+			 */
+			safeLimit = ReadNewTransactionId() - HOT_STANDBY_FEEDBACK_HORIZON;
+			if (!TransactionIdIsNormal(safeLimit))
+				safeLimit = FirstNormalTransactionId;
+
+			if (TransactionIdPrecedes(safeLimit, reply.xmin))
+			{
+				TransactionId oldestXmin = GetOldestXmin(true, true);
+
+				if (TransactionIdPrecedes(oldestXmin, reply.xmin))
+					newxmin = reply.xmin;
+				else
+					newxmin = oldestXmin;
+			}
+			else
+				ereport(WARNING,
+						(errmsg("standby xmin is far in the past"),
+						 errhint("standby must catch up before hot standby feedback is effective.")));
+		}
+		else
+		{
+			/*
+			 * Feedback from standby should move us forwards, but not too far.
+			 * Avoid grabbing XidGenLock here in case of waits, so use
+			 * a heuristic instead.
+			 */
+			Assert(TransactionIdIsValid(MyProc->xmin));
+			safeLimit = MyProc->xmin + HOT_STANDBY_FEEDBACK_HORIZON;
+			if (!TransactionIdIsNormal(safeLimit))
+				safeLimit = FirstNormalTransactionId;
+
+			if (TransactionIdPrecedes(safeLimit, reply.xmin))
+			{
+				ereport(WARNING,
+						(errmsg("standby xmin is far in the past"),
+						 errhint("standby must catch up before hot standby feedback is effective.")));
+			}
+			else
+			{
+				if (TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+					newxmin = reply.xmin;
+			}
+		}
+	}
+
+	/*
+	 * Grab the ProcArrayLock to set xmin
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	MyProc->xmin = newxmin;
+	LWLockRelease(ProcArrayLock);
 }
 
 /* Main loop of walsender process */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 470183d..08cf941 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1279,6 +1279,15 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
+		{"hot_standby_feedback", PGC_SIGHUP, WAL_STANDBY_SERVERS,
+			gettext_noop("Allows feedback from a hot standby primary that will avoid query conflicts."),
+			NULL
+		},
+		&hot_standby_feedback,
+		false, NULL, NULL
+	},
+
+	{
 		{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Allows modifications of the structure of system tables."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5d31365..3ef6813 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -196,6 +196,7 @@
 
 #hot_standby = off			# "on" allows queries during recovery
 					# (change requires restart)
+#hot_standby_feedback = off	# info from standby to prevent query conflicts
 #max_standby_archive_delay = 30s	# max delay before canceling queries
 					# when reading WAL from archive;
 					# -1 allows indefinite delay
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 32c4962..8e4e7d0 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -56,6 +56,13 @@ typedef struct
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/*
+	 * The current xmin from the standby, for Hot Standby feedback.
+	 * This may be invalid if the standby-side does not support feedback,
+	 * or Hot Standby is not yet available.
+	 */
+	TransactionId	xmin;
+
 	/* Sender's system clock at the time of transmission */
 	TimestampTz sendTime;
 } StandbyReplyMessage;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index aa5bfb7..9137b86 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -18,6 +18,7 @@
 
 extern bool am_walreceiver;
 extern int wal_receiver_status_interval;
+extern bool hot_standby_feedback;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#2)
Re: XMin Hot Standby Feedback patch

On 15.02.2011 18:42, Simon Riggs wrote:

On Sat, 2011-02-12 at 14:11 -0800, Daniel Farina wrote:

This is another bit of the syncrep patch split out.

I will revisit the replication timeout one Real Soon, I promise -- but
I have a couple things to do today that may delay that until the
evening.

https://github.com/fdr/postgres/commit/ad3ce9ac62f0e128d7d1fd20d47184f867056af1

Context diff supplied here.

Greg just tipped me off to this thread a few hours ago. I saw your other
work on timeouts which looks good.

I've reworked this feature myself, and its roughly the same thing you
have posted, so I will just add on to this thread. The major change from
my earlier patch is that the logic around setting xmin on the master is
considerably tighter, and correctly uses locking.

It would be wise to also transmit the epoch in addition to xmin, to
avoid confusion if the standby is > 2 billion transactions behind.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: XMin Hot Standby Feedback patch

On Tue, Feb 15, 2011 at 11:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 15.02.2011 18:42, Simon Riggs wrote:

On Sat, 2011-02-12 at 14:11 -0800, Daniel Farina wrote:

This is another bit of the syncrep patch split out.

I will revisit the replication timeout one Real Soon, I promise -- but
I have a couple things to do today that may delay that until the
evening.

https://github.com/fdr/postgres/commit/ad3ce9ac62f0e128d7d1fd20d47184f867056af1

Context diff supplied here.

Greg just tipped me off to this thread a few hours ago. I saw your other
work on timeouts which looks good.

I've reworked this feature myself, and its roughly the same thing you
have posted, so I will just add on to this thread. The major change from
my earlier patch is that the logic around setting xmin on the master is
considerably tighter, and correctly uses locking.

It would be wise to also transmit the epoch in addition to xmin, to avoid
confusion if the standby is > 2 billion transactions behind.

That case is probably hopelessly broken anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#4)
Re: XMin Hot Standby Feedback patch

On 15.02.2011 18:52, Robert Haas wrote:

On Tue, Feb 15, 2011 at 11:49 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

It would be wise to also transmit the epoch in addition to xmin, to avoid
confusion if the standby is> 2 billion transactions behind.

That case is probably hopelessly broken anyway.

I don't expect the feedback to do anything too useful in case of xid
wraparound, but at least the master would recognize the situation and
know to ignore the bogus xmin from that standby.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#3)
Re: XMin Hot Standby Feedback patch

On Tue, 2011-02-15 at 18:49 +0200, Heikki Linnakangas wrote:

It would be wise to also transmit the epoch in addition to xmin, to
avoid confusion if the standby is > 2 billion transactions behind.

Yes, good idea, thanks.

That has to be the record for the fastest patch review. ;-)

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#7Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#2)
Re: XMin Hot Standby Feedback patch

On Tue, Feb 15, 2011 at 11:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch attached, no docs yet, but the patch is clear.

I'm looking to commit this in next 24 hours barring objections and/or
test failures.

Looks pretty good to me, though I haven't tested it. I like some of
the safety valves you put in there, but I don't understand this part:

+                       /*
+                        * Feedback from standby should move us
forwards, but not too far.
+                        * Avoid grabbing XidGenLock here in case of
waits, so use
+                        * a heuristic instead.
+                        */

What's XIDGenLock got to do with it?

On another disk, I think that those warning messages are a bad idea.
That could fill up someone's disk really quickly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#7)
Re: XMin Hot Standby Feedback patch

On Tue, Feb 15, 2011 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On another disk, I think that those warning messages are a bad idea.
That could fill up someone's disk really quickly.

On another disk? What the heck am I talking about?

On another point? On another note? Anyway, you get the idea... hopefully.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#8)
Re: XMin Hot Standby Feedback patch

On Tue, 2011-02-15 at 12:20 -0500, Robert Haas wrote:

On Tue, Feb 15, 2011 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On another disk, I think that those warning messages are a bad idea.
That could fill up someone's disk really quickly.

On another disk? What the heck am I talking about?

On another point? On another note? Anyway, you get the idea... hopefully.

Yeh. I quite liked it as a metaphorical turn of phrase.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#7)
1 attachment(s)
Re: XMin Hot Standby Feedback patch

On Tue, 2011-02-15 at 12:15 -0500, Robert Haas wrote:

Looks pretty good to me, though I haven't tested it. I like some of
the safety valves you put in there, but I don't understand this part

Reworked logic covering all feedback, plus tests, plus docs.

Last comments before commit please.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

Attachments:

hs_feedback.v12.patchtext/x-patch; charset=UTF-8; name=hs_feedback.v12.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 63c6283..30c33fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2029,6 +2029,10 @@ SET ENABLE_SEQSCAN TO OFF;
         This parameter can only be set in the <filename>postgresql.conf</>
         file or on the server command line.
        </para>
+       <para>
+        You should also consider setting <varname>hot_standby_feedback</>
+        as an alternative to using this parameter.
+       </para>
       </listitem>
      </varlistentry>
      </variablelist>
@@ -2121,6 +2125,22 @@ SET ENABLE_SEQSCAN TO OFF;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
+      <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
+      <indexterm>
+       <primary><varname>hot_standby_feedback</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Specifies whether or not a hot standby will send feedback to the primary
+        about queries currently executing on the standby. This parameter can
+        be used to eliminate query cancels caused by cleanup records, though
+        it can cause database bloat on the primary for some workloads.
+        The default value is <literal>off</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
    </sect1>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a892969..6941e67 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1486,23 +1486,6 @@ if (!triggered)
    </para>
 
    <para>
-    The most common reason for conflict between standby queries and WAL replay
-    is <quote>early cleanup</>.  Normally, <productname>PostgreSQL</> allows
-    cleanup of old row versions when there are no transactions that need to
-    see them to ensure correct visibility of data according to MVCC rules.
-    However, this rule can only be applied for transactions executing on the
-    master.  So it is possible that cleanup on the master will remove row
-    versions that are still visible to a transaction on the standby.
-   </para>
-
-   <para>
-    Experienced users should note that both row version cleanup and row version
-    freezing will potentially conflict with standby queries. Running a manual
-    <command>VACUUM FREEZE</> is likely to cause conflicts even on tables with
-    no updated or deleted rows.
-   </para>
-
-   <para>
     Once the delay specified by <varname>max_standby_archive_delay</> or
     <varname>max_standby_streaming_delay</> has been exceeded, conflicting
     queries will be cancelled.  This usually results just in a cancellation
@@ -1529,6 +1512,23 @@ if (!triggered)
    </para>
 
    <para>
+    The most common reason for conflict between standby queries and WAL replay
+    is <quote>early cleanup</>.  Normally, <productname>PostgreSQL</> allows
+    cleanup of old row versions when there are no transactions that need to
+    see them to ensure correct visibility of data according to MVCC rules.
+    However, this rule can only be applied for transactions executing on the
+    master.  So it is possible that cleanup on the master will remove row
+    versions that are still visible to a transaction on the standby.
+   </para>
+
+   <para>
+    Experienced users should note that both row version cleanup and row version
+    freezing will potentially conflict with standby queries. Running a manual
+    <command>VACUUM FREEZE</> is likely to cause conflicts even on tables with
+    no updated or deleted rows.
+   </para>
+
+   <para>
     Users should be clear that tables that are regularly and heavily updated
     on the primary server will quickly cause cancellation of longer running
     queries on the standby. In such cases the setting of a finite value for
@@ -1539,12 +1539,10 @@ if (!triggered)
 
    <para>
     Remedial possibilities exist if the number of standby-query cancellations
-    is found to be unacceptable.  The first option is to connect to the
-    primary server and keep a query active for as long as needed to
-    run queries on the standby. This prevents <command>VACUUM</> from removing
-    recently-dead rows and so cleanup conflicts do not occur.
-    This could be done using <xref linkend="dblink"> and
-    <function>pg_sleep()</>, or via other mechanisms. If you do this, you
+    is found to be unacceptable.  The first option is to set the parameter
+    <varname>hot_standby_feedback</>, which prevents <command>VACUUM</> from
+    removing recently-dead rows and so cleanup conflicts do not occur.
+    If you do this, you
     should note that this will delay cleanup of dead rows on the primary,
     which may result in undesirable table bloat. However, the cleanup
     situation will be no worse than if the standby queries were running
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 3277da8..e23c4e5 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -45,6 +45,7 @@
 #include "replication/walreceiver.h"
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
+#include "storage/procarray.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -56,6 +57,7 @@ bool		am_walreceiver;
 
 /* GUC variable */
 int			wal_receiver_status_interval;
+bool		hot_standby_feedback;
 
 /* libpqreceiver hooks to these when loaded */
 walrcv_connect_type walrcv_connect = NULL;
@@ -610,10 +612,28 @@ XLogWalRcvSendReply(void)
 	reply_message.apply = GetXLogReplayRecPtr();
 	reply_message.sendTime = now;
 
-	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
+	/*
+	 * Get the OldestXmin and its associated epoch
+	 */
+	if (hot_standby_feedback)
+	{
+		TransactionId	nextXid;
+		uint32			nextEpoch;
+
+		reply_message.xmin = GetOldestXmin(true, false);
+
+		GetNextXidAndEpoch(&nextXid, &nextEpoch);
+		if (nextXid < reply_message.xmin)
+			nextEpoch--;
+		reply_message.epoch = nextEpoch;
+	}
+
+	elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X xmin %u epoch %u",
 				 reply_message.write.xlogid, reply_message.write.xrecoff,
 				 reply_message.flush.xlogid, reply_message.flush.xrecoff,
-				 reply_message.apply.xlogid, reply_message.apply.xrecoff);
+				 reply_message.apply.xlogid, reply_message.apply.xrecoff,
+				 reply_message.xmin,
+				 reply_message.epoch);
 
 	/* Prepend with the message type and send it. */
 	buf[0] = 'r';
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3ad95b4..bf58482 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -53,6 +53,7 @@
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
+#include "storage/procarray.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
@@ -498,6 +499,7 @@ ProcessStandbyReplyMessage(void)
 	static StringInfoData input_message;
 	StandbyReplyMessage	reply;
 	char msgtype;
+	TransactionId newxmin = InvalidTransactionId;
 
 	initStringInfo(&input_message);
 
@@ -524,10 +526,12 @@ ProcessStandbyReplyMessage(void)
 
 	pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
 
-	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X ",
+	elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X xmin %u epoch %u",
 		 reply.write.xlogid, reply.write.xrecoff,
 		 reply.flush.xlogid, reply.flush.xrecoff,
-		 reply.apply.xlogid, reply.apply.xrecoff);
+		 reply.apply.xlogid, reply.apply.xrecoff,
+		 reply.xmin,
+		 reply.epoch);
 
 	/*
 	 * Update shared state for this WalSender process
@@ -543,6 +547,69 @@ ProcessStandbyReplyMessage(void)
 		walsnd->apply = reply.apply;
 		SpinLockRelease(&walsnd->mutex);
 	}
+
+	/*
+	 * Update the WalSender's proc xmin to allow it to be visible
+	 * to snapshots. This will hold back the removal of dead rows
+	 * and thereby prevent the generation of cleanup conflicts
+	 * on the standby server.
+	 */
+	if (TransactionIdIsValid(reply.xmin))
+	{
+		TransactionId	nextXid;
+		uint32			nextEpoch;
+		bool			epochOK;
+
+		GetNextXidAndEpoch(&nextXid, &nextEpoch);
+
+		/*
+		 * Epoch of oldestXmin should be same as standby or
+		 * if the counter has wrapped, then one less than reply.
+		 */
+		if (reply.xmin <= nextXid)
+		{
+			if (reply.epoch == nextEpoch)
+				epochOK = true;
+		}
+		else
+		{
+			if (nextEpoch > 0 && reply.epoch == nextEpoch - 1)
+				epochOK = true;
+		}
+
+		/*
+		 * Feedback from standby must not go backwards, nor should it go
+		 * forwards further than our most recent xid.
+		 */
+		if (epochOK && TransactionIdPrecedesOrEquals(reply.xmin, nextXid))
+		{
+			if (!TransactionIdIsValid(MyProc->xmin))
+			{
+				TransactionId oldestXmin = GetOldestXmin(true, true);
+
+				if (TransactionIdPrecedes(oldestXmin, reply.xmin))
+					newxmin = reply.xmin;
+				else
+					newxmin = oldestXmin;
+			}
+			else
+			{
+				Assert(TransactionIdIsValid(MyProc->xmin));
+
+				if (TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+					newxmin = reply.xmin;
+				else
+					newxmin = MyProc->xmin; /* stay the same */
+			}
+		}
+	}
+
+	/*
+	 * Grab the ProcArrayLock to set xmin, or invalidate for bad reply
+	 */
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	MyProc->xmin = newxmin;
+	LWLockRelease(ProcArrayLock);
 }
 
 /* Main loop of walsender process */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 470183d..08cf941 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1279,6 +1279,15 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
+		{"hot_standby_feedback", PGC_SIGHUP, WAL_STANDBY_SERVERS,
+			gettext_noop("Allows feedback from a hot standby primary that will avoid query conflicts."),
+			NULL
+		},
+		&hot_standby_feedback,
+		false, NULL, NULL
+	},
+
+	{
 		{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
 			gettext_noop("Allows modifications of the structure of system tables."),
 			NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5d31365..3ef6813 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -196,6 +196,7 @@
 
 #hot_standby = off			# "on" allows queries during recovery
 					# (change requires restart)
+#hot_standby_feedback = off	# info from standby to prevent query conflicts
 #max_standby_archive_delay = 30s	# max delay before canceling queries
 					# when reading WAL from archive;
 					# -1 allows indefinite delay
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 32c4962..b026143 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -56,6 +56,15 @@ typedef struct
 	XLogRecPtr	flush;
 	XLogRecPtr	apply;
 
+	/*
+	 * The current xmin and epochfrom the standby, for Hot Standby feedback.
+	 * This may be invalid if the standby-side does not support feedback,
+	 * or Hot Standby is not yet available.
+	 */
+	TransactionId	xmin;
+	uint32			epoch;
+
+
 	/* Sender's system clock at the time of transmission */
 	TimestampTz sendTime;
 } StandbyReplyMessage;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index aa5bfb7..9137b86 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -18,6 +18,7 @@
 
 extern bool am_walreceiver;
 extern int wal_receiver_status_interval;
+extern bool hot_standby_feedback;
 
 /*
  * MAXCONNINFO: maximum size of a connection string.
#11Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#10)
Re: XMin Hot Standby Feedback patch

On Tue, Feb 15, 2011 at 4:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2011-02-15 at 12:15 -0500, Robert Haas wrote:

Looks pretty good to me, though I haven't tested it.  I like some of
the safety valves you put in there, but I don't understand this part

Reworked logic covering all feedback, plus tests, plus docs.

Last comments before commit please.

What happens if someone has hot_standby_feedback on and then turns it
off? I think in XLogWalRcvSendReply() you need

if (hot_standby_feedback)
{
stuff
}
else
{
reply_message.xmin = InvaidXID;
reply_message.epoch = 0; /* or something */
}

Also this part looks kludgy to me:

+		GetNextXidAndEpoch(&nextXid, &nextEpoch);
+		if (nextXid < reply_message.xmin)
+			nextEpoch--;

How about introducing a GetOldestXminAndEpoch function instead?

Would it make sense to avoid grabbing the ProcArrayLock except when we
truly need to update MyProc->xmin? ProcessStandbyReplyMessage gets
called a lot...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#10)
Re: XMin Hot Standby Feedback patch

On Wed, Feb 16, 2011 at 6:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2011-02-15 at 12:15 -0500, Robert Haas wrote:

Looks pretty good to me, though I haven't tested it.  I like some of
the safety valves you put in there, but I don't understand this part

Reworked logic covering all feedback, plus tests, plus docs.

Last comments before commit please.

When I started the standby with hot_standby = off and hot_standby_feedback = on,
I got the following assertion error.

-----------------
sby LOG: streaming replication successfully connected to primary
TRAP: FailedAssertion("!(((result) >= ((TransactionId) 3)))", File:
"procarray.c", Line: 1027)
act LOG: unexpected EOF on standby connection
sby LOG: WAL receiver process (PID 17572) was terminated by signal 6: Aborted
sby LOG: terminating any other active server processes
-----------------

vacuum_defer_cleanup_age on the *standby* should not affect the
feedback xid.

VACUUM TABLE on the *primary* doesn't use the feedback xid at all.
Is this intentional?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#12)
Re: XMin Hot Standby Feedback patch

On Wed, 2011-02-16 at 11:25 +0900, Fujii Masao wrote:

On Wed, Feb 16, 2011 at 6:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2011-02-15 at 12:15 -0500, Robert Haas wrote:

Looks pretty good to me, though I haven't tested it. I like some of
the safety valves you put in there, but I don't understand this part

Reworked logic covering all feedback, plus tests, plus docs.

Last comments before commit please.

When I started the standby with hot_standby = off and hot_standby_feedback = on,
I got the following assertion error.

-----------------
sby LOG: streaming replication successfully connected to primary
TRAP: FailedAssertion("!(((result) >= ((TransactionId) 3)))", File:
"procarray.c", Line: 1027)
act LOG: unexpected EOF on standby connection
sby LOG: WAL receiver process (PID 17572) was terminated by signal 6: Aborted
sby LOG: terminating any other active server processes
-----------------

Thanks

vacuum_defer_cleanup_age on the *standby* should not affect the
feedback xid.

Not sure, will think some more.

VACUUM TABLE on the *primary* doesn't use the feedback xid at all.
Is this intentional?

Yes, I was in the middle of fixing that.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services