Sync Rep for 2011CF1
Here's the latest patch for sync rep.
From here, I will be developing the patch further on public git
repository towards commit. My expectation is that commit is at least 2
weeks away, though there are no major unresolved problems. I expect
essential follow on patches to continue for a further 2-4 weeks after
that first commit.
I will add my own reviewer's notes tomorrow.
In terms of testing, the patch hasn't been tested further than my own
laptop as yet, so it seems likely there's a few trivial howlers in
there. That is simply because of my recent flu.
I've requested Heikki as main reviewer and he's accepted. Other comments
are also welcome about the user interface and the reply protocol are
also welcome. Please don't bother performance testing yet. I'll let you
know when that is appropriate.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
syncrep.v9.patchtext/x-patch; charset=UTF-8; name=syncrep.v9.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e2a2c5..807fdb4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1992,8 +1992,122 @@ SET ENABLE_SEQSCAN TO OFF;
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
+ <para>
+ You should also consider setting <varname>hot_standby_feedback</>
+ as an alternative to using this parameter.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until the
+ first reply from any standby. Multiple standby servers allow
+ increased availability and possibly increase performance as well.
+ </para>
+ <para>
+ The parameter must be set on both primary and standby.
+ </para>
+ <para>
+ On the primary, this parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ <para>
+ On the standby, the parameter value is taken only at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-allow-standalone-primary" xreflabel="allow_standalone_primary">
+ <term><varname>allow_standalone_primary</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>allow_standalone_primary</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If <varname>allow_standalone_primary</> is set, then the server
+ can operate normally whether or not replication is active. If
+ a client requests <varname>synchronous_replication</> and it is
+ not available, they will use asynchornous replication instead.
+ </para>
+ <para>
+ If <varname>allow_standalone_primary</> is not set, then the server
+ will prevent normal client connections until a standby connects that
+ has <varname>synchronous_replication_feedback</> enabled. Once
+ clients connect, if they request <varname>synchronous_replication</>
+ and it is no longer available they will wait for
+ <varname>replication_timeout_client</>.
+ </para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-replication-timeout-client" xreflabel="replication_timeout_client">
+ <term><varname>replication_timeout_client</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_client</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and a synchronous standby is currently available
+ then the commit will wait for up to <varname>replication_timeout_client</>
+ seconds before it returns a <quote>success</>. The commit will wait
+ forever for a confirmation when <varname>replication_timeout_client</>
+ is set to -1.
+ </para>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and yet no synchronous standby is available when we commit, then the
+ setting of <varname>allow_standalone_primary</> determines whether
+ or not we wait.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-replication-timeout-server" xreflabel="replication_timeout_server">
+ <term><varname>replication_timeout_server</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_server</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the primary server does not receive a reply from a standby server
+ within <varname>replication_timeout_server</> seconds then the
+ primary will terminate the replication connection.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
@@ -2084,6 +2198,42 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
+ <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>hot_standby_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether or not a hot standby will send feedback to the primary
+ about queries currently executing on the standby. This parameter can
+ be used to eliminate query cancels caused by cleanup records, though
+ it can cause database bloat on the primary for some workloads.
+ The default value is <literal>off</literal>.
+ This parameter can only be set at server start. It only has effect
+ if <varname>hot_standby</> is enabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replication-feedback" xreflabel="synchronous_replication_feedback">
+ <term><varname>synchronous_replication_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether the standby will provide reply messages to
+ allow synchronous replication on the primary.
+ Reasons for doing this might be that the standby is physically
+ co-located with the primary and so would be a bad choice as a
+ future primary server, or the standby might be a test server.
+ The default value is <literal>on</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
</sect1>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d884122..372ac27 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -737,13 +737,12 @@ archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
</para>
<para>
- Streaming replication is asynchronous, so there is still a small delay
+ There is a small replication delay
between committing a transaction in the primary and for the changes to
become visible in the standby. The delay is however much smaller than with
file-based log shipping, typically under one second assuming the standby
is powerful enough to keep up with the load. With streaming replication,
- <varname>archive_timeout</> is not required to reduce the data loss
- window.
+ <varname>archive_timeout</> is not required.
</para>
<para>
@@ -878,6 +877,236 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover. That could be zero, or more, we do not know for certain
+ either way, when using asynchronous replication.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to at least one remote
+ standby server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ Synchronous replication works in the following way. When requested,
+ the commit of a write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability if the
+ sysadmin is cautious about the placement and management of the two servers.
+ Waiting for confirmation increases the user's confidence that the changes
+ will not be lost in the event of server crashes but it also necessarily
+ increases the response time for the requesting transaction. The minimum
+ wait time is the roundtrip time between primary to standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only final top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ Synchronous replication will be active if appropriate options are
+ enabled on both the primary and at least one standby server. If
+ options are not correctly set on both servers, the primary will use
+ use asynchronous replication by default.
+ </para>
+
+ <para>
+ On the primary server we need to set
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+ and on the standby server we need to set
+
+<programlisting>
+synchronous_replication_feedback = on
+</programlisting>
+
+ On the primary, <varname>synchronous_replication</> can be set
+ for particular users or databases, or dynamically by applications
+ programs. On the standby, <varname>synchronous_replication_feedback</>
+ can only be set at server start.
+ </para>
+
+ <para>
+ If more than one standby server
+ specifies <varname>synchronous_replication_feedback</>, then whichever
+ standby replies first will release waiting commits.
+ Turning this setting off for a standby allows the administrator to
+ exclude certain standby servers from releasing waiting transactions.
+ This is useful if not all standby servers are designated as potential
+ future primary servers, such as if a standby were co-located
+ with the primary, so that a disaster would cause both servers to be lost.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ The easiest and safest method of gaining High Availability using
+ synchronous replication is to configure at least two standby servers.
+ To understand why, we need to examine what can happen when you lose all
+ standby servers.
+ </para>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until at
+ least one standby responds. The response may never occur if the last,
+ or only, standby should crash or the network drops. What should we do in
+ that situation?
+ </para>
+
+ <para>
+ Sitting and waiting will typically cause operational problems
+ because it is an effective outage of the primary server should all
+ sessions end up waiting. In contrast, allowing the primary server to
+ continue processing write transactions in the absence of a standby
+ puts those latest data changes at risk. So in this situation there
+ is a direct choice between database availability and the potential
+ durability of the data it contains. How we handle this situation
+ is controlled by <varname>allow_standalone_primary</>. The default
+ setting is <literal>on</>, allowing processing to continue, though
+ there is no recommended setting. Choosing the best setting for
+ <varname>allow_standalone_primary</> is a difficult decision and best
+ left to those with combined business responsibility for both data and
+ applications. The difficulty of this choice is the reason why we
+ recommend that you reduce the possibility of this situation occurring
+ by using multiple standby servers.
+ </para>
+
+ <para>
+ A user will stop waiting once the <varname>replication_timeout_client</>
+ has been reached for their specific session. Users are not waiting for
+ a specific standby to reply, they are waiting for a reply from any
+ standby, so the unavailability of any one standby is not significant
+ to a user. It is possible for user sessions to hit timeout even though
+ standbys are communicating normally. In that case, the setting of
+ <varname>replication_timeout</> is probably too low.
+ </para>
+
+ <para>
+ The standby sends regular status messages to the primary. If no status
+ messages have been received for <varname>replication_timeout_server</>
+ the primary server will assume the connection is dead and terminate it.
+ </para>
+
+ <para>
+ When the primary is started with <varname>allow_standalone_primary</>
+ enabled, the primary will not allow connections until a standby connects
+ that also has <varname>synchronous_replication</> enabled. This is a
+ convenience to ensure that we don't allow connections before write
+ transactions will return successfully.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it may not be properly
+ synchronized. The standby is only able to become a synchronous standby
+ once it has become synchronized, or "caught up" with the the primary.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shutdown, then the catch-up period
+ will increase according to the length of time the standby has been
+ down. You are advised to make sure <varname>allow_standalone_primary</>
+ is not set during the initial catch-up period.
+ </para>
+
+ <para>
+ If primary crashes while commits are waiting for acknowledgement, those
+ transactions will be marked fully committed if the primary database
+ recovers, no matter how <varname>allow_standalone_primary</> is set.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby. Hence this mechanism is technically
+ "semi synchronous" rather than "fully synchronous" replication. Note
+ that replication still not be fully synchronous even if we wait for
+ all standby servers, though this would reduce availability, as
+ described previously.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1392,11 +1621,18 @@ if (!triggered)
These conflicts are <emphasis>hard conflicts</> in the sense that queries
might need to be cancelled and, in some cases, sessions disconnected to resolve them.
The user is provided with several ways to handle these
- conflicts. Conflict cases include:
+ conflicts. Conflict cases in order of likely frequency are:
<itemizedlist>
<listitem>
<para>
+ Application of a vacuum cleanup record from WAL conflicts with
+ standby transactions whose snapshots can still <quote>see</> any of
+ the rows to be removed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Access Exclusive locks taken on the primary server, including both
explicit <command>LOCK</> commands and various <acronym>DDL</>
actions, conflict with table accesses in standby queries.
@@ -1416,14 +1652,8 @@ if (!triggered)
</listitem>
<listitem>
<para>
- Application of a vacuum cleanup record from WAL conflicts with
- standby transactions whose snapshots can still <quote>see</> any of
- the rows to be removed.
- </para>
- </listitem>
- <listitem>
- <para>
- Application of a vacuum cleanup record from WAL conflicts with
+ Buffer pin deadlock caused by
+ application of a vacuum cleanup record from WAL conflicts with
queries accessing the target page on the standby, whether or not
the data to be removed is visible.
</para>
@@ -1538,17 +1768,16 @@ if (!triggered)
<para>
Remedial possibilities exist if the number of standby-query cancellations
- is found to be unacceptable. The first option is to connect to the
- primary server and keep a query active for as long as needed to
- run queries on the standby. This prevents <command>VACUUM</> from removing
- recently-dead rows and so cleanup conflicts do not occur.
- This could be done using <filename>contrib/dblink</> and
- <function>pg_sleep()</>, or via other mechanisms. If you do this, you
+ is found to be unacceptable. Typically the best option is to enable
+ <varname>hot_standby_feedback</>. This prevents <command>VACUUM</> from
+ removing recently-dead rows and so cleanup conflicts do not occur.
+ If you do this, you
should note that this will delay cleanup of dead rows on the primary,
which may result in undesirable table bloat. However, the cleanup
situation will be no worse than if the standby queries were running
- directly on the primary server, and you are still getting the benefit of
- off-loading execution onto the standby.
+ directly on the primary server. You are still getting the benefit
+ of off-loading execution onto the standby and the query may complete
+ faster than it would have done on the primary server.
<varname>max_standby_archive_delay</> must be kept large in this case,
because delayed WAL files might already contain entries that conflict with
the desired standby queries.
@@ -1562,7 +1791,8 @@ if (!triggered)
a high <varname>max_standby_streaming_delay</>. However it is
difficult to guarantee any specific execution-time window with this
approach, since <varname>vacuum_defer_cleanup_age</> is measured in
- transactions executed on the primary server.
+ transactions executed on the primary server. As of version 9.1, this
+ second option is much less likely to valuable.
</para>
<para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4fee9c3..e4607ac 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -56,6 +56,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -2027,6 +2028,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1e31e07..18e9ce1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -53,6 +54,7 @@
#include "utils/snapmgr.h"
#include "pg_trace.h"
+extern void WalRcvWakeup(void); /* we are only caller, so include directly */
/*
* User-tweakable parameters
@@ -1051,7 +1053,7 @@ RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
- if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
+ if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
@@ -1121,6 +1123,14 @@ RecordTransactionCommit(void)
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
@@ -4512,6 +4522,14 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
*/
if (XactCompletionForceSyncCommit(xlrec))
XLogFlush(lsn);
+
+ /*
+ * If this standby is offering sync_rep_service then signal WALReceiver,
+ * in case it needs to send a reply just for this commit on an
+ * otherwise quiet server.
+ */
+ if (sync_rep_service)
+ WalRcvWakeup();
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6a230..d5a2a72 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -159,6 +160,11 @@ static XLogRecPtr LastRec;
* known, need to check the shared state".
*/
static bool LocalRecoveryInProgress = true;
+/*
+ * Local copy of SharedHotStandbyActive variable. False actually means "not
+ * known, need to check the shared state".
+ */
+static bool LocalHotStandbyActive = false;
/*
* Local state for XLogInsertAllowed():
@@ -395,6 +401,12 @@ typedef struct XLogCtlData
bool SharedRecoveryInProgress;
/*
+ * SharedHotStandbyActive indicates if we're still in crash or archive
+ * recovery. Protected by info_lck.
+ */
+ bool SharedHotStandbyActive;
+
+ /*
* recoveryWakeupLatch is used to wake up the startup process to
* continue WAL replay, if it is waiting for WAL to arrive or failover
* trigger file to appear.
@@ -4847,6 +4859,7 @@ XLOGShmemInit(void)
*/
XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
XLogCtl->SharedRecoveryInProgress = true;
+ XLogCtl->SharedHotStandbyActive = false;
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
@@ -5187,6 +5200,12 @@ readRecoveryCommandFile(void)
(errmsg("recovery command file \"%s\" specified neither primary_conninfo nor restore_command",
RECOVERY_COMMAND_FILE),
errhint("The database server will regularly poll the pg_xlog subdirectory to check for files placed there.")));
+
+ if (PrimaryConnInfo == NULL && sync_rep_service)
+ ereport(WARNING,
+ (errmsg("recovery command file \"%s\" specified synchronous_replication_service yet streaming was not requested",
+ RECOVERY_COMMAND_FILE),
+ errhint("Specify primary_conninfo to allow synchronous replication.")));
}
else
{
@@ -6028,6 +6047,13 @@ StartupXLOG(void)
StandbyRecoverPreparedTransactions(false);
}
}
+ else
+ {
+ /*
+ * No need to calculate feedback if we're not in Hot Standby.
+ */
+ hot_standby_feedback = false;
+ }
/* Initialize resource managers */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
@@ -6522,8 +6548,6 @@ StartupXLOG(void)
static void
CheckRecoveryConsistency(void)
{
- static bool backendsAllowed = false;
-
/*
* Have we passed our safe starting point?
*/
@@ -6543,11 +6567,19 @@ CheckRecoveryConsistency(void)
* enabling connections.
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
- !backendsAllowed &&
+ !LocalHotStandbyActive &&
reachedMinRecoveryPoint &&
IsUnderPostmaster)
{
- backendsAllowed = true;
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->SharedHotStandbyActive = true;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ LocalHotStandbyActive = true;
+
SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
}
}
@@ -6595,6 +6627,38 @@ RecoveryInProgress(void)
}
/*
+ * Is HotStandby active yet? This is only important in special backends
+ * since normal backends won't ever be able to connect until this returns
+ * true.
+ *
+ * Unlike testing standbyState, this works in any process that's connected to
+ * shared memory.
+ */
+bool
+HotStandbyActive(void)
+{
+ /*
+ * We check shared state each time only until Hot Standby is active. We
+ * can't de-activate Hot Standby, so there's no need to keep checking after
+ * the shared variable has once been seen true.
+ */
+ if (LocalHotStandbyActive)
+ return true;
+ else
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ /* spinlock is essential on machines with weak memory ordering! */
+ SpinLockAcquire(&xlogctl->info_lck);
+ LocalHotStandbyActive = xlogctl->SharedHotStandbyActive;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return LocalHotStandbyActive;
+ }
+}
+
+/*
* Is this process allowed to insert new WAL records?
*
* Ordinarily this is essentially equivalent to !RecoveryInProgress().
@@ -8870,6 +8934,25 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
}
/*
+ * Get latest redo apply position.
+ *
+ * Exported to allow WALReceiver to read the pointer directly.
+ */
+XLogRecPtr
+GetXLogReplayRecPtr(void)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ XLogRecPtr recptr;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ recptr = xlogctl->recoveryLastRecPtr;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return recptr;
+}
+
+/*
* Report the last WAL replay location (same format as pg_start_backup etc)
*
* This is useful for determining how much of WAL is visible to read-only
@@ -8878,14 +8961,10 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
Datum
pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
{
- /* use volatile pointer to prevent code rearrangement */
- volatile XLogCtlData *xlogctl = XLogCtl;
XLogRecPtr recptr;
char location[MAXFNAMELEN];
- SpinLockAcquire(&xlogctl->info_lck);
- recptr = xlogctl->recoveryLastRecPtr;
- SpinLockRelease(&xlogctl->info_lck);
+ recptr = GetXLogReplayRecPtr();
if (recptr.xlogid == 0 && recptr.xrecoff == 0)
PG_RETURN_NULL();
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 718e996..506e908 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -502,7 +502,11 @@ CREATE VIEW pg_stat_replication AS
S.client_port,
S.backend_start,
W.state,
- W.sent_location
+ W.sync,
+ W.sent_location,
+ W.write_location,
+ W.flush_location,
+ W.apply_location
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 179048f..6bdc43b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -275,6 +275,7 @@ typedef enum
PM_STARTUP, /* waiting for startup subprocess */
PM_RECOVERY, /* in archive recovery mode */
PM_HOT_STANDBY, /* in hot standby mode */
+ PM_WAIT_FOR_REPLICATION, /* waiting for sync replication to become active */
PM_RUN, /* normal "database is alive" state */
PM_WAIT_BACKUP, /* waiting for online backup mode to end */
PM_WAIT_READONLY, /* waiting for read only backends to exit */
@@ -735,6 +736,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\" or \"hot_standby\"")));
+ if (!allow_standalone_primary && max_wal_senders == 0)
+ ereport(ERROR,
+ (errmsg("WAL streaming (max_wal_senders > 0) is required if allow_standalone_primary = off")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1845,6 +1849,12 @@ retry1:
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is in recovery mode")));
break;
+ case CAC_REPLICATION_ONLY:
+ if (!am_walsender)
+ ereport(FATAL,
+ (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ errmsg("the database system is waiting for replication to start")));
+ break;
case CAC_TOOMANY:
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
@@ -1942,7 +1952,9 @@ canAcceptConnections(void)
*/
if (pmState != PM_RUN)
{
- if (pmState == PM_WAIT_BACKUP)
+ if (pmState == PM_WAIT_FOR_REPLICATION)
+ result = CAC_REPLICATION_ONLY; /* allow replication only */
+ else if (pmState == PM_WAIT_BACKUP)
result = CAC_WAITBACKUP; /* allow superusers only */
else if (Shutdown > NoShutdown)
return CAC_SHUTDOWN; /* shutdown is pending */
@@ -2396,8 +2408,13 @@ reaper(SIGNAL_ARGS)
* Startup succeeded, commence normal operations
*/
FatalError = false;
- ReachedNormalRunning = true;
- pmState = PM_RUN;
+ if (allow_standalone_primary)
+ {
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+ else
+ pmState = PM_WAIT_FOR_REPLICATION;
/*
* Crank up the background writer, if we didn't do that already
@@ -3221,8 +3238,8 @@ BackendStartup(Port *port)
/* Pass down canAcceptConnections state */
port->canAcceptConnections = canAcceptConnections();
bn->dead_end = (port->canAcceptConnections != CAC_OK &&
- port->canAcceptConnections != CAC_WAITBACKUP);
-
+ port->canAcceptConnections != CAC_WAITBACKUP &&
+ port->canAcceptConnections != CAC_REPLICATION_ONLY);
/*
* Unless it's a dead_end child, assign it a child slot number
*/
@@ -4272,6 +4289,16 @@ sigusr1_handler(SIGNAL_ARGS)
WalReceiverPID = StartWalReceiver();
}
+ if (CheckPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE) &&
+ pmState == PM_WAIT_FOR_REPLICATION)
+ {
+ /* Allow connections now that a synchronous replication standby
+ * has successfully connected and is active.
+ */
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+
PG_SETMASK(&UnBlockSig);
errno = save_errno;
@@ -4510,6 +4537,7 @@ static void
StartAutovacuumWorker(void)
{
Backend *bn;
+ CAC_state cac = CAC_OK;
/*
* If not in condition to run a process, don't try, but handle it like a
@@ -4518,7 +4546,8 @@ StartAutovacuumWorker(void)
* we have to check to avoid race-condition problems during DB state
* changes.
*/
- if (canAcceptConnections() == CAC_OK)
+ cac = canAcceptConnections();
+ if (cac == CAC_OK || cac == CAC_REPLICATION_ONLY)
{
bn = (Backend *) malloc(sizeof(Backend));
if (bn)
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 42c6eaf..3fe490e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
- repl_gram.o
+ repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 9c2e0d8..7387224 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -1,5 +1,27 @@
src/backend/replication/README
+Overview
+--------
+
+The WALSender sends WAL data and receives replies. The WALReceiver
+receives WAL data and sends replies.
+
+If there is no more WAL data to send then WALSender goes quiet,
+apart from checking for replies. If there is no more WAL data
+to receive then WALReceiver keeps sending replies until all the data
+received has been applied, then it too goes quiet. When all is quiet
+WALReceiver sends regular replies so that WALSender knows the link
+is still working - we don't want to wait until a transaction
+arrives before we try to determine the health of the connection.
+
+WALReceiver sends one reply per message received. If nothing is
+received it sends one reply every time apply pointer advances,
+with a minimum of one reply each cycletime.
+
+For synchronous replication, all decisions about whether to wait
+and how long to wait are taken on the primary. The standby has no
+state information about what is happening on the primary.
+
Walreceiver - libpqwalreceiver API
----------------------------------
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
new file mode 100644
index 0000000..12a3825
--- /dev/null
+++ b/src/backend/replication/syncrep.c
@@ -0,0 +1,641 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the standby, or the wait hits timeout.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary is aware of which
+ * standby servers offer a synchronisation service. The standby is
+ * completely unaware of the durability requirements of transactions
+ * on the primary, reducing the complexity of the code and streamlining
+ * both standby operations and network bandwidth because there is no
+ * requirement to ship per-transaction state information.
+ *
+ * The bookeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then it follows exactly one rigid definition of
+ * synchronous replication as laid out by the various parameters. If we
+ * change the definition of replication, we'll need to scan through all
+ * waiting backends to see if we should now release them.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching the through all waiters each time we receive a reply.
+ *
+ * Starting sync replication is a two stage process. First, the standby
+ * must have caught up with the primary; that may take some time. Next,
+ * we must receive a reply from the standby before we change state so
+ * that sync rep is fully active and commits can wait on us.
+ *
+ * XXX Changing state to a sync rep service while we are running allows
+ * us to enable sync replication via SIGHUP on the standby at a later
+ * time, without restart, if we need to do that. Though you can't turn
+ * it off without disconnecting.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/guc_tables.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+
+/* User-settable parameters for sync rep */
+bool sync_rep_mode = false; /* Only set in user backends */
+int sync_rep_timeout_client = 120; /* Only set in user backends */
+int sync_rep_timeout_server = 30; /* Only set in user backends */
+bool sync_rep_service = false; /* Never set in user backends */
+bool hot_standby_feedback = true;
+
+/*
+ * Queuing code is written to allow later extension to multiple
+ * queues. Currently, we use just one queue (==FSYNC).
+ *
+ * XXX We later expect to have RECV, FSYNC and APPLY modes.
+ */
+#define SYNC_REP_NOT_ON_QUEUE -1
+#define SYNC_REP_FSYNC 0
+#define IsOnSyncRepQueue() (current_queue > SYNC_REP_NOT_ON_QUEUE)
+/*
+ * Queue identifier of the queue on which user backend currently waits.
+ */
+static int current_queue = SYNC_REP_NOT_ON_QUEUE;
+
+static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid);
+static void SyncRepRemoveFromQueue(void);
+static void SyncRepAddToQueue(int qid);
+static bool SyncRepServiceAvailable(void);
+static long SyncRepGetWaitTimeout(void);
+
+static void SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn);
+
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+/*
+ * Wait for synchronous replication, if requested by user.
+ */
+extern void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if user has requested async replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (max_wal_senders == 0 || !sync_rep_mode)
+ return;
+
+ Assert(sync_rep_mode);
+
+ if (allow_standalone_primary)
+ {
+ bool avail_sync_mode;
+
+ /*
+ * Check that the service level we want is available.
+ * If not, downgrade the service level to async.
+ */
+ avail_sync_mode = SyncRepServiceAvailable();
+
+ /*
+ * Perform the wait here, then drop through and exit.
+ */
+ if (avail_sync_mode)
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+ else
+ {
+ /*
+ * Wait only on the service level requested,
+ * whether or not it is currently available.
+ * Sounds weird, but this mode exists to protect
+ * against changes that will only occur on primary.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+}
+
+/*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+static void
+SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[0]);
+ TimestampTz now = GetCurrentTransactionStopTimestamp();
+ long timeout = SyncRepGetWaitTimeout(); /* seconds */
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+
+ /*
+ * No need to wait for autovacuums. If the standby does go away and
+ * we wait for it to return we may as well do some usefulwork locally.
+ * This is critical since we may need to perform emergency vacuuming
+ * and cannot wait for standby to return.
+ */
+ if (IsAutoVacuumWorkerProcess())
+ return;
+
+ ereport(DEBUG2,
+ (errmsg("synchronous replication waiting for %X/%X starting at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTransactionStopTimestamp()))));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ /*
+ * First time through, add ourselves to the appropriate queue.
+ */
+ if (!IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ SpinLockRelease(&queue->qlock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepAddToQueue(qid);
+ SpinLockRelease(&queue->qlock);
+ current_queue = qid; /* Remember which queue we're on */
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 21 + 1);
+ memcpy(new_status, old_status, len);
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting" */
+ }
+ else
+ {
+ bool release = false;
+ bool timeout = false;
+
+ SpinLockAcquire(&queue->qlock);
+
+ /*
+ * Check the LSN on our queue and if its moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ now,
+ timeout))
+ {
+ release = true;
+ timeout = true;
+ }
+
+ if (release)
+ {
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ /*
+ * Our response to the timeout is to simply post a NOTICE and
+ * then return to the user. The commit has happened, we just
+ * haven't been able to verify it has been replicated to the
+ * level requested.
+ *
+ * XXX We could check here to see if our LSN has been sent to
+ * another standby that offers a lower level of service. That
+ * could be true if we had, for example, requested 'apply'
+ * with two standbys, one at 'apply' and one at 'recv' and the
+ * apply standby has just gone down. Something for the weekend.
+ */
+ if (timeout)
+ ereport(NOTICE,
+ (errmsg("synchronous replication timeout at %s",
+ timestamptz_to_str(now))));
+ else
+ ereport(DEBUG2,
+ (errmsg("synchronous replication wait complete at %s",
+ timestamptz_to_str(now))));
+
+ /* XXX Do we need to unset the latch? */
+ return;
+ }
+
+ SpinLockRelease(&queue->qlock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, timeout);
+ now = GetCurrentTimestamp();
+ }
+}
+
+/*
+ * Remove myself from sync rep wait queue.
+ *
+ * Assume on queue at start; will not be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ *
+ * XXX Implements design pattern "Reinvent Wheel", think about changing
+ */
+void
+SyncRepRemoveFromQueue(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[current_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+
+ Assert(IsOnSyncRepQueue());
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "removing myself from queue %d", current_queue);
+#endif
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ if (proc == MyProc)
+ {
+ elog(LOG, "proc %d lsn %X/%X is MyProc",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ else
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ numprocs++;
+ }
+
+ proc = queue->head;
+
+ if (proc == MyProc)
+ {
+ if (MyProc->lwWaitLink == NULL)
+ {
+ /*
+ * We were the only waiter on the queue. Reset head and tail.
+ */
+ Assert(queue->tail == MyProc);
+ queue->head = NULL;
+ queue->tail = NULL;
+ }
+ else
+ /*
+ * Move head to next proc on the queue.
+ */
+ queue->head = MyProc->lwWaitLink;
+ }
+ else
+ {
+ while (proc->lwWaitLink != NULL)
+ {
+ /* Are we the next proc in our traversal of the queue? */
+ if (proc->lwWaitLink == MyProc)
+ {
+ /*
+ * Remove ourselves from middle of queue.
+ * No need to touch head or tail.
+ */
+ proc->lwWaitLink = MyProc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL)
+ elog(WARNING, "could not locate ourselves on wait queue");
+ proc = proc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL) /* At tail */
+ {
+ Assert(proc == MyProc);
+ /* Remove ourselves from tail of queue */
+ Assert(queue->tail == MyProc);
+ queue->tail = proc;
+ proc->lwWaitLink = NULL;
+ }
+ }
+ MyProc->lwWaitLink = NULL;
+ current_queue = SYNC_REP_NOT_ON_QUEUE;
+}
+
+/*
+ * Add myself to sync rep wait queue.
+ *
+ * Assume not on queue at start; will be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ */
+static void
+SyncRepAddToQueue(int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+ PGPROC *tail = queue->tail;
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "adding myself to queue %d", qid);
+#endif
+
+ /*
+ * Add myself to tail of wait queue.
+ */
+ if (tail == NULL)
+ {
+ queue->head = MyProc;
+ queue->tail = MyProc;
+ }
+ else
+ {
+ /*
+ * XXX extra code needed here to maintain sorted invariant.
+ * Our approach should be same as racing car - slow in, fast out.
+ */
+ Assert(tail->lwWaitLink == NULL);
+ tail->lwWaitLink = MyProc;
+ }
+ queue->tail = MyProc;
+
+ /*
+ * This used to be an Assert, but it keeps failing... why?
+ */
+ MyProc->lwWaitLink = NULL; /* to be sure */
+}
+
+/*
+ * Dynamically decide the sync rep wait mode. It may seem a trifle
+ * wasteful to do this for every transaction but we need to do this
+ * so we can cope sensibly with standby disconnections. It's OK to
+ * spend a few cycles here anyway, since while we're doing this the
+ * WALSender will be sending the data we want to wait for, so this
+ * is dead time and the user has requested to wait anyway.
+ */
+static bool
+SyncRepServiceAvailable(void)
+{
+ bool result = false;
+
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ result = WalSndCtl->sync_rep_service_available;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+
+ return result;
+}
+
+/*
+ * Allows more complex decision making about what the wait time should be.
+ */
+static long
+SyncRepGetWaitTimeout(void)
+{
+ if (sync_rep_timeout_client <= 0)
+ return -1L;
+
+ return 1000000L * sync_rep_timeout_client;
+}
+
+void
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+/*
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+
+ if (IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+ }
+*/
+
+ if (MyProc != NULL && MyProc->ownLatch)
+ {
+ DisownLatch(&MyProc->waitLatch);
+ MyProc->ownLatch = false;
+ }
+}
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+/*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+void
+SyncRepReleaseWaiters(bool timeout)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ int mode;
+
+ /*
+ * If we are now streaming, and haven't yet enabled the sync rep service
+ * do so now. We don't enable sync rep service during a base backup since
+ * during that action we aren't sending WAL at all, so there cannot be
+ * any meaningful replies. We don't enable sync rep service while we
+ * are still in catchup mode either, since clients might experience an
+ * extended wait (perhaps hours) if they waited at that point.
+ *
+ * Note that we do release waiters, even if they aren't enabled yet.
+ * That sounds strange, but we may have dropped the connection and
+ * reconnected, so there may still be clients waiting for a response
+ * from when we were connected previously.
+ *
+ * If we already have a sync rep server connected, don't enable
+ * this server as well.
+ *
+ * XXX expect to be able to support multiple sync standbys in future.
+ */
+ if (!MyWalSnd->sync_rep_service &&
+ MyWalSnd->state == WALSNDSTATE_STREAMING &&
+ !SyncRepServiceAvailable())
+ {
+ ereport(LOG,
+ (errmsg("enabling synchronous replication service for standby")));
+
+ /*
+ * Update state for this WAL sender.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sync_rep_service = true;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
+ /*
+ * We have at least one standby, so we're open for business.
+ */
+ {
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ WalSndCtl->sync_rep_service_available = true;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+ }
+
+ /*
+ * Let postmaster know we can allow connections, if the user
+ * requested waiting until sync rep was active before starting.
+ * We send this unconditionally to avoid more complexity in
+ * postmaster code.
+ */
+ if (IsUnderPostmaster)
+ SendPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE);
+ }
+
+ /*
+ * No point trying to release waiters while doing a base backup
+ */
+ if (MyWalSnd->state == WALSNDSTATE_BACKUP)
+ return;
+
+#ifdef SYNCREP_DEBUG
+ elog(LOG, "releasing waiters up to flush = %X/%X",
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+
+
+ /*
+ * Only maintain LSNs of queues for which we advertise a service.
+ * This is important to ensure that we only wakeup users when a
+ * preferred standby has reached the required LSN.
+ *
+ * Since sycnhronous_replication_mode is currently a boolean, we either
+ * offer all modes, or none.
+ */
+ for (mode = 0; mode < NUM_SYNC_REP_WAIT_MODES; mode++)
+ {
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[mode]);
+
+ /*
+ * Lock the queue. Not really necessary with just one sync standby
+ * but it makes clear what needs to happen.
+ */
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLT(queue->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ queue->lsn = MyWalSnd->flush;
+ SyncRepWakeFromQueue(mode, MyWalSnd->flush);
+ }
+ SpinLockRelease(&queue->qlock);
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "q%d queue = %X/%X flush = %X/%X", mode,
+ queue->lsn.xlogid, queue->lsn.xrecoff,
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+ }
+}
+
+/*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold spinlock on queue.
+ */
+static void
+SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[wait_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+ int totalprocs = 0;
+
+ if (proc == NULL)
+ return;
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+
+ if (XLByteLE(proc->waitLSN, lsn))
+ {
+ numprocs++;
+ SetLatch(&proc->waitLatch);
+ }
+ totalprocs++;
+ }
+ elog(DEBUG2, "released %d procs out of %d waiting procs", numprocs, totalprocs);
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "released %d procs up to %X/%X", numprocs, lsn.xlogid, lsn.xrecoff);
+#endif
+}
+
+void
+SyncRepTimeoutExceeded(void)
+{
+ SyncRepReleaseWaiters(true);
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d257caf..38f9b8e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -38,6 +38,7 @@
#include <signal.h>
#include <unistd.h>
+#include "access/transam.h"
#include "access/xlog_internal.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
@@ -45,6 +46,7 @@
#include "replication/walreceiver.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -84,9 +86,11 @@ static volatile sig_atomic_t got_SIGTERM = false;
*/
static struct
{
- XLogRecPtr Write; /* last byte + 1 written out in the standby */
- XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
-} LogstreamResult;
+ XLogRecPtr Write; /* last byte + 1 written out in the standby */
+ XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
+} LogstreamResult;
+
+static char *reply_message;
/*
* About SIGTERM handling:
@@ -114,6 +118,7 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(void);
+static void XLogWalRcvSendReply(void);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -204,6 +209,8 @@ WalReceiverMain(void)
/* Advertise our PID so that the startup process can kill us */
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_RUNNING;
+ elog(DEBUG2, "WALreceiver starting");
+ OwnLatch(&WalRcv->latch); /* Run before signals enabled, since they can wakeup latch */
/* Fetch information required to start streaming */
strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
@@ -265,12 +272,19 @@ WalReceiverMain(void)
walrcv_connect(conninfo, startpoint);
DisableWalRcvImmediateExit();
+ /*
+ * Allocate buffer that will be used for each output message. We do this
+ * just once to reduce palloc overhead.
+ */
+ reply_message = palloc(sizeof(StandbyReplyMessage));
+
/* Loop until end-of-streaming or error */
for (;;)
{
unsigned char type;
char *buf;
int len;
+ bool received_all = false;
/*
* Emergency bailout if postmaster has died. This is to avoid the
@@ -296,21 +310,44 @@ WalReceiverMain(void)
ProcessConfigFile(PGC_SIGHUP);
}
- /* Wait a while for data to arrive */
- if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
+ ResetLatch(&WalRcv->latch);
+
+ if (walrcv_receive(0, &type, &buf, &len))
{
- /* Accept the received data, and process it */
+ received_all = false;
XLogWalRcvProcessMsg(type, buf, len);
+ }
+ else
+ received_all = true;
- /* Receive any more data we can without sleeping */
- while (walrcv_receive(0, &type, &buf, &len))
- XLogWalRcvProcessMsg(type, buf, len);
+ XLogWalRcvSendReply();
+ if (received_all && !got_SIGHUP && !got_SIGTERM)
+ {
/*
- * If we've written some records, flush them to disk and let the
- * startup process know about them.
+ * Flush, then reply.
+ *
+ * XXX We really need the WALWriter active as well
*/
XLogWalRcvFlush();
+ XLogWalRcvSendReply();
+
+ /*
+ * Sleep for up to 500 ms, the fixed keepalive delay.
+ *
+ * We will be woken if new data is received from primary
+ * or if a commit is applied. This is sub-optimal in the
+ * case where a group of commits arrive, then it all goes
+ * quiet, but its not worth the extra code to handle both
+ * that and the simple case of a single commit.
+ *
+ * Note that we do not need to wake up when the Startup
+ * process has applied the last outstanding record. That
+ * is interesting iff that is a commit record.
+ */
+ pg_usleep(1000000L); /* slow down loop for debugging */
+// WaitLatchOrSocket(&WalRcv->latch, MyProcPort->sock,
+// 500000L);
}
}
}
@@ -331,6 +368,8 @@ WalRcvDie(int code, Datum arg)
walrcv->pid = 0;
SpinLockRelease(&walrcv->mutex);
+ DisownLatch(&WalRcv->latch);
+
/* Terminate the connection gracefully. */
if (walrcv_disconnect != NULL)
walrcv_disconnect();
@@ -341,6 +380,7 @@ static void
WalRcvSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ WalRcvWakeup();
}
/* SIGTERM: set flag for main loop, or shutdown immediately if safe */
@@ -348,6 +388,7 @@ static void
WalRcvShutdownHandler(SIGNAL_ARGS)
{
got_SIGTERM = true;
+ WalRcvWakeup();
/* Don't joggle the elbow of proc_exit */
if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
@@ -545,3 +586,58 @@ XLogWalRcvFlush(void)
}
}
}
+
+/*
+ * Send reply message to primary. Returns false if message send failed.
+ *
+ * Our reply consists solely of the current state of the standby. Standby
+ * doesn't make any attempt to remember requests made by transactions on
+ * the primary.
+ */
+static void
+XLogWalRcvSendReply(void)
+{
+ StandbyReplyMessage reply;
+
+ if (!sync_rep_service && !hot_standby_feedback)
+ return;
+
+ /*
+ * Set sub-protocol message type for a StandbyReplyMessage.
+ */
+ if (sync_rep_service)
+ {
+ reply.write = LogstreamResult.Write;
+ reply.flush = LogstreamResult.Flush;
+ reply.apply = GetXLogReplayRecPtr();
+ }
+
+ if (hot_standby_feedback && HotStandbyActive())
+ reply.xmin = GetOldestXmin(true, false);
+ else
+ reply.xmin = InvalidTransactionId;
+
+ reply.sendTime = GetCurrentTimestamp();
+
+ memcpy(reply_message, &reply, sizeof(StandbyReplyMessage));
+
+ elog(DEBUG2, "sending write = %X/%X "
+ "flush = %X/%X "
+ "apply = %X/%X "
+ "xmin = %d ",
+ reply.write.xlogid, reply.write.xrecoff,
+ reply.flush.xlogid, reply.flush.xrecoff,
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
+
+ walrcv_send(reply_message, sizeof(StandbyReplyMessage));
+}
+
+/* Wake up the WalRcv
+ * Prototype goes in xact.c since that is only external caller
+ */
+void
+WalRcvWakeup(void)
+{
+ SetLatch(&WalRcv->latch);
+};
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 04c9004..da97528 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -64,6 +64,7 @@ WalRcvShmemInit(void)
MemSet(WalRcv, 0, WalRcvShmemSize());
WalRcv->walRcvState = WALRCV_STOPPED;
SpinLockInit(&WalRcv->mutex);
+ InitSharedLatch(&WalRcv->latch);
}
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d078501..e863a1b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -39,6 +39,7 @@
#include "funcapi.h"
#include "access/xlog_internal.h"
+#include "access/transam.h"
#include "catalog/pg_type.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
@@ -63,7 +64,7 @@
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
-static WalSnd *MyWalSnd = NULL;
+WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
@@ -71,6 +72,7 @@ bool am_walsender = false; /* Am I a walsender process ? */
/* User-settable parameters for walsender */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
+bool allow_standalone_primary = true; /* action if no sync standby active */
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
@@ -87,6 +89,9 @@ static uint32 sendOff = 0;
*/
static XLogRecPtr sentPtr = {0, 0};
+static StringInfoData input_message;
+static TimestampTz last_reply_timestamp;
+
/* Flags set by signal handlers for later service in main loop */
static volatile sig_atomic_t got_SIGHUP = false;
volatile sig_atomic_t walsender_shutdown_requested = false;
@@ -107,10 +112,10 @@ static void WalSndHandshake(void);
static void WalSndKill(int code, Datum arg);
static void XLogRead(char *buf, XLogRecPtr recptr, Size nbytes);
static bool XLogSend(char *msgbuf, bool *caughtup);
-static void CheckClosedConnection(void);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd * cmd);
-
+static void ProcessStandbyReplyMessage(void);
+static void ProcessRepliesIfAny(void);
/* Main entry point for walsender process */
int
@@ -148,6 +153,8 @@ WalSenderMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
PG_SETMASK(&UnBlockSig);
+ elog(DEBUG2, "WALsender starting");
+
/* Tell the standby that walsender is ready for receiving commands */
ReadyForQuery(DestRemote);
@@ -164,6 +171,8 @@ WalSenderMain(void)
SpinLockRelease(&walsnd->mutex);
}
+ elog(DEBUG2, "WALsender handshake complete");
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -174,7 +183,6 @@ WalSenderMain(void)
static void
WalSndHandshake(void)
{
- StringInfoData input_message;
bool replication_started = false;
initStringInfo(&input_message);
@@ -248,6 +256,11 @@ WalSndHandshake(void)
errmsg("invalid standby handshake message type %d", firstchar)));
}
}
+
+ /*
+ * Initialize our timeout checking mechanism.
+ */
+ last_reply_timestamp = GetCurrentTimestamp();
}
/*
@@ -386,12 +399,14 @@ HandleReplicationCommand(const char *cmd_string)
/* break out of the loop */
replication_started = true;
+ WalSndSetState(WALSNDSTATE_CATCHUP);
break;
case T_BaseBackupCmd:
{
BaseBackupCmd *cmd = (BaseBackupCmd *) cmd_node;
+ WalSndSetState(WALSNDSTATE_BACKUP);
SendBaseBackup(cmd->label, cmd->progress);
/* Send CommandComplete and ReadyForQuery messages */
@@ -418,7 +433,7 @@ HandleReplicationCommand(const char *cmd_string)
* Check if the remote end has closed the connection.
*/
static void
-CheckClosedConnection(void)
+ProcessRepliesIfAny(void)
{
unsigned char firstchar;
int r;
@@ -442,6 +457,13 @@ CheckClosedConnection(void)
switch (firstchar)
{
/*
+ * 'd' means a standby reply wrapped in a COPY BOTH packet.
+ */
+ case 'd':
+ ProcessStandbyReplyMessage();
+ break;
+
+ /*
* 'X' means that the standby is closing down the socket.
*/
case 'X':
@@ -455,6 +477,64 @@ CheckClosedConnection(void)
}
}
+/*
+ * Receive StandbyReplyMessage. False if message send failed.
+ */
+static void
+ProcessStandbyReplyMessage(void)
+{
+ StandbyReplyMessage reply;
+
+ /*
+ * Read the message contents.
+ */
+ if (pq_getmessage(&input_message, 0))
+ {
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected EOF on standby connection")));
+ proc_exit(0);
+ }
+
+ pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
+
+ elog(DEBUG2, "write = %X/%X "
+ "flush = %X/%X "
+ "apply = %X/%X "
+ "xmin = %d ",
+ reply.write.xlogid, reply.write.xrecoff,
+ reply.flush.xlogid, reply.flush.xrecoff,
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
+
+ /*
+ * Update shared state for this WalSender process
+ * based on reply data from standby.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ if (XLByteLT(walsnd->write, reply.write))
+ walsnd->write = reply.write;
+ if (XLByteLT(walsnd->flush, reply.flush))
+ walsnd->flush = reply.flush;
+ if (XLByteLT(walsnd->apply, reply.apply))
+ walsnd->apply = reply.apply;
+ SpinLockRelease(&walsnd->mutex);
+
+ if (TransactionIdIsValid(reply.xmin) &&
+ TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+ MyProc->xmin = reply.xmin;
+ }
+
+ /*
+ * Release any backends waiting to commit.
+ */
+ SyncRepReleaseWaiters(false);
+}
+
/* Main loop of walsender process */
static int
WalSndLoop(void)
@@ -494,6 +574,7 @@ WalSndLoop(void)
{
if (!XLogSend(output_message, &caughtup))
break;
+ ProcessRepliesIfAny();
if (caughtup)
walsender_shutdown_requested = true;
}
@@ -501,7 +582,11 @@ WalSndLoop(void)
/* Normal exit from the walsender is here */
if (walsender_shutdown_requested)
{
- /* Inform the standby that XLOG streaming was done */
+ ProcessRepliesIfAny();
+
+ /* Inform the standby that XLOG streaming was done
+ * by sending CommandComplete message.
+ */
pq_puttextmessage('C', "COPY 0");
pq_flush();
@@ -509,12 +594,31 @@ WalSndLoop(void)
}
/*
- * If we had sent all accumulated WAL in last round, nap for the
- * configured time before retrying.
+ * If we had sent all accumulated WAL in last round, then we don't
+ * have much to do. We still expect a steady stream of replies from
+ * standby. It is important to note that we don't keep track of
+ * whether or not there are backends waiting here, since that
+ * is potentially very complex state information.
+ *
+ * Also note that there is no delay between sending data and
+ * checking for the replies. We expect replies to take some time
+ * and we are more concerned with overall throughput than absolute
+ * response time to any single request.
*/
if (caughtup)
{
/*
+ * If we were still catching up, change state to streaming.
+ * While in the initial catchup phase, clients waiting for
+ * a response from the standby would wait for a very long
+ * time, so we need to have a one-way state transition to avoid
+ * problems. No need to grab a lock for the check; we are the
+ * only one to ever change the state.
+ */
+ if (MyWalSnd->state < WALSNDSTATE_STREAMING)
+ WalSndSetState(WALSNDSTATE_STREAMING);
+
+ /*
* Even if we wrote all the WAL that was available when we started
* sending, more might have arrived while we were sending this
* batch. We had the latch set while sending, so we have not
@@ -527,6 +631,13 @@ WalSndLoop(void)
break;
if (caughtup && !got_SIGHUP && !walsender_ready_to_stop && !walsender_shutdown_requested)
{
+ long timeout;
+
+ if (sync_rep_timeout_server == -1)
+ timeout = -1L;
+ else
+ timeout = 1000000L * sync_rep_timeout_server;
+
/*
* XXX: We don't really need the periodic wakeups anymore,
* WaitLatchOrSocket should reliably wake up as soon as
@@ -534,12 +645,15 @@ WalSndLoop(void)
*/
/* Sleep */
- WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
- WalSndDelay * 1000L);
+ if (WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
+ timeout) == 0)
+ {
+ ereport(LOG,
+ (errmsg("streaming replication timeout after %d s",
+ sync_rep_timeout_server)));
+ break;
+ }
}
-
- /* Check if the connection was closed */
- CheckClosedConnection();
}
else
{
@@ -548,12 +662,11 @@ WalSndLoop(void)
break;
}
- /* Update our state to indicate if we're behind or not */
- WalSndSetState(caughtup ? WALSNDSTATE_STREAMING : WALSNDSTATE_CATCHUP);
+ ProcessRepliesIfAny();
}
/*
- * Get here on send failure. Clean up and exit.
+ * Get here on send failure or timeout. Clean up and exit.
*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -779,9 +892,9 @@ XLogSend(char *msgbuf, bool *caughtup)
* Attempt to send all data that's already been written out and fsync'd to
* disk. We cannot go further than what's been written out given the
* current implementation of XLogRead(). And in any case it's unsafe to
- * send WAL that is not securely down to disk on the master: if the master
+ * send WAL that is not securely down to disk on the primary: if the primary
* subsequently crashes and restarts, slaves must not have applied any WAL
- * that gets lost on the master.
+ * that gets lost on the primary.
*/
SendRqstPtr = GetFlushRecPtr();
@@ -859,6 +972,9 @@ XLogSend(char *msgbuf, bool *caughtup)
msghdr.walEnd = SendRqstPtr;
msghdr.sendTime = GetCurrentTimestamp();
+ elog(DEBUG2, "sent = %X/%X ",
+ startptr.xlogid, startptr.xrecoff);
+
memcpy(msgbuf + 1, &msghdr, sizeof(WalDataMessageHeader));
pq_putmessage('d', msgbuf, 1 + sizeof(WalDataMessageHeader) + nbytes);
@@ -1016,6 +1132,16 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
InitSharedLatch(&walsnd->latch);
}
+
+ /*
+ * Initialise the spinlocks on each sync rep queue
+ */
+ for (i = 0; i < NUM_SYNC_REP_WAIT_MODES; i++)
+ {
+ SyncRepQueue *queue = &WalSndCtl->sync_rep_queue[i];
+
+ SpinLockInit(&queue->qlock);
+ }
}
}
@@ -1075,7 +1201,7 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 3
+#define PG_STAT_GET_WAL_SENDERS_COLS 7
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -1112,9 +1238,13 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
- char sent_location[MAXFNAMELEN];
+ char location[MAXFNAMELEN];
XLogRecPtr sentPtr;
+ XLogRecPtr write;
+ XLogRecPtr flush;
+ XLogRecPtr apply;
WalSndState state;
+ bool sync_rep_service;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -1124,15 +1254,38 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
SpinLockAcquire(&walsnd->mutex);
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ write = walsnd->write;
+ flush = walsnd->flush;
+ apply = walsnd->apply;
+ sync_rep_service = walsnd->sync_rep_service;
SpinLockRelease(&walsnd->mutex);
- snprintf(sent_location, sizeof(sent_location), "%X/%X",
- sentPtr.xlogid, sentPtr.xrecoff);
-
memset(nulls, 0, sizeof(nulls));
values[0] = Int32GetDatum(walsnd->pid);
values[1] = CStringGetTextDatum(WalSndGetStateString(state));
- values[2] = CStringGetTextDatum(sent_location);
+ values[2] = BoolGetDatum(sync_rep_service);
+
+ snprintf(location, sizeof(location), "%X/%X",
+ sentPtr.xlogid, sentPtr.xrecoff);
+ values[3] = CStringGetTextDatum(location);
+
+ if (write.xlogid == 0 && write.xrecoff == 0)
+ nulls[4] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ write.xlogid, write.xrecoff);
+ values[4] = CStringGetTextDatum(location);
+
+ if (flush.xlogid == 0 && flush.xrecoff == 0)
+ nulls[5] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ flush.xlogid, flush.xrecoff);
+ values[5] = CStringGetTextDatum(location);
+
+ if (apply.xlogid == 0 && apply.xrecoff == 0)
+ nulls[6] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ apply.xlogid, apply.xrecoff);
+ values[6] = CStringGetTextDatum(location);
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
}
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index be577bc..7aa7671 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -39,6 +39,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -196,6 +197,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -214,6 +216,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -224,6 +227,7 @@ InitProcGlobal(void)
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&procs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
@@ -326,6 +330,13 @@ InitProcess(void)
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+ MyProc->ownLatch = true;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -365,6 +376,7 @@ InitProcessPhase2(void)
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e4dea31..2fd9916 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -55,6 +55,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+#include "replication/syncrep.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/standby.h"
@@ -619,6 +620,15 @@ const char *const config_type_names[] =
static struct config_bool ConfigureNamesBool[] =
{
{
+ {"allow_standalone_primary", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Refuse connections on startup and force users to wait forever if synchronous replication has failed."),
+ NULL
+ },
+ &allow_standalone_primary,
+ true, NULL, NULL
+ },
+
+ {
{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of sequential-scan plans."),
NULL
@@ -1261,6 +1271,33 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+
+ {
+ {"synchronous_replication_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a standby to primary for synchronous replication."),
+ NULL
+ },
+ &sync_rep_service,
+ true, NULL, NULL
+ },
+
+ {
+ {"hot_standby_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a hot standby to primary to avoid query conflicts."),
+ NULL
+ },
+ &hot_standby_feedback,
+ false, NULL, NULL
+ },
+
+ {
{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
gettext_noop("Allows modifications of the structure of system tables."),
NULL,
@@ -1456,6 +1493,26 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"replication_timeout_client", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Clients waiting for confirmation will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_client,
+ 120, -1, INT_MAX, NULL, NULL
+ },
+
+ {
+ {"replication_timeout_server", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Replication connection will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_server,
+ 30, -1, INT_MAX, NULL, NULL
+ },
+
+ {
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f436b83..d0f51c7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,7 +184,15 @@
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
-# - Streaming Replication -
+# - Replication - User Settings
+
+#synchronous_replication = off # commit waits for reply from standby
+#replication_timeout_client = 120 # -1 means wait forever
+
+# - Streaming Replication - Server Settings
+
+#allow_standalone_primary = on # sync rep parameter
+#replication_timeout_client = 30 # -1 means wait forever
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
@@ -196,6 +204,8 @@
#hot_standby = off # "on" allows queries during recovery
# (change requires restart)
+#hot_standby_feedback = off # info from standby to prevent query conflicts
+#synchronous_replication_feedback = off # allows sync replication
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 74d3427..4735ec9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -288,8 +288,10 @@ extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
extern bool RecoveryInProgress(void);
+extern bool HotStandbyActive(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
+extern XLogRecPtr GetXLogReplayRecPtr(void);
extern void UpdateControlFile(void);
extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8b5d4d..b83ed0c 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3075,7 +3075,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,23}" "{i,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25}" "{o,o,o}" "{procpid,state,sent_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,16,25,25,25,25}" "{o,o,o,o,o,o,o}" "{procpid,state,sync,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 4cdb15f..9a00b2c 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -73,7 +73,7 @@ typedef struct
typedef enum CAC_state
{
CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
- CAC_WAITBACKUP
+ CAC_WAITBACKUP, CAC_REPLICATION_ONLY
} CAC_state;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
new file mode 100644
index 0000000..a071b9a
--- /dev/null
+++ b/src/include/replication/syncrep.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _SYNCREP_H
+#define _SYNCREP_H
+
+#include "access/xlog.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+#define SyncRepRequested() (sync_rep_mode)
+#define StandbyOffersSyncRepService() (sync_rep_service)
+
+/*
+ * There is no reply from standby to primary for async mode, so the reply
+ * message needs one less slot than the maximum number of modes
+ */
+#define NUM_SYNC_REP_WAIT_MODES 1
+
+extern XLogRecPtr ReplyLSN[NUM_SYNC_REP_WAIT_MODES];
+
+/*
+ * Each synchronous rep wait mode has one SyncRepWaitQueue in shared memory.
+ * These queues live in the WAL sender shmem area.
+ */
+typedef struct SyncRepQueue
+{
+ /*
+ * Current location of the head of the queue. Nobody should be waiting
+ * on the queue for an lsn equal to or earlier than this value. Procs
+ * on the queue will always be later than this value, though we don't
+ * record those values here.
+ */
+ XLogRecPtr lsn;
+
+ PGPROC *head;
+ PGPROC *tail;
+
+ slock_t qlock; /* locks shared variables shown above */
+} SyncRepQueue;
+
+/* user-settable parameters for synchronous replication */
+extern bool sync_rep_mode;
+extern int sync_rep_timeout_client;
+extern int sync_rep_timeout_server;
+extern bool sync_rep_service;
+
+extern bool hot_standby_feedback;
+
+/* called by user backend */
+extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* called by wal sender */
+extern void SyncRepReleaseWaiters(bool timeout);
+extern void SyncRepTimeoutExceeded(void);
+
+/* callback at exit */
+extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 1993851..8a7101a 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -40,6 +40,47 @@ typedef struct
} WalDataMessageHeader;
/*
+ * Reply message from standby (message type 'r'). This is wrapped within
+ * a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef struct
+{
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to offer
+ * a valid reply for data that has only been written, not fsynced.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side does not support apply,
+ * or does not choose to apply records, as yet.
+ */
+ XLogRecPtr apply;
+
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side does not support feedback,
+ * or Hot Standby is not yet available.
+ */
+ TransactionId xmin;
+
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} StandbyReplyMessage;
+
+/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
*
* We don't have a good idea of what a good value would be; there's some
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 24ad438..a6afec4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -13,6 +13,8 @@
#define _WALRECEIVER_H
#include "access/xlogdefs.h"
+#include "replication/syncrep.h"
+#include "storage/latch.h"
#include "storage/spin.h"
#include "pgtime.h"
@@ -71,6 +73,11 @@ typedef struct
*/
char conninfo[MAXCONNINFO];
+ /*
+ * Latch used by aux procs to wake up walreceiver when it has work to do.
+ */
+ Latch latch;
+
slock_t mutex; /* locks shared variables shown above */
} WalRcvData;
@@ -92,6 +99,7 @@ extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
/* prototypes for functions in walreceiver.c */
extern void WalReceiverMain(void);
+extern void WalRcvWakeup(void);
/* prototypes for functions in walreceiverfuncs.c */
extern Size WalRcvShmemSize(void);
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index bd9e193..5594127 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -15,6 +15,7 @@
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+#include "replication/syncrep.h"
#include "storage/spin.h"
@@ -35,18 +36,63 @@ typedef struct WalSnd
WalSndState state; /* this walsender's state */
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
- slock_t mutex; /* locks shared variables shown above */
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr apply;
+
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ TransactionId xmin;
/*
* Latch used by backends to wake up this walsender when it has work
* to do.
*/
Latch latch;
+
+ /*
+ * Highest level of sync rep available from this standby.
+ */
+ bool sync_rep_service;
+
+ slock_t mutex; /* locks shared variables shown above */
+
} WalSnd;
+extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Sync rep wait queues with one queue per request type.
+ * We use one queue per request type so that we can maintain the
+ * invariant that the individual queues are sorted on LSN.
+ * This may also help performance when multiple wal senders
+ * offer different sync rep service levels.
+ */
+ SyncRepQueue sync_rep_queue[NUM_SYNC_REP_WAIT_MODES];
+
+ bool sync_rep_service_available;
+
+ slock_t ctlmutex; /* locks shared variables shown above */
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
@@ -60,6 +106,7 @@ extern volatile sig_atomic_t walsender_ready_to_stop;
/* user-settable parameters */
extern int WalSndDelay;
extern int max_wal_senders;
+extern bool allow_standalone_primary;
extern int WalSenderMain(void);
extern void WalSndSignals(void);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 2deff72..84b91b3 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -29,6 +29,7 @@ typedef enum
PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */
PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */
PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
+ PMSIGNAL_SYNC_REPLICATION_ACTIVE, /* walsender has completed handshake */
NUM_PMSIGNALS /* Must be last value of enum! */
} PMSignalReason;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 78dbade..27b57c8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,8 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
@@ -115,6 +117,11 @@ struct PGPROC
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+ bool ownLatch; /* do we own the above latch? */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 72e5630..b070340 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1296,7 +1296,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sent_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sync, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sync, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Sat, Jan 15, 2011 at 22:40, Simon Riggs <simon@2ndquadrant.com> wrote:
Here's the latest patch for sync rep.
From here, I will be developing the patch further on public git
repository towards commit. My expectation is that commit is at least 2
That's great. Just one tiny detail - which repository and which branch? ;)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
(grr, I wrote this on Monday already, but just found it in my drafts
folder, unsent)
On 15.01.2011 23:40, Simon Riggs wrote:
Here's the latest patch for sync rep.
From here, I will be developing the patch further on public git
repository towards commit. My expectation is that commit is at least 2
weeks away, though there are no major unresolved problems. I expect
essential follow on patches to continue for a further 2-4 weeks after
that first commit.
Thanks! Some quick observations after first read-through:
* The docs for synchronous_replication still claim that it means two
different things in master and standby. Looking at the code, I believe
that's not true anymore.
* it seems like overkill to not let clients to even connect when
allow_standalone_primary=off and no synchronous standbys are available.
What if you just want to run a read-only query?
* Please separate the hot standby feedback loop into a separate patch on
top of the synch rep patch. I know it's not a lot of code, but it's
still easier to handle features separately.
* The UI differs from what was agreed on here:
http://archives.postgresql.org/message-id/4D1DCF5A.7070808@enterprisedb.com.
* Instead of the short-circuit for autovacuum in SyncRepWaitOnQueue(),
it's probably better to set synchronous_commit=off locally when the
autovacuum process starts.
* the "queue id" thing is dead code at the moment, as there is only one
queue. I gather this is a leftover from having different queues for
"apply", "sync", "write" modes, but I think it would be better to just
remove it for now.
PS, I'm surprised how small this patch is. Thinking about it some more,
I don't know why I expected this to be a big patch.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2011-01-21 at 14:45 +0200, Heikki Linnakangas wrote:
(grr, I wrote this on Monday already, but just found it in my drafts
folder, unsent)
No worries, thanks for commenting.
Thanks! Some quick observations after first read-through:
* The docs for synchronous_replication still claim that it means two
different things in master and standby. Looking at the code, I believe
that's not true anymore.
Probably. The docs changed so many times I had gone "code-blind".
* it seems like overkill to not let clients to even connect when
allow_standalone_primary=off and no synchronous standbys are available.
What if you just want to run a read-only query?
That's what Aidan requested, I agreed and so its there. You're using
sync rep because of writes, so you have a read-write app. If you allow
connections then half of the app will work, half will not. Half-working
isn't very useful, as Aidan eloquently explained. If your app is all
read-only you wouldn't be using sync rep anyway. That's the argument,
but I've not got especially strong feelings it has to be this way.
Perhaps discuss that on a separate thread? See what everyone thinks?
* Please separate the hot standby feedback loop into a separate patch on
top of the synch rep patch. I know it's not a lot of code, but it's
still easier to handle features separately.
I tried to do that initially, but there is interaction between those
features. The way I have it is that the replies from the standby act as
keepalives to the master. So the hot standby feedback is just an extra
parameter and an extra field. Removing that doesn't really make the
patch any easier to understand.
* The UI differs from what was agreed on here:
http://archives.postgresql.org/message-id/4D1DCF5A.7070808@enterprisedb.com.
You mean synchronous_standbys is not there yet? Yes, I know. It can be
added after we commit this, its only a small bit of code and no
dependencies. I figured we had bigger things to agree first.
* Instead of the short-circuit for autovacuum in SyncRepWaitOnQueue(),
it's probably better to set synchronous_commit=off locally when the
autovacuum process starts.
Even better plan, thanks.
* the "queue id" thing is dead code at the moment, as there is only one
queue. I gather this is a leftover from having different queues for
"apply", "sync", "write" modes, but I think it would be better to just
remove it for now.
It's a trivial patch to add options to either fsync or apply, so I was
expecting to add that back in this release also.
PS, I'm surprised how small this patch is. Thinking about it some more,
I don't know why I expected this to be a big patch.
Yes, it's the decisions which seem fairly big this time.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 21, 2011 at 7:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
* it seems like overkill to not let clients to even connect when
allow_standalone_primary=off and no synchronous standbys are available. What
if you just want to run a read-only query?
For what it's worth, +1.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 21, 2011 at 14:24, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, 2011-01-21 at 14:45 +0200, Heikki Linnakangas wrote:
* it seems like overkill to not let clients to even connect when
allow_standalone_primary=off and no synchronous standbys are available.
What if you just want to run a read-only query?That's what Aidan requested, I agreed and so its there. You're using
sync rep because of writes, so you have a read-write app. If you allow
connections then half of the app will work, half will not. Half-working
isn't very useful, as Aidan eloquently explained. If your app is all
read-only you wouldn't be using sync rep anyway. That's the argument,
but I've not got especially strong feelings it has to be this way.Perhaps discuss that on a separate thread? See what everyone thinks?
I'll respond here once, and we'll see if more people want to comment
then we can move it :-)
Doesn't this make a pretty strange assumption - namely that you have a
single application? We support multiple databases, and multiple users,
and multiple pretty much anything - in most cases, people deploy
multiple apps. (They may well be part of the same "solution" or
whatever you want to call it, but parts may well be readonly - like a
reporting app, or even just a monitoring client)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 21.01.2011 15:24, Simon Riggs wrote:
On Fri, 2011-01-21 at 14:45 +0200, Heikki Linnakangas wrote:
* it seems like overkill to not let clients to even connect when
allow_standalone_primary=off and no synchronous standbys are available.
What if you just want to run a read-only query?That's what Aidan requested, I agreed and so its there. You're using
sync rep because of writes, so you have a read-write app. If you allow
connections then half of the app will work, half will not. Half-working
isn't very useful, as Aidan eloquently explained. If your app is all
read-only you wouldn't be using sync rep anyway. That's the argument,
but I've not got especially strong feelings it has to be this way.
It's also possible that most of your transactions in fact do "set
synchronous_replication=off", and only a few actually do synchronous
replication. It would be pretty bad to not allow connections in that
case. And what if you want to connect to the server to diagnose the
issue? Oh, you can't... Besides, we're not kicking out existing
connections, are we? Seems inconsistent to let the old connections live.
IMHO the only reasonable option is to allow connections as usual, and
only fail (or block forever) at COMMIT.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Jan 21, 2011 at 10:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
It's also possible that most of your transactions in fact do "set
synchronous_replication=off", and only a few actually do synchronous
replication. It would be pretty bad to not allow connections in that case.
And what if you want to connect to the server to diagnose the issue? Oh, you
can't... Besides, we're not kicking out existing connections, are we? Seems
inconsistent to let the old connections live.IMHO the only reasonable option is to allow connections as usual, and only
fail (or block forever) at COMMIT.
Another point is that the synchronous standby could come back at any
time. There's no reason not to let the client do all the work they
want up until the commit - maybe the standby will pop back up before
the COMMIT actually issued. Or even if it doesn't, as soon as it pops
back up, all those COMMITs get released.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2011-01-21 at 17:33 +0200, Heikki Linnakangas wrote:
On 21.01.2011 15:24, Simon Riggs wrote:
On Fri, 2011-01-21 at 14:45 +0200, Heikki Linnakangas wrote:
* it seems like overkill to not let clients to even connect when
allow_standalone_primary=off and no synchronous standbys are available.
What if you just want to run a read-only query?That's what Aidan requested, I agreed and so its there. You're using
sync rep because of writes, so you have a read-write app. If you allow
connections then half of the app will work, half will not. Half-working
isn't very useful, as Aidan eloquently explained. If your app is all
read-only you wouldn't be using sync rep anyway. That's the argument,
but I've not got especially strong feelings it has to be this way.It's also possible that most of your transactions in fact do "set
synchronous_replication=off", and only a few actually do synchronous
replication. It would be pretty bad to not allow connections in that
case. And what if you want to connect to the server to diagnose the
issue? Oh, you can't... Besides, we're not kicking out existing
connections, are we? Seems inconsistent to let the old connections live.IMHO the only reasonable option is to allow connections as usual, and
only fail (or block forever) at COMMIT.
We all think our own proposed options are the only reasonable thing, but
that helps us not at all in moving forwards. I've put much time into
delivering options many other people want, so there is a range of
function. I think we should hear from Aidan first before we decide to
remove that aspect.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, 2011-01-21 at 14:34 +0100, Magnus Hagander wrote:
On Fri, Jan 21, 2011 at 14:24, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, 2011-01-21 at 14:45 +0200, Heikki Linnakangas wrote:
* it seems like overkill to not let clients to even connect when
allow_standalone_primary=off and no synchronous standbys are available.
What if you just want to run a read-only query?That's what Aidan requested, I agreed and so its there. You're using
sync rep because of writes, so you have a read-write app. If you allow
connections then half of the app will work, half will not. Half-working
isn't very useful, as Aidan eloquently explained. If your app is all
read-only you wouldn't be using sync rep anyway. That's the argument,
but I've not got especially strong feelings it has to be this way.Perhaps discuss that on a separate thread? See what everyone thinks?
I'll respond here once, and we'll see if more people want to comment
then we can move it :-)Doesn't this make a pretty strange assumption - namely that you have a
single application? We support multiple databases, and multiple users,
and multiple pretty much anything - in most cases, people deploy
multiple apps. (They may well be part of the same "solution" or
whatever you want to call it, but parts may well be readonly - like a
reporting app, or even just a monitoring client)
There are various problems whatever we do. If we don't like one way, we
must balance that by judging what happens if we do things the other way.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 21, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
We all think our own proposed options are the only reasonable thing, but
that helps us not at all in moving forwards. I've put much time into
delivering options many other people want, so there is a range of
function. I think we should hear from Aidan first before we decide to
remove that aspect.
Since invited, I'll describe what I *want* do to do. I understand I
may not get it ;-)
When no sync slave is connected, yes, I want to stop things hard. I
don't mind read-only queries working, but what I want to avoid (if
possible) is having the master do lots of inserts/updates/deletes for
clients, fsyncing them all to disk (so on some strange event causing
recovery they'll be considered commit) and just delay the commit
return until it has a valid sync slave connected and caught up again.
And *I*'ld prefer if client transactions get errors right away rather
than begin to hang if a sync slave is not connected.
Even with single server, there's the window where stuff could be
"committed" but the client not notified yet. And that leads to
transactions which need to be verified. And with sync rep, that
window get's a little larger. But I'ld prefer not to make it a hanger
door, *especially* when it gets flung open at the point where the shit
has hit the fan and we're in the midst of switching over to manual
processing...
So, in my case, I'ld like it if PG couldn't do anything to generate
any user-initiated WAL unless there is a sync slave connected. Yes, I
understand that leads to hard-fail, and yes, I understand I'm in the
minority, maybe almost singular in that desire.
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Fri, Jan 21, 2011 at 12:23 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
On Fri, Jan 21, 2011 at 11:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
We all think our own proposed options are the only reasonable thing, but
that helps us not at all in moving forwards. I've put much time into
delivering options many other people want, so there is a range of
function. I think we should hear from Aidan first before we decide to
remove that aspect.Since invited, I'll describe what I *want* do to do. I understand I
may not get it ;-)When no sync slave is connected, yes, I want to stop things hard. I
don't mind read-only queries working, but what I want to avoid (if
possible) is having the master do lots of inserts/updates/deletes for
clients, fsyncing them all to disk (so on some strange event causing
recovery they'll be considered commit) and just delay the commit
return until it has a valid sync slave connected and caught up again.
And *I*'ld prefer if client transactions get errors right away rather
than begin to hang if a sync slave is not connected.Even with single server, there's the window where stuff could be
"committed" but the client not notified yet. And that leads to
transactions which need to be verified. And with sync rep, that
window get's a little larger. But I'ld prefer not to make it a hanger
door, *especially* when it gets flung open at the point where the shit
has hit the fan and we're in the midst of switching over to manual
processing...So, in my case, I'ld like it if PG couldn't do anything to generate
any user-initiated WAL unless there is a sync slave connected. Yes, I
understand that leads to hard-fail, and yes, I understand I'm in the
minority, maybe almost singular in that desire.
What you're proposing is to fail things earlier than absolutely
necessary (when they try to XLOG, rather than at commit) but still
later than what I think Simon is proposing (not even letting them log
in).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jan 21, 2011 at 12:23 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
When no sync slave is connected, yes, I want to stop things hard.
What you're proposing is to fail things earlier than absolutely
necessary (when they try to XLOG, rather than at commit) but still
later than what I think Simon is proposing (not even letting them log
in).
I can't see a reason to disallow login, because read-only transactions
can still run in such a situation --- and, indeed, might be fairly
essential if you need to inspect the database state on the way to fixing
the replication problem. (Of course, we've already had the discussion
about it being a terrible idea to configure replication from inside the
database, but that doesn't mean there might not be views or status you
would wish to look at.)
regards, tom lane
On Fri, Jan 21, 2011 at 1:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jan 21, 2011 at 12:23 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
When no sync slave is connected, yes, I want to stop things hard.
What you're proposing is to fail things earlier than absolutely
necessary (when they try to XLOG, rather than at commit) but still
later than what I think Simon is proposing (not even letting them log
in).I can't see a reason to disallow login, because read-only transactions
can still run in such a situation --- and, indeed, might be fairly
essential if you need to inspect the database state on the way to fixing
the replication problem. (Of course, we've already had the discussion
about it being a terrible idea to configure replication from inside the
database, but that doesn't mean there might not be views or status you
would wish to look at.)
And just disallowing new logins is probably not even enough, because
it allows current logged in clients "forward progress", leading
towards an eventual hang (with now committed data on the master).
Again, I'm trying to stop "forward progress" as soon as possible when
a sync slave isn't replicating. And I'ld like clients to fail with
errors sooner (hopefully they get to the commit point) rather than
accumulate the WAL synced to the master and just wait at the commit.
So I think that's a more complete picture of my quick "not do anything
with no synchronous slave replicating" that I think was what led to
the no-login approach.
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Fri, Jan 21, 2011 at 1:09 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
On Fri, Jan 21, 2011 at 1:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Fri, Jan 21, 2011 at 12:23 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
When no sync slave is connected, yes, I want to stop things hard.
What you're proposing is to fail things earlier than absolutely
necessary (when they try to XLOG, rather than at commit) but still
later than what I think Simon is proposing (not even letting them log
in).I can't see a reason to disallow login, because read-only transactions
can still run in such a situation --- and, indeed, might be fairly
essential if you need to inspect the database state on the way to fixing
the replication problem. (Of course, we've already had the discussion
about it being a terrible idea to configure replication from inside the
database, but that doesn't mean there might not be views or status you
would wish to look at.)And just disallowing new logins is probably not even enough, because
it allows current logged in clients "forward progress", leading
towards an eventual hang (with now committed data on the master).Again, I'm trying to stop "forward progress" as soon as possible when
a sync slave isn't replicating. And I'ld like clients to fail with
errors sooner (hopefully they get to the commit point) rather than
accumulate the WAL synced to the master and just wait at the commit.So I think that's a more complete picture of my quick "not do anything
with no synchronous slave replicating" that I think was what led to
the no-login approach.
Well, stopping all WAL activity with an error sounds *more* reasonable
than refusing all logins, but I'm not personally sold on it. For
example, a brief network disruption on the connection between master
and standby would cause the master to grind to a halt... and then
almost immediately resume operations. More generally, if you have
short-running transactions, there's not much difference between
wait-at-commit and wait-at-WAL, and if you have long-running
transactions, then wait-at-WAL might be gumming up the works more than
necessary.
One idea might be to wait both before and after commit. If
allow_standalone_primary is off, and a commit is attempted, we check
whether there's a slave connected, and if not, wait for one to
connect. Then, we write and sync the commit WAL record. Next, we
wait for the WAL to be ack'd. Of course, the standby might disappear
between the first check and the second, but it would greatly reduce
the possibility of the master being ahead of the standby after a
crash, which might be useful for some people.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 21, 2011 at 1:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Again, I'm trying to stop "forward progress" as soon as possible when
a sync slave isn't replicating. And I'ld like clients to fail with
errors sooner (hopefully they get to the commit point) rather than
accumulate the WAL synced to the master and just wait at the commit.
Well, stopping all WAL activity with an error sounds *more* reasonable
than refusing all logins, but I'm not personally sold on it. For
example, a brief network disruption on the connection between master
and standby would cause the master to grind to a halt... and then
almost immediately resume operations.
Yup. And I'm OK with that. In my case, it would be much better to
have a few quick failures, which can complete automatically a few
seconds later then to have a big buildup of transactions to re-verify
by hand upon starting manual processing.
But again, I'll stress that I'm talking about whe the master has no
sync slave connected. a "brief netowrk disruption" between the
master/slave isn't likely going to disconnect the slave. TCP is
pretty good at handling those. If the master thinks it has a sync
slave connected, I'm fine with it continuing to queue WAL for it even
if it's lagging noticeably.
More generally, if you have
short-running transactions, there's not much difference between
wait-at-commit and wait-at-WAL, and if you have long-running
transactions, then wait-at-WAL might be gumming up the works more than
necessary.
Again, when there is not sync slave *connected*, I don't want to wait
*at all*. I want to fail ASAP. If there is a sync slave, and it's
just slow, I don't really care where it waits.
From my experience, if the slave is not connected (i.e TCP connection
has been disconnected), then we're in something like:
1) Proper slave shutdown: pilot error here stopping it if the master requires it
2) Master start, slave not connected yet: I'm fine with getting
errors here... We *hope* a slave will be here soon, but...
3) network has seperated master/slave: TCP means it's been like this
for a long time already...
4) Slave hardware/os low-level hang/crash: TCP means it's been like
this for a while already before master's os tears down the connection
5) Slave has crashed (or rebooted) and slave OS has closed/rejected
our TCP connection
In all of these, I'ld love for my master not to be generating WAL and
letting clients think they are making progress. And I'm hoping that
for #3 & 4 above, PG will have keepalive type traffic that will
prevent me from queing WAL for normal TCP connection time values.
One idea might be to wait both before and after commit. If
allow_standalone_primary is off, and a commit is attempted, we check
whether there's a slave connected, and if not, wait for one to
connect. Then, we write and sync the commit WAL record. Next, we
wait for the WAL to be ack'd. Of course, the standby might disappear
between the first check and the second, but it would greatly reduce
the possibility of the master being ahead of the standby after a
crash, which might be useful for some people.
Ya, but that becomes much more expensive. Instead of it just being a
"write WAL, fsync WAL, send WAL, wait for slave", it becomes "write
WAL, fsync WAL, send WAL, wait for slave fsync, write WAL, fsync WAL,
send WAL, wait for slave fsync". And it's expense is all the time,
rather than just when the "no slave no go" situations arise.
And it doesn't reduce the transactions I need to verify by hand
either, because that waiting/error still only happens at the COMMIT
statement from the client.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Fri, Jan 21, 2011 at 1:59 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
Yup. And I'm OK with that. In my case, it would be much better to
have a few quick failures, which can complete automatically a few
seconds later then to have a big buildup of transactions to re-verify
by hand upon starting manual processing.
Why would you need to do that?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2011-01-21 at 13:32 -0500, Robert Haas wrote:
One idea might be to wait both before and after commit. If
allow_standalone_primary is off, and a commit is attempted, we check
whether there's a slave connected, and if not, wait for one to
connect. Then, we write and sync the commit WAL record. Next, we
wait for the WAL to be ack'd. Of course, the standby might disappear
between the first check and the second, but it would greatly reduce
the possibility of the master being ahead of the standby after a
crash, which might be useful for some people.
I like this idea.
I think it would be too invasive to make a check before we insert each
WAL record, as Aidan suggests. Even if we did that, you aren't protected
when a standby goes down because you'll still have written half a
transaction and still be waiting.
So I propose that
if (!allow_standalone_primary)
ConfirmSyncRepAvailable();
before PreCommit_Notify(). That puts transaction into a wait state that
lasts until a sync rep standby is available. Note that it is before the
actual commit, so if we decide we need to we can cancel those
transactions and have them properly abort.
I won't add that code yet, in case better ideas emerge.
There is no support for preventing connections at startup, so I will
remove that completely, now.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jan 22, 2011 at 8:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, 2011-01-21 at 13:32 -0500, Robert Haas wrote:
One idea might be to wait both before and after commit. If
allow_standalone_primary is off, and a commit is attempted, we check
whether there's a slave connected, and if not, wait for one to
connect. Then, we write and sync the commit WAL record. Next, we
wait for the WAL to be ack'd. Of course, the standby might disappear
between the first check and the second, but it would greatly reduce
the possibility of the master being ahead of the standby after a
crash, which might be useful for some people.I like this idea.
I think it would be too invasive to make a check before we insert each
WAL record, as Aidan suggests. Even if we did that, you aren't protected
when a standby goes down because you'll still have written half a
transaction and still be waiting.So I propose that
if (!allow_standalone_primary)
ConfirmSyncRepAvailable();before PreCommit_Notify(). That puts transaction into a wait state that
lasts until a sync rep standby is available. Note that it is before the
actual commit, so if we decide we need to we can cancel those
transactions and have them properly abort.I won't add that code yet, in case better ideas emerge.
There is no support for preventing connections at startup, so I will
remove that completely, now.
Time's running short - do you have an updated patch?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jan 30, 2011 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Time's running short - do you have an updated patch?
This patch hasn't been updated in more than three weeks. I assume
this should now be marked Returned with Feedback, and we'll revisit
synchronous replication for 9.2?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Sun, Jan 30, 2011 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Time's running short - do you have an updated patch?
This patch hasn't been updated in more than three weeks. I assume
this should now be marked Returned with Feedback, and we'll revisit
synchronous replication for 9.2?
Seems it is time for someone else to take the patch and complete it?
Who can do that?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, 2011-02-07 at 09:55 -0500, Robert Haas wrote:
On Sun, Jan 30, 2011 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Time's running short - do you have an updated patch?
This patch hasn't been updated in more than three weeks. I assume
this should now be marked Returned with Feedback, and we'll revisit
synchronous replication for 9.2?
Hi Robert,
I have time to complete that in next two weeks, but you are right I
haven't had it in last few weeks.
Cheers
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Feb 7, 2011 at 11:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2011-02-07 at 09:55 -0500, Robert Haas wrote:
On Sun, Jan 30, 2011 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Time's running short - do you have an updated patch?
This patch hasn't been updated in more than three weeks. I assume
this should now be marked Returned with Feedback, and we'll revisit
synchronous replication for 9.2?I have time to complete that in next two weeks, but you are right I
haven't had it in last few weeks.
Well, the current CommitFest ends in one week, and we need to leave
time for someone (Heikki, most likely) to review, so there's really
only a couple of days left.
Bruce's suggestion of having someone else pick it up seems like it
might work. The obvious candidates are probably Heikki Linnakangas,
Tom Lane, Fujii Masao, and (if you squint a little) me. I am not
clear that any of those people have the necessary time available
immediately, however.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 7, 2011 at 12:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 11:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2011-02-07 at 09:55 -0500, Robert Haas wrote:
On Sun, Jan 30, 2011 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Time's running short - do you have an updated patch?
This patch hasn't been updated in more than three weeks. I assume
this should now be marked Returned with Feedback, and we'll revisit
synchronous replication for 9.2?I have time to complete that in next two weeks, but you are right I
haven't had it in last few weeks.Well, the current CommitFest ends in one week, and we need to leave
time for someone (Heikki, most likely) to review, so there's really
only a couple of days left.Bruce's suggestion of having someone else pick it up seems like it
might work. The obvious candidates are probably Heikki Linnakangas,
Tom Lane, Fujii Masao, and (if you squint a little) me. I am not
clear that any of those people have the necessary time available
immediately, however.
I just spoke to my manager at EnterpriseDB and he cleared my schedule
for the next two days to work on this. So I'll go hack on this now.
I haven't read the patch yet so I don't know for sure how quite I'll
be able to get up to speed on it, so if someone who is more familiar
with this code wants to grab the baton away from me, feel free.
Otherwise, I'll see what I can do with it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.
regards, tom lane
On Mon, Feb 7, 2011 at 12:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.
Uh, we did that with 8.4 and it was a disaster. The CommitFest lasted
*five months*. We've been doing schedule-based CommitFests ever since
and it's worked much better. I agree it's unfair to reject things
without looking at them, and I'd like to avoid that if at all
possible, but punting things because they need more work than can be
done in the time available is another thing entirely. I do NOT want
to still be working on the items for this CommitFest in June - that's
about when I'd like to be releasing beta3.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 2011-02-07 at 12:39 -0500, Robert Haas wrote:
I just spoke to my manager at EnterpriseDB and he cleared my schedule
for the next two days to work on this. So I'll go hack on this now.
I haven't read the patch yet so I don't know for sure how quite I'll
be able to get up to speed on it, so if someone who is more familiar
with this code wants to grab the baton away from me, feel free.
Otherwise, I'll see what I can do with it.
Presumably you have a reason for declaring war? I'm sorry for that, I
really am.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Feb 7, 2011 at 12:59 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2011-02-07 at 12:39 -0500, Robert Haas wrote:
I just spoke to my manager at EnterpriseDB and he cleared my schedule
for the next two days to work on this. So I'll go hack on this now.
I haven't read the patch yet so I don't know for sure how quite I'll
be able to get up to speed on it, so if someone who is more familiar
with this code wants to grab the baton away from me, feel free.
Otherwise, I'll see what I can do with it.Presumably you have a reason for declaring war? I'm sorry for that, I
really am.
What the hell are you talking about?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Here's the latest patch for sync rep.
Here is a rebased version of this patch which applies to head of the
master branch. I haven't tested it yet beyond making sure that it
compiles and passes the regression tests -- but this fixes the bitrot.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
syncrep-v9.1.patchapplication/octet-stream; name=syncrep-v9.1.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d2a6445..f5a8e63 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,8 +2006,122 @@ SET ENABLE_SEQSCAN TO OFF;
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
+ <para>
+ You should also consider setting <varname>hot_standby_feedback</>
+ as an alternative to using this parameter.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until the
+ first reply from any standby. Multiple standby servers allow
+ increased availability and possibly increase performance as well.
+ </para>
+ <para>
+ The parameter must be set on both primary and standby.
+ </para>
+ <para>
+ On the primary, this parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ <para>
+ On the standby, the parameter value is taken only at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-allow-standalone-primary" xreflabel="allow_standalone_primary">
+ <term><varname>allow_standalone_primary</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>allow_standalone_primary</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If <varname>allow_standalone_primary</> is set, then the server
+ can operate normally whether or not replication is active. If
+ a client requests <varname>synchronous_replication</> and it is
+ not available, they will use asynchornous replication instead.
+ </para>
+ <para>
+ If <varname>allow_standalone_primary</> is not set, then the server
+ will prevent normal client connections until a standby connects that
+ has <varname>synchronous_replication_feedback</> enabled. Once
+ clients connect, if they request <varname>synchronous_replication</>
+ and it is no longer available they will wait for
+ <varname>replication_timeout_client</>.
+ </para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-replication-timeout-client" xreflabel="replication_timeout_client">
+ <term><varname>replication_timeout_client</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_client</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and a synchronous standby is currently available
+ then the commit will wait for up to <varname>replication_timeout_client</>
+ seconds before it returns a <quote>success</>. The commit will wait
+ forever for a confirmation when <varname>replication_timeout_client</>
+ is set to -1.
+ </para>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and yet no synchronous standby is available when we commit, then the
+ setting of <varname>allow_standalone_primary</> determines whether
+ or not we wait.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-replication-timeout-server" xreflabel="replication_timeout_server">
+ <term><varname>replication_timeout_server</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_server</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the primary server does not receive a reply from a standby server
+ within <varname>replication_timeout_server</> seconds then the
+ primary will terminate the replication connection.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
@@ -2098,6 +2212,42 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
+ <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>hot_standby_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether or not a hot standby will send feedback to the primary
+ about queries currently executing on the standby. This parameter can
+ be used to eliminate query cancels caused by cleanup records, though
+ it can cause database bloat on the primary for some workloads.
+ The default value is <literal>off</literal>.
+ This parameter can only be set at server start. It only has effect
+ if <varname>hot_standby</> is enabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replication-feedback" xreflabel="synchronous_replication_feedback">
+ <term><varname>synchronous_replication_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether the standby will provide reply messages to
+ allow synchronous replication on the primary.
+ Reasons for doing this might be that the standby is physically
+ co-located with the primary and so would be a bad choice as a
+ future primary server, or the standby might be a test server.
+ The default value is <literal>on</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
</sect1>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 94d5ae8..5ee77ad 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -738,13 +738,12 @@ archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
</para>
<para>
- Streaming replication is asynchronous, so there is still a small delay
+ There is a small replication delay
between committing a transaction in the primary and for the changes to
become visible in the standby. The delay is however much smaller than with
file-based log shipping, typically under one second assuming the standby
is powerful enough to keep up with the load. With streaming replication,
- <varname>archive_timeout</> is not required to reduce the data loss
- window.
+ <varname>archive_timeout</> is not required.
</para>
<para>
@@ -879,6 +878,236 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover. That could be zero, or more, we do not know for certain
+ either way, when using asynchronous replication.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to at least one remote
+ standby server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ Synchronous replication works in the following way. When requested,
+ the commit of a write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability if the
+ sysadmin is cautious about the placement and management of the two servers.
+ Waiting for confirmation increases the user's confidence that the changes
+ will not be lost in the event of server crashes but it also necessarily
+ increases the response time for the requesting transaction. The minimum
+ wait time is the roundtrip time between primary to standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only final top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ Synchronous replication will be active if appropriate options are
+ enabled on both the primary and at least one standby server. If
+ options are not correctly set on both servers, the primary will use
+ use asynchronous replication by default.
+ </para>
+
+ <para>
+ On the primary server we need to set
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+ and on the standby server we need to set
+
+<programlisting>
+synchronous_replication_feedback = on
+</programlisting>
+
+ On the primary, <varname>synchronous_replication</> can be set
+ for particular users or databases, or dynamically by applications
+ programs. On the standby, <varname>synchronous_replication_feedback</>
+ can only be set at server start.
+ </para>
+
+ <para>
+ If more than one standby server
+ specifies <varname>synchronous_replication_feedback</>, then whichever
+ standby replies first will release waiting commits.
+ Turning this setting off for a standby allows the administrator to
+ exclude certain standby servers from releasing waiting transactions.
+ This is useful if not all standby servers are designated as potential
+ future primary servers, such as if a standby were co-located
+ with the primary, so that a disaster would cause both servers to be lost.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ The easiest and safest method of gaining High Availability using
+ synchronous replication is to configure at least two standby servers.
+ To understand why, we need to examine what can happen when you lose all
+ standby servers.
+ </para>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until at
+ least one standby responds. The response may never occur if the last,
+ or only, standby should crash or the network drops. What should we do in
+ that situation?
+ </para>
+
+ <para>
+ Sitting and waiting will typically cause operational problems
+ because it is an effective outage of the primary server should all
+ sessions end up waiting. In contrast, allowing the primary server to
+ continue processing write transactions in the absence of a standby
+ puts those latest data changes at risk. So in this situation there
+ is a direct choice between database availability and the potential
+ durability of the data it contains. How we handle this situation
+ is controlled by <varname>allow_standalone_primary</>. The default
+ setting is <literal>on</>, allowing processing to continue, though
+ there is no recommended setting. Choosing the best setting for
+ <varname>allow_standalone_primary</> is a difficult decision and best
+ left to those with combined business responsibility for both data and
+ applications. The difficulty of this choice is the reason why we
+ recommend that you reduce the possibility of this situation occurring
+ by using multiple standby servers.
+ </para>
+
+ <para>
+ A user will stop waiting once the <varname>replication_timeout_client</>
+ has been reached for their specific session. Users are not waiting for
+ a specific standby to reply, they are waiting for a reply from any
+ standby, so the unavailability of any one standby is not significant
+ to a user. It is possible for user sessions to hit timeout even though
+ standbys are communicating normally. In that case, the setting of
+ <varname>replication_timeout</> is probably too low.
+ </para>
+
+ <para>
+ The standby sends regular status messages to the primary. If no status
+ messages have been received for <varname>replication_timeout_server</>
+ the primary server will assume the connection is dead and terminate it.
+ </para>
+
+ <para>
+ When the primary is started with <varname>allow_standalone_primary</>
+ enabled, the primary will not allow connections until a standby connects
+ that also has <varname>synchronous_replication</> enabled. This is a
+ convenience to ensure that we don't allow connections before write
+ transactions will return successfully.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it may not be properly
+ synchronized. The standby is only able to become a synchronous standby
+ once it has become synchronized, or "caught up" with the the primary.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shutdown, then the catch-up period
+ will increase according to the length of time the standby has been
+ down. You are advised to make sure <varname>allow_standalone_primary</>
+ is not set during the initial catch-up period.
+ </para>
+
+ <para>
+ If primary crashes while commits are waiting for acknowledgement, those
+ transactions will be marked fully committed if the primary database
+ recovers, no matter how <varname>allow_standalone_primary</> is set.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby. Hence this mechanism is technically
+ "semi synchronous" rather than "fully synchronous" replication. Note
+ that replication still not be fully synchronous even if we wait for
+ all standby servers, though this would reduce availability, as
+ described previously.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1393,11 +1622,18 @@ if (!triggered)
These conflicts are <emphasis>hard conflicts</> in the sense that queries
might need to be cancelled and, in some cases, sessions disconnected to resolve them.
The user is provided with several ways to handle these
- conflicts. Conflict cases include:
+ conflicts. Conflict cases in order of likely frequency are:
<itemizedlist>
<listitem>
<para>
+ Application of a vacuum cleanup record from WAL conflicts with
+ standby transactions whose snapshots can still <quote>see</> any of
+ the rows to be removed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Access Exclusive locks taken on the primary server, including both
explicit <command>LOCK</> commands and various <acronym>DDL</>
actions, conflict with table accesses in standby queries.
@@ -1417,14 +1653,8 @@ if (!triggered)
</listitem>
<listitem>
<para>
- Application of a vacuum cleanup record from WAL conflicts with
- standby transactions whose snapshots can still <quote>see</> any of
- the rows to be removed.
- </para>
- </listitem>
- <listitem>
- <para>
- Application of a vacuum cleanup record from WAL conflicts with
+ Buffer pin deadlock caused by
+ application of a vacuum cleanup record from WAL conflicts with
queries accessing the target page on the standby, whether or not
the data to be removed is visible.
</para>
@@ -1539,17 +1769,16 @@ if (!triggered)
<para>
Remedial possibilities exist if the number of standby-query cancellations
- is found to be unacceptable. The first option is to connect to the
- primary server and keep a query active for as long as needed to
- run queries on the standby. This prevents <command>VACUUM</> from removing
- recently-dead rows and so cleanup conflicts do not occur.
- This could be done using <xref linkend="dblink"> and
- <function>pg_sleep()</>, or via other mechanisms. If you do this, you
+ is found to be unacceptable. Typically the best option is to enable
+ <varname>hot_standby_feedback</>. This prevents <command>VACUUM</> from
+ removing recently-dead rows and so cleanup conflicts do not occur.
+ If you do this, you
should note that this will delay cleanup of dead rows on the primary,
which may result in undesirable table bloat. However, the cleanup
situation will be no worse than if the standby queries were running
- directly on the primary server, and you are still getting the benefit of
- off-loading execution onto the standby.
+ directly on the primary server. You are still getting the benefit
+ of off-loading execution onto the standby and the query may complete
+ faster than it would have done on the primary server.
<varname>max_standby_archive_delay</> must be kept large in this case,
because delayed WAL files might already contain entries that conflict with
the desired standby queries.
@@ -1563,7 +1792,8 @@ if (!triggered)
a high <varname>max_standby_streaming_delay</>. However it is
difficult to guarantee any specific execution-time window with this
approach, since <varname>vacuum_defer_cleanup_age</> is measured in
- transactions executed on the primary server.
+ transactions executed on the primary server. As of version 9.1, this
+ second option is much less likely to valuable.
</para>
<para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4fee9c3..e4607ac 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -56,6 +56,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -2027,6 +2028,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1e31e07..18e9ce1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -53,6 +54,7 @@
#include "utils/snapmgr.h"
#include "pg_trace.h"
+extern void WalRcvWakeup(void); /* we are only caller, so include directly */
/*
* User-tweakable parameters
@@ -1051,7 +1053,7 @@ RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
- if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
+ if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
@@ -1121,6 +1123,14 @@ RecordTransactionCommit(void)
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
@@ -4512,6 +4522,14 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
*/
if (XactCompletionForceSyncCommit(xlrec))
XLogFlush(lsn);
+
+ /*
+ * If this standby is offering sync_rep_service then signal WALReceiver,
+ * in case it needs to send a reply just for this commit on an
+ * otherwise quiet server.
+ */
+ if (sync_rep_service)
+ WalRcvWakeup();
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 25c7e06..4b29199 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -157,6 +158,11 @@ static XLogRecPtr LastRec;
* known, need to check the shared state".
*/
static bool LocalRecoveryInProgress = true;
+/*
+ * Local copy of SharedHotStandbyActive variable. False actually means "not
+ * known, need to check the shared state".
+ */
+static bool LocalHotStandbyActive = false;
/*
* Local state for XLogInsertAllowed():
@@ -402,6 +408,12 @@ typedef struct XLogCtlData
bool SharedRecoveryInProgress;
/*
+ * SharedHotStandbyActive indicates if we're still in crash or archive
+ * recovery. Protected by info_lck.
+ */
+ bool SharedHotStandbyActive;
+
+ /*
* recoveryWakeupLatch is used to wake up the startup process to
* continue WAL replay, if it is waiting for WAL to arrive or failover
* trigger file to appear.
@@ -4893,6 +4905,7 @@ XLOGShmemInit(void)
*/
XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
XLogCtl->SharedRecoveryInProgress = true;
+ XLogCtl->SharedHotStandbyActive = false;
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
@@ -5233,6 +5246,12 @@ readRecoveryCommandFile(void)
(errmsg("recovery command file \"%s\" specified neither primary_conninfo nor restore_command",
RECOVERY_COMMAND_FILE),
errhint("The database server will regularly poll the pg_xlog subdirectory to check for files placed there.")));
+
+ if (PrimaryConnInfo == NULL && sync_rep_service)
+ ereport(WARNING,
+ (errmsg("recovery command file \"%s\" specified synchronous_replication_service yet streaming was not requested",
+ RECOVERY_COMMAND_FILE),
+ errhint("Specify primary_conninfo to allow synchronous replication.")));
}
else
{
@@ -6074,6 +6093,13 @@ StartupXLOG(void)
StandbyRecoverPreparedTransactions(false);
}
}
+ else
+ {
+ /*
+ * No need to calculate feedback if we're not in Hot Standby.
+ */
+ hot_standby_feedback = false;
+ }
/* Initialize resource managers */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
@@ -6568,8 +6594,6 @@ StartupXLOG(void)
static void
CheckRecoveryConsistency(void)
{
- static bool backendsAllowed = false;
-
/*
* Have we passed our safe starting point?
*/
@@ -6589,11 +6613,19 @@ CheckRecoveryConsistency(void)
* enabling connections.
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
- !backendsAllowed &&
+ !LocalHotStandbyActive &&
reachedMinRecoveryPoint &&
IsUnderPostmaster)
{
- backendsAllowed = true;
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->SharedHotStandbyActive = true;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ LocalHotStandbyActive = true;
+
SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
}
}
@@ -6641,6 +6673,38 @@ RecoveryInProgress(void)
}
/*
+ * Is HotStandby active yet? This is only important in special backends
+ * since normal backends won't ever be able to connect until this returns
+ * true.
+ *
+ * Unlike testing standbyState, this works in any process that's connected to
+ * shared memory.
+ */
+bool
+HotStandbyActive(void)
+{
+ /*
+ * We check shared state each time only until Hot Standby is active. We
+ * can't de-activate Hot Standby, so there's no need to keep checking after
+ * the shared variable has once been seen true.
+ */
+ if (LocalHotStandbyActive)
+ return true;
+ else
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ /* spinlock is essential on machines with weak memory ordering! */
+ SpinLockAcquire(&xlogctl->info_lck);
+ LocalHotStandbyActive = xlogctl->SharedHotStandbyActive;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return LocalHotStandbyActive;
+ }
+}
+
+/*
* Is this process allowed to insert new WAL records?
*
* Ordinarily this is essentially equivalent to !RecoveryInProgress().
@@ -9029,6 +9093,25 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
}
/*
+ * Get latest redo apply position.
+ *
+ * Exported to allow WALReceiver to read the pointer directly.
+ */
+XLogRecPtr
+GetXLogReplayRecPtr(void)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ XLogRecPtr recptr;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ recptr = xlogctl->recoveryLastRecPtr;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return recptr;
+}
+
+/*
* Report the last WAL replay location (same format as pg_start_backup etc)
*
* This is useful for determining how much of WAL is visible to read-only
@@ -9037,14 +9120,10 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
Datum
pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
{
- /* use volatile pointer to prevent code rearrangement */
- volatile XLogCtlData *xlogctl = XLogCtl;
XLogRecPtr recptr;
char location[MAXFNAMELEN];
- SpinLockAcquire(&xlogctl->info_lck);
- recptr = xlogctl->recoveryLastRecPtr;
- SpinLockRelease(&xlogctl->info_lck);
+ recptr = GetXLogReplayRecPtr();
if (recptr.xlogid == 0 && recptr.xrecoff == 0)
PG_RETURN_NULL();
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 718e996..506e908 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -502,7 +502,11 @@ CREATE VIEW pg_stat_replication AS
S.client_port,
S.backend_start,
W.state,
- W.sent_location
+ W.sync,
+ W.sent_location,
+ W.write_location,
+ W.flush_location,
+ W.apply_location
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8f77d1b..1577875 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -275,6 +275,7 @@ typedef enum
PM_STARTUP, /* waiting for startup subprocess */
PM_RECOVERY, /* in archive recovery mode */
PM_HOT_STANDBY, /* in hot standby mode */
+ PM_WAIT_FOR_REPLICATION, /* waiting for sync replication to become active */
PM_RUN, /* normal "database is alive" state */
PM_WAIT_BACKUP, /* waiting for online backup mode to end */
PM_WAIT_READONLY, /* waiting for read only backends to exit */
@@ -735,6 +736,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\" or \"hot_standby\"")));
+ if (!allow_standalone_primary && max_wal_senders == 0)
+ ereport(ERROR,
+ (errmsg("WAL streaming (max_wal_senders > 0) is required if allow_standalone_primary = off")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1845,6 +1849,12 @@ retry1:
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is in recovery mode")));
break;
+ case CAC_REPLICATION_ONLY:
+ if (!am_walsender)
+ ereport(FATAL,
+ (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ errmsg("the database system is waiting for replication to start")));
+ break;
case CAC_TOOMANY:
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
@@ -1942,7 +1952,9 @@ canAcceptConnections(void)
*/
if (pmState != PM_RUN)
{
- if (pmState == PM_WAIT_BACKUP)
+ if (pmState == PM_WAIT_FOR_REPLICATION)
+ result = CAC_REPLICATION_ONLY; /* allow replication only */
+ else if (pmState == PM_WAIT_BACKUP)
result = CAC_WAITBACKUP; /* allow superusers only */
else if (Shutdown > NoShutdown)
return CAC_SHUTDOWN; /* shutdown is pending */
@@ -2396,8 +2408,13 @@ reaper(SIGNAL_ARGS)
* Startup succeeded, commence normal operations
*/
FatalError = false;
- ReachedNormalRunning = true;
- pmState = PM_RUN;
+ if (allow_standalone_primary)
+ {
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+ else
+ pmState = PM_WAIT_FOR_REPLICATION;
/*
* Crank up the background writer, if we didn't do that already
@@ -3233,8 +3250,8 @@ BackendStartup(Port *port)
/* Pass down canAcceptConnections state */
port->canAcceptConnections = canAcceptConnections();
bn->dead_end = (port->canAcceptConnections != CAC_OK &&
- port->canAcceptConnections != CAC_WAITBACKUP);
-
+ port->canAcceptConnections != CAC_WAITBACKUP &&
+ port->canAcceptConnections != CAC_REPLICATION_ONLY);
/*
* Unless it's a dead_end child, assign it a child slot number
*/
@@ -4284,6 +4301,16 @@ sigusr1_handler(SIGNAL_ARGS)
WalReceiverPID = StartWalReceiver();
}
+ if (CheckPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE) &&
+ pmState == PM_WAIT_FOR_REPLICATION)
+ {
+ /* Allow connections now that a synchronous replication standby
+ * has successfully connected and is active.
+ */
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+
PG_SETMASK(&UnBlockSig);
errno = save_errno;
@@ -4534,6 +4561,7 @@ static void
StartAutovacuumWorker(void)
{
Backend *bn;
+ CAC_state cac = CAC_OK;
/*
* If not in condition to run a process, don't try, but handle it like a
@@ -4542,7 +4570,8 @@ StartAutovacuumWorker(void)
* we have to check to avoid race-condition problems during DB state
* changes.
*/
- if (canAcceptConnections() == CAC_OK)
+ cac = canAcceptConnections();
+ if (cac == CAC_OK || cac == CAC_REPLICATION_ONLY)
{
bn = (Backend *) malloc(sizeof(Backend));
if (bn)
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 42c6eaf..3fe490e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
- repl_gram.o
+ repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 9c2e0d8..7387224 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -1,5 +1,27 @@
src/backend/replication/README
+Overview
+--------
+
+The WALSender sends WAL data and receives replies. The WALReceiver
+receives WAL data and sends replies.
+
+If there is no more WAL data to send then WALSender goes quiet,
+apart from checking for replies. If there is no more WAL data
+to receive then WALReceiver keeps sending replies until all the data
+received has been applied, then it too goes quiet. When all is quiet
+WALReceiver sends regular replies so that WALSender knows the link
+is still working - we don't want to wait until a transaction
+arrives before we try to determine the health of the connection.
+
+WALReceiver sends one reply per message received. If nothing is
+received it sends one reply every time apply pointer advances,
+with a minimum of one reply each cycletime.
+
+For synchronous replication, all decisions about whether to wait
+and how long to wait are taken on the primary. The standby has no
+state information about what is happening on the primary.
+
Walreceiver - libpqwalreceiver API
----------------------------------
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
new file mode 100644
index 0000000..12a3825
--- /dev/null
+++ b/src/backend/replication/syncrep.c
@@ -0,0 +1,641 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the standby, or the wait hits timeout.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary is aware of which
+ * standby servers offer a synchronisation service. The standby is
+ * completely unaware of the durability requirements of transactions
+ * on the primary, reducing the complexity of the code and streamlining
+ * both standby operations and network bandwidth because there is no
+ * requirement to ship per-transaction state information.
+ *
+ * The bookeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then it follows exactly one rigid definition of
+ * synchronous replication as laid out by the various parameters. If we
+ * change the definition of replication, we'll need to scan through all
+ * waiting backends to see if we should now release them.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching the through all waiters each time we receive a reply.
+ *
+ * Starting sync replication is a two stage process. First, the standby
+ * must have caught up with the primary; that may take some time. Next,
+ * we must receive a reply from the standby before we change state so
+ * that sync rep is fully active and commits can wait on us.
+ *
+ * XXX Changing state to a sync rep service while we are running allows
+ * us to enable sync replication via SIGHUP on the standby at a later
+ * time, without restart, if we need to do that. Though you can't turn
+ * it off without disconnecting.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/guc_tables.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+
+/* User-settable parameters for sync rep */
+bool sync_rep_mode = false; /* Only set in user backends */
+int sync_rep_timeout_client = 120; /* Only set in user backends */
+int sync_rep_timeout_server = 30; /* Only set in user backends */
+bool sync_rep_service = false; /* Never set in user backends */
+bool hot_standby_feedback = true;
+
+/*
+ * Queuing code is written to allow later extension to multiple
+ * queues. Currently, we use just one queue (==FSYNC).
+ *
+ * XXX We later expect to have RECV, FSYNC and APPLY modes.
+ */
+#define SYNC_REP_NOT_ON_QUEUE -1
+#define SYNC_REP_FSYNC 0
+#define IsOnSyncRepQueue() (current_queue > SYNC_REP_NOT_ON_QUEUE)
+/*
+ * Queue identifier of the queue on which user backend currently waits.
+ */
+static int current_queue = SYNC_REP_NOT_ON_QUEUE;
+
+static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid);
+static void SyncRepRemoveFromQueue(void);
+static void SyncRepAddToQueue(int qid);
+static bool SyncRepServiceAvailable(void);
+static long SyncRepGetWaitTimeout(void);
+
+static void SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn);
+
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+/*
+ * Wait for synchronous replication, if requested by user.
+ */
+extern void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if user has requested async replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (max_wal_senders == 0 || !sync_rep_mode)
+ return;
+
+ Assert(sync_rep_mode);
+
+ if (allow_standalone_primary)
+ {
+ bool avail_sync_mode;
+
+ /*
+ * Check that the service level we want is available.
+ * If not, downgrade the service level to async.
+ */
+ avail_sync_mode = SyncRepServiceAvailable();
+
+ /*
+ * Perform the wait here, then drop through and exit.
+ */
+ if (avail_sync_mode)
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+ else
+ {
+ /*
+ * Wait only on the service level requested,
+ * whether or not it is currently available.
+ * Sounds weird, but this mode exists to protect
+ * against changes that will only occur on primary.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+}
+
+/*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+static void
+SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[0]);
+ TimestampTz now = GetCurrentTransactionStopTimestamp();
+ long timeout = SyncRepGetWaitTimeout(); /* seconds */
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+
+ /*
+ * No need to wait for autovacuums. If the standby does go away and
+ * we wait for it to return we may as well do some usefulwork locally.
+ * This is critical since we may need to perform emergency vacuuming
+ * and cannot wait for standby to return.
+ */
+ if (IsAutoVacuumWorkerProcess())
+ return;
+
+ ereport(DEBUG2,
+ (errmsg("synchronous replication waiting for %X/%X starting at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTransactionStopTimestamp()))));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ /*
+ * First time through, add ourselves to the appropriate queue.
+ */
+ if (!IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ SpinLockRelease(&queue->qlock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepAddToQueue(qid);
+ SpinLockRelease(&queue->qlock);
+ current_queue = qid; /* Remember which queue we're on */
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 21 + 1);
+ memcpy(new_status, old_status, len);
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting" */
+ }
+ else
+ {
+ bool release = false;
+ bool timeout = false;
+
+ SpinLockAcquire(&queue->qlock);
+
+ /*
+ * Check the LSN on our queue and if its moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ now,
+ timeout))
+ {
+ release = true;
+ timeout = true;
+ }
+
+ if (release)
+ {
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ /*
+ * Our response to the timeout is to simply post a NOTICE and
+ * then return to the user. The commit has happened, we just
+ * haven't been able to verify it has been replicated to the
+ * level requested.
+ *
+ * XXX We could check here to see if our LSN has been sent to
+ * another standby that offers a lower level of service. That
+ * could be true if we had, for example, requested 'apply'
+ * with two standbys, one at 'apply' and one at 'recv' and the
+ * apply standby has just gone down. Something for the weekend.
+ */
+ if (timeout)
+ ereport(NOTICE,
+ (errmsg("synchronous replication timeout at %s",
+ timestamptz_to_str(now))));
+ else
+ ereport(DEBUG2,
+ (errmsg("synchronous replication wait complete at %s",
+ timestamptz_to_str(now))));
+
+ /* XXX Do we need to unset the latch? */
+ return;
+ }
+
+ SpinLockRelease(&queue->qlock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, timeout);
+ now = GetCurrentTimestamp();
+ }
+}
+
+/*
+ * Remove myself from sync rep wait queue.
+ *
+ * Assume on queue at start; will not be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ *
+ * XXX Implements design pattern "Reinvent Wheel", think about changing
+ */
+void
+SyncRepRemoveFromQueue(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[current_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+
+ Assert(IsOnSyncRepQueue());
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "removing myself from queue %d", current_queue);
+#endif
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ if (proc == MyProc)
+ {
+ elog(LOG, "proc %d lsn %X/%X is MyProc",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ else
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ numprocs++;
+ }
+
+ proc = queue->head;
+
+ if (proc == MyProc)
+ {
+ if (MyProc->lwWaitLink == NULL)
+ {
+ /*
+ * We were the only waiter on the queue. Reset head and tail.
+ */
+ Assert(queue->tail == MyProc);
+ queue->head = NULL;
+ queue->tail = NULL;
+ }
+ else
+ /*
+ * Move head to next proc on the queue.
+ */
+ queue->head = MyProc->lwWaitLink;
+ }
+ else
+ {
+ while (proc->lwWaitLink != NULL)
+ {
+ /* Are we the next proc in our traversal of the queue? */
+ if (proc->lwWaitLink == MyProc)
+ {
+ /*
+ * Remove ourselves from middle of queue.
+ * No need to touch head or tail.
+ */
+ proc->lwWaitLink = MyProc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL)
+ elog(WARNING, "could not locate ourselves on wait queue");
+ proc = proc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL) /* At tail */
+ {
+ Assert(proc == MyProc);
+ /* Remove ourselves from tail of queue */
+ Assert(queue->tail == MyProc);
+ queue->tail = proc;
+ proc->lwWaitLink = NULL;
+ }
+ }
+ MyProc->lwWaitLink = NULL;
+ current_queue = SYNC_REP_NOT_ON_QUEUE;
+}
+
+/*
+ * Add myself to sync rep wait queue.
+ *
+ * Assume not on queue at start; will be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ */
+static void
+SyncRepAddToQueue(int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+ PGPROC *tail = queue->tail;
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "adding myself to queue %d", qid);
+#endif
+
+ /*
+ * Add myself to tail of wait queue.
+ */
+ if (tail == NULL)
+ {
+ queue->head = MyProc;
+ queue->tail = MyProc;
+ }
+ else
+ {
+ /*
+ * XXX extra code needed here to maintain sorted invariant.
+ * Our approach should be same as racing car - slow in, fast out.
+ */
+ Assert(tail->lwWaitLink == NULL);
+ tail->lwWaitLink = MyProc;
+ }
+ queue->tail = MyProc;
+
+ /*
+ * This used to be an Assert, but it keeps failing... why?
+ */
+ MyProc->lwWaitLink = NULL; /* to be sure */
+}
+
+/*
+ * Dynamically decide the sync rep wait mode. It may seem a trifle
+ * wasteful to do this for every transaction but we need to do this
+ * so we can cope sensibly with standby disconnections. It's OK to
+ * spend a few cycles here anyway, since while we're doing this the
+ * WALSender will be sending the data we want to wait for, so this
+ * is dead time and the user has requested to wait anyway.
+ */
+static bool
+SyncRepServiceAvailable(void)
+{
+ bool result = false;
+
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ result = WalSndCtl->sync_rep_service_available;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+
+ return result;
+}
+
+/*
+ * Allows more complex decision making about what the wait time should be.
+ */
+static long
+SyncRepGetWaitTimeout(void)
+{
+ if (sync_rep_timeout_client <= 0)
+ return -1L;
+
+ return 1000000L * sync_rep_timeout_client;
+}
+
+void
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+/*
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+
+ if (IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+ }
+*/
+
+ if (MyProc != NULL && MyProc->ownLatch)
+ {
+ DisownLatch(&MyProc->waitLatch);
+ MyProc->ownLatch = false;
+ }
+}
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+/*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+void
+SyncRepReleaseWaiters(bool timeout)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ int mode;
+
+ /*
+ * If we are now streaming, and haven't yet enabled the sync rep service
+ * do so now. We don't enable sync rep service during a base backup since
+ * during that action we aren't sending WAL at all, so there cannot be
+ * any meaningful replies. We don't enable sync rep service while we
+ * are still in catchup mode either, since clients might experience an
+ * extended wait (perhaps hours) if they waited at that point.
+ *
+ * Note that we do release waiters, even if they aren't enabled yet.
+ * That sounds strange, but we may have dropped the connection and
+ * reconnected, so there may still be clients waiting for a response
+ * from when we were connected previously.
+ *
+ * If we already have a sync rep server connected, don't enable
+ * this server as well.
+ *
+ * XXX expect to be able to support multiple sync standbys in future.
+ */
+ if (!MyWalSnd->sync_rep_service &&
+ MyWalSnd->state == WALSNDSTATE_STREAMING &&
+ !SyncRepServiceAvailable())
+ {
+ ereport(LOG,
+ (errmsg("enabling synchronous replication service for standby")));
+
+ /*
+ * Update state for this WAL sender.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sync_rep_service = true;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
+ /*
+ * We have at least one standby, so we're open for business.
+ */
+ {
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ WalSndCtl->sync_rep_service_available = true;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+ }
+
+ /*
+ * Let postmaster know we can allow connections, if the user
+ * requested waiting until sync rep was active before starting.
+ * We send this unconditionally to avoid more complexity in
+ * postmaster code.
+ */
+ if (IsUnderPostmaster)
+ SendPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE);
+ }
+
+ /*
+ * No point trying to release waiters while doing a base backup
+ */
+ if (MyWalSnd->state == WALSNDSTATE_BACKUP)
+ return;
+
+#ifdef SYNCREP_DEBUG
+ elog(LOG, "releasing waiters up to flush = %X/%X",
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+
+
+ /*
+ * Only maintain LSNs of queues for which we advertise a service.
+ * This is important to ensure that we only wakeup users when a
+ * preferred standby has reached the required LSN.
+ *
+ * Since sycnhronous_replication_mode is currently a boolean, we either
+ * offer all modes, or none.
+ */
+ for (mode = 0; mode < NUM_SYNC_REP_WAIT_MODES; mode++)
+ {
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[mode]);
+
+ /*
+ * Lock the queue. Not really necessary with just one sync standby
+ * but it makes clear what needs to happen.
+ */
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLT(queue->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ queue->lsn = MyWalSnd->flush;
+ SyncRepWakeFromQueue(mode, MyWalSnd->flush);
+ }
+ SpinLockRelease(&queue->qlock);
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "q%d queue = %X/%X flush = %X/%X", mode,
+ queue->lsn.xlogid, queue->lsn.xrecoff,
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+ }
+}
+
+/*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold spinlock on queue.
+ */
+static void
+SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[wait_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+ int totalprocs = 0;
+
+ if (proc == NULL)
+ return;
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+
+ if (XLByteLE(proc->waitLSN, lsn))
+ {
+ numprocs++;
+ SetLatch(&proc->waitLatch);
+ }
+ totalprocs++;
+ }
+ elog(DEBUG2, "released %d procs out of %d waiting procs", numprocs, totalprocs);
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "released %d procs up to %X/%X", numprocs, lsn.xlogid, lsn.xrecoff);
+#endif
+}
+
+void
+SyncRepTimeoutExceeded(void)
+{
+ SyncRepReleaseWaiters(true);
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7005307..18b5c45 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -38,6 +38,7 @@
#include <signal.h>
#include <unistd.h>
+#include "access/transam.h"
#include "access/xlog_internal.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
@@ -45,6 +46,7 @@
#include "replication/walreceiver.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -84,9 +86,11 @@ static volatile sig_atomic_t got_SIGTERM = false;
*/
static struct
{
- XLogRecPtr Write; /* last byte + 1 written out in the standby */
- XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
-} LogstreamResult;
+ XLogRecPtr Write; /* last byte + 1 written out in the standby */
+ XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
+} LogstreamResult;
+
+static char *reply_message;
/*
* About SIGTERM handling:
@@ -114,6 +118,7 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(void);
+static void XLogWalRcvSendReply(void);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -204,6 +209,8 @@ WalReceiverMain(void)
/* Advertise our PID so that the startup process can kill us */
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_RUNNING;
+ elog(DEBUG2, "WALreceiver starting");
+ OwnLatch(&WalRcv->latch); /* Run before signals enabled, since they can wakeup latch */
/* Fetch information required to start streaming */
strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
@@ -265,12 +272,19 @@ WalReceiverMain(void)
walrcv_connect(conninfo, startpoint);
DisableWalRcvImmediateExit();
+ /*
+ * Allocate buffer that will be used for each output message. We do this
+ * just once to reduce palloc overhead.
+ */
+ reply_message = palloc(sizeof(StandbyReplyMessage));
+
/* Loop until end-of-streaming or error */
for (;;)
{
unsigned char type;
char *buf;
int len;
+ bool received_all = false;
/*
* Emergency bailout if postmaster has died. This is to avoid the
@@ -296,21 +310,44 @@ WalReceiverMain(void)
ProcessConfigFile(PGC_SIGHUP);
}
- /* Wait a while for data to arrive */
- if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
+ ResetLatch(&WalRcv->latch);
+
+ if (walrcv_receive(0, &type, &buf, &len))
{
- /* Accept the received data, and process it */
+ received_all = false;
XLogWalRcvProcessMsg(type, buf, len);
+ }
+ else
+ received_all = true;
- /* Receive any more data we can without sleeping */
- while (walrcv_receive(0, &type, &buf, &len))
- XLogWalRcvProcessMsg(type, buf, len);
+ XLogWalRcvSendReply();
+ if (received_all && !got_SIGHUP && !got_SIGTERM)
+ {
/*
- * If we've written some records, flush them to disk and let the
- * startup process know about them.
+ * Flush, then reply.
+ *
+ * XXX We really need the WALWriter active as well
*/
XLogWalRcvFlush();
+ XLogWalRcvSendReply();
+
+ /*
+ * Sleep for up to 500 ms, the fixed keepalive delay.
+ *
+ * We will be woken if new data is received from primary
+ * or if a commit is applied. This is sub-optimal in the
+ * case where a group of commits arrive, then it all goes
+ * quiet, but its not worth the extra code to handle both
+ * that and the simple case of a single commit.
+ *
+ * Note that we do not need to wake up when the Startup
+ * process has applied the last outstanding record. That
+ * is interesting iff that is a commit record.
+ */
+ pg_usleep(1000000L); /* slow down loop for debugging */
+// WaitLatchOrSocket(&WalRcv->latch, MyProcPort->sock,
+// 500000L);
}
}
}
@@ -334,6 +371,8 @@ WalRcvDie(int code, Datum arg)
walrcv->pid = 0;
SpinLockRelease(&walrcv->mutex);
+ DisownLatch(&WalRcv->latch);
+
/* Terminate the connection gracefully. */
if (walrcv_disconnect != NULL)
walrcv_disconnect();
@@ -344,6 +383,7 @@ static void
WalRcvSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ WalRcvWakeup();
}
/* SIGTERM: set flag for main loop, or shutdown immediately if safe */
@@ -351,6 +391,7 @@ static void
WalRcvShutdownHandler(SIGNAL_ARGS)
{
got_SIGTERM = true;
+ WalRcvWakeup();
/* Don't joggle the elbow of proc_exit */
if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
@@ -548,3 +589,58 @@ XLogWalRcvFlush(void)
}
}
}
+
+/*
+ * Send reply message to primary. Returns false if message send failed.
+ *
+ * Our reply consists solely of the current state of the standby. Standby
+ * doesn't make any attempt to remember requests made by transactions on
+ * the primary.
+ */
+static void
+XLogWalRcvSendReply(void)
+{
+ StandbyReplyMessage reply;
+
+ if (!sync_rep_service && !hot_standby_feedback)
+ return;
+
+ /*
+ * Set sub-protocol message type for a StandbyReplyMessage.
+ */
+ if (sync_rep_service)
+ {
+ reply.write = LogstreamResult.Write;
+ reply.flush = LogstreamResult.Flush;
+ reply.apply = GetXLogReplayRecPtr();
+ }
+
+ if (hot_standby_feedback && HotStandbyActive())
+ reply.xmin = GetOldestXmin(true, false);
+ else
+ reply.xmin = InvalidTransactionId;
+
+ reply.sendTime = GetCurrentTimestamp();
+
+ memcpy(reply_message, &reply, sizeof(StandbyReplyMessage));
+
+ elog(DEBUG2, "sending write = %X/%X "
+ "flush = %X/%X "
+ "apply = %X/%X "
+ "xmin = %d ",
+ reply.write.xlogid, reply.write.xrecoff,
+ reply.flush.xlogid, reply.flush.xrecoff,
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
+
+ walrcv_send(reply_message, sizeof(StandbyReplyMessage));
+}
+
+/* Wake up the WalRcv
+ * Prototype goes in xact.c since that is only external caller
+ */
+void
+WalRcvWakeup(void)
+{
+ SetLatch(&WalRcv->latch);
+};
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 04c9004..da97528 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -64,6 +64,7 @@ WalRcvShmemInit(void)
MemSet(WalRcv, 0, WalRcvShmemSize());
WalRcv->walRcvState = WALRCV_STOPPED;
SpinLockInit(&WalRcv->mutex);
+ InitSharedLatch(&WalRcv->latch);
}
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 78963c1..d9ff9ed 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -39,6 +39,7 @@
#include "funcapi.h"
#include "access/xlog_internal.h"
+#include "access/transam.h"
#include "catalog/pg_type.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
@@ -63,7 +64,7 @@
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
-static WalSnd *MyWalSnd = NULL;
+WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
@@ -71,6 +72,7 @@ bool am_walsender = false; /* Am I a walsender process ? */
/* User-settable parameters for walsender */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
+bool allow_standalone_primary = true; /* action if no sync standby active */
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
@@ -87,6 +89,9 @@ static uint32 sendOff = 0;
*/
static XLogRecPtr sentPtr = {0, 0};
+static StringInfoData input_message;
+static TimestampTz last_reply_timestamp;
+
/* Flags set by signal handlers for later service in main loop */
static volatile sig_atomic_t got_SIGHUP = false;
volatile sig_atomic_t walsender_shutdown_requested = false;
@@ -106,10 +111,10 @@ static void InitWalSnd(void);
static void WalSndHandshake(void);
static void WalSndKill(int code, Datum arg);
static bool XLogSend(char *msgbuf, bool *caughtup);
-static void CheckClosedConnection(void);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd * cmd);
-
+static void ProcessStandbyReplyMessage(void);
+static void ProcessRepliesIfAny(void);
/* Main entry point for walsender process */
int
@@ -147,6 +152,8 @@ WalSenderMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
PG_SETMASK(&UnBlockSig);
+ elog(DEBUG2, "WALsender starting");
+
/* Tell the standby that walsender is ready for receiving commands */
ReadyForQuery(DestRemote);
@@ -163,6 +170,8 @@ WalSenderMain(void)
SpinLockRelease(&walsnd->mutex);
}
+ elog(DEBUG2, "WALsender handshake complete");
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -173,7 +182,6 @@ WalSenderMain(void)
static void
WalSndHandshake(void)
{
- StringInfoData input_message;
bool replication_started = false;
initStringInfo(&input_message);
@@ -247,6 +255,11 @@ WalSndHandshake(void)
errmsg("invalid standby handshake message type %d", firstchar)));
}
}
+
+ /*
+ * Initialize our timeout checking mechanism.
+ */
+ last_reply_timestamp = GetCurrentTimestamp();
}
/*
@@ -414,9 +427,11 @@ HandleReplicationCommand(const char *cmd_string)
/* break out of the loop */
replication_started = true;
+ WalSndSetState(WALSNDSTATE_CATCHUP);
break;
case T_BaseBackupCmd:
+ WalSndSetState(WALSNDSTATE_BACKUP);
SendBaseBackup((BaseBackupCmd *) cmd_node);
/* Send CommandComplete and ReadyForQuery messages */
@@ -442,7 +457,7 @@ HandleReplicationCommand(const char *cmd_string)
* Check if the remote end has closed the connection.
*/
static void
-CheckClosedConnection(void)
+ProcessRepliesIfAny(void)
{
unsigned char firstchar;
int r;
@@ -466,6 +481,13 @@ CheckClosedConnection(void)
switch (firstchar)
{
/*
+ * 'd' means a standby reply wrapped in a COPY BOTH packet.
+ */
+ case 'd':
+ ProcessStandbyReplyMessage();
+ break;
+
+ /*
* 'X' means that the standby is closing down the socket.
*/
case 'X':
@@ -479,6 +501,64 @@ CheckClosedConnection(void)
}
}
+/*
+ * Receive StandbyReplyMessage. False if message send failed.
+ */
+static void
+ProcessStandbyReplyMessage(void)
+{
+ StandbyReplyMessage reply;
+
+ /*
+ * Read the message contents.
+ */
+ if (pq_getmessage(&input_message, 0))
+ {
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected EOF on standby connection")));
+ proc_exit(0);
+ }
+
+ pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
+
+ elog(DEBUG2, "write = %X/%X "
+ "flush = %X/%X "
+ "apply = %X/%X "
+ "xmin = %d ",
+ reply.write.xlogid, reply.write.xrecoff,
+ reply.flush.xlogid, reply.flush.xrecoff,
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
+
+ /*
+ * Update shared state for this WalSender process
+ * based on reply data from standby.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ if (XLByteLT(walsnd->write, reply.write))
+ walsnd->write = reply.write;
+ if (XLByteLT(walsnd->flush, reply.flush))
+ walsnd->flush = reply.flush;
+ if (XLByteLT(walsnd->apply, reply.apply))
+ walsnd->apply = reply.apply;
+ SpinLockRelease(&walsnd->mutex);
+
+ if (TransactionIdIsValid(reply.xmin) &&
+ TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+ MyProc->xmin = reply.xmin;
+ }
+
+ /*
+ * Release any backends waiting to commit.
+ */
+ SyncRepReleaseWaiters(false);
+}
+
/* Main loop of walsender process */
static int
WalSndLoop(void)
@@ -518,6 +598,7 @@ WalSndLoop(void)
{
if (!XLogSend(output_message, &caughtup))
break;
+ ProcessRepliesIfAny();
if (caughtup)
walsender_shutdown_requested = true;
}
@@ -525,7 +606,11 @@ WalSndLoop(void)
/* Normal exit from the walsender is here */
if (walsender_shutdown_requested)
{
- /* Inform the standby that XLOG streaming was done */
+ ProcessRepliesIfAny();
+
+ /* Inform the standby that XLOG streaming was done
+ * by sending CommandComplete message.
+ */
pq_puttextmessage('C', "COPY 0");
pq_flush();
@@ -533,12 +618,31 @@ WalSndLoop(void)
}
/*
- * If we had sent all accumulated WAL in last round, nap for the
- * configured time before retrying.
+ * If we had sent all accumulated WAL in last round, then we don't
+ * have much to do. We still expect a steady stream of replies from
+ * standby. It is important to note that we don't keep track of
+ * whether or not there are backends waiting here, since that
+ * is potentially very complex state information.
+ *
+ * Also note that there is no delay between sending data and
+ * checking for the replies. We expect replies to take some time
+ * and we are more concerned with overall throughput than absolute
+ * response time to any single request.
*/
if (caughtup)
{
/*
+ * If we were still catching up, change state to streaming.
+ * While in the initial catchup phase, clients waiting for
+ * a response from the standby would wait for a very long
+ * time, so we need to have a one-way state transition to avoid
+ * problems. No need to grab a lock for the check; we are the
+ * only one to ever change the state.
+ */
+ if (MyWalSnd->state < WALSNDSTATE_STREAMING)
+ WalSndSetState(WALSNDSTATE_STREAMING);
+
+ /*
* Even if we wrote all the WAL that was available when we started
* sending, more might have arrived while we were sending this
* batch. We had the latch set while sending, so we have not
@@ -551,6 +655,13 @@ WalSndLoop(void)
break;
if (caughtup && !got_SIGHUP && !walsender_ready_to_stop && !walsender_shutdown_requested)
{
+ long timeout;
+
+ if (sync_rep_timeout_server == -1)
+ timeout = -1L;
+ else
+ timeout = 1000000L * sync_rep_timeout_server;
+
/*
* XXX: We don't really need the periodic wakeups anymore,
* WaitLatchOrSocket should reliably wake up as soon as
@@ -558,12 +669,15 @@ WalSndLoop(void)
*/
/* Sleep */
- WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
- WalSndDelay * 1000L);
+ if (WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
+ timeout) == 0)
+ {
+ ereport(LOG,
+ (errmsg("streaming replication timeout after %d s",
+ sync_rep_timeout_server)));
+ break;
+ }
}
-
- /* Check if the connection was closed */
- CheckClosedConnection();
}
else
{
@@ -572,12 +686,11 @@ WalSndLoop(void)
break;
}
- /* Update our state to indicate if we're behind or not */
- WalSndSetState(caughtup ? WALSNDSTATE_STREAMING : WALSNDSTATE_CATCHUP);
+ ProcessRepliesIfAny();
}
/*
- * Get here on send failure. Clean up and exit.
+ * Get here on send failure or timeout. Clean up and exit.
*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -808,9 +921,9 @@ XLogSend(char *msgbuf, bool *caughtup)
* Attempt to send all data that's already been written out and fsync'd to
* disk. We cannot go further than what's been written out given the
* current implementation of XLogRead(). And in any case it's unsafe to
- * send WAL that is not securely down to disk on the master: if the master
+ * send WAL that is not securely down to disk on the primary: if the primary
* subsequently crashes and restarts, slaves must not have applied any WAL
- * that gets lost on the master.
+ * that gets lost on the primary.
*/
SendRqstPtr = GetFlushRecPtr();
@@ -888,6 +1001,9 @@ XLogSend(char *msgbuf, bool *caughtup)
msghdr.walEnd = SendRqstPtr;
msghdr.sendTime = GetCurrentTimestamp();
+ elog(DEBUG2, "sent = %X/%X ",
+ startptr.xlogid, startptr.xrecoff);
+
memcpy(msgbuf + 1, &msghdr, sizeof(WalDataMessageHeader));
pq_putmessage('d', msgbuf, 1 + sizeof(WalDataMessageHeader) + nbytes);
@@ -1045,6 +1161,16 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
InitSharedLatch(&walsnd->latch);
}
+
+ /*
+ * Initialise the spinlocks on each sync rep queue
+ */
+ for (i = 0; i < NUM_SYNC_REP_WAIT_MODES; i++)
+ {
+ SyncRepQueue *queue = &WalSndCtl->sync_rep_queue[i];
+
+ SpinLockInit(&queue->qlock);
+ }
}
}
@@ -1104,7 +1230,7 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 3
+#define PG_STAT_GET_WAL_SENDERS_COLS 7
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -1141,9 +1267,13 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
- char sent_location[MAXFNAMELEN];
+ char location[MAXFNAMELEN];
XLogRecPtr sentPtr;
+ XLogRecPtr write;
+ XLogRecPtr flush;
+ XLogRecPtr apply;
WalSndState state;
+ bool sync_rep_service;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -1153,13 +1283,15 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
SpinLockAcquire(&walsnd->mutex);
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ write = walsnd->write;
+ flush = walsnd->flush;
+ apply = walsnd->apply;
+ sync_rep_service = walsnd->sync_rep_service;
SpinLockRelease(&walsnd->mutex);
- snprintf(sent_location, sizeof(sent_location), "%X/%X",
- sentPtr.xlogid, sentPtr.xrecoff);
-
memset(nulls, 0, sizeof(nulls));
values[0] = Int32GetDatum(walsnd->pid);
+
if (!superuser())
{
/*
@@ -1168,11 +1300,37 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
nulls[1] = true;
nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
}
else
{
values[1] = CStringGetTextDatum(WalSndGetStateString(state));
- values[2] = CStringGetTextDatum(sent_location);
+ values[2] = BoolGetDatum(sync_rep_service);
+
+ snprintf(location, sizeof(location), "%X/%X",
+ sentPtr.xlogid, sentPtr.xrecoff);
+ values[3] = CStringGetTextDatum(location);
+
+ if (write.xlogid == 0 && write.xrecoff == 0)
+ nulls[4] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ write.xlogid, write.xrecoff);
+ values[4] = CStringGetTextDatum(location);
+
+ if (flush.xlogid == 0 && flush.xrecoff == 0)
+ nulls[5] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ flush.xlogid, flush.xrecoff);
+ values[5] = CStringGetTextDatum(location);
+
+ if (apply.xlogid == 0 && apply.xrecoff == 0)
+ nulls[6] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ apply.xlogid, apply.xrecoff);
+ values[6] = CStringGetTextDatum(location);
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index be577bc..7aa7671 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -39,6 +39,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -196,6 +197,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -214,6 +216,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -224,6 +227,7 @@ InitProcGlobal(void)
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&procs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
@@ -326,6 +330,13 @@ InitProcess(void)
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+ MyProc->ownLatch = true;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -365,6 +376,7 @@ InitProcessPhase2(void)
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2c95ef8..7cbcde4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -55,6 +55,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+#include "replication/syncrep.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/standby.h"
@@ -618,6 +619,15 @@ const char *const config_type_names[] =
static struct config_bool ConfigureNamesBool[] =
{
{
+ {"allow_standalone_primary", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Refuse connections on startup and force users to wait forever if synchronous replication has failed."),
+ NULL
+ },
+ &allow_standalone_primary,
+ true, NULL, NULL
+ },
+
+ {
{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of sequential-scan plans."),
NULL
@@ -1260,6 +1270,33 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+
+ {
+ {"synchronous_replication_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a standby to primary for synchronous replication."),
+ NULL
+ },
+ &sync_rep_service,
+ true, NULL, NULL
+ },
+
+ {
+ {"hot_standby_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a hot standby to primary to avoid query conflicts."),
+ NULL
+ },
+ &hot_standby_feedback,
+ false, NULL, NULL
+ },
+
+ {
{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
gettext_noop("Allows modifications of the structure of system tables."),
NULL,
@@ -1455,6 +1492,26 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"replication_timeout_client", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Clients waiting for confirmation will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_client,
+ 120, -1, INT_MAX, NULL, NULL
+ },
+
+ {
+ {"replication_timeout_server", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Replication connection will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_server,
+ 30, -1, INT_MAX, NULL, NULL
+ },
+
+ {
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6c6f9a9..eac4076 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,7 +184,15 @@
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
-# - Streaming Replication -
+# - Replication - User Settings
+
+#synchronous_replication = off # commit waits for reply from standby
+#replication_timeout_client = 120 # -1 means wait forever
+
+# - Streaming Replication - Server Settings
+
+#allow_standalone_primary = on # sync rep parameter
+#replication_timeout_client = 30 # -1 means wait forever
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
@@ -196,6 +204,8 @@
#hot_standby = off # "on" allows queries during recovery
# (change requires restart)
+#hot_standby_feedback = off # info from standby to prevent query conflicts
+#synchronous_replication_feedback = off # allows sync replication
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 122e96b..784b62e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -288,8 +288,10 @@ extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
extern bool RecoveryInProgress(void);
+extern bool HotStandbyActive(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
+extern XLogRecPtr GetXLogReplayRecPtr(void);
extern void UpdateControlFile(void);
extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8b5d4d..b83ed0c 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3075,7 +3075,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,23}" "{i,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25}" "{o,o,o}" "{procpid,state,sent_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,16,25,25,25,25}" "{o,o,o,o,o,o,o}" "{procpid,state,sync,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 4cdb15f..9a00b2c 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -73,7 +73,7 @@ typedef struct
typedef enum CAC_state
{
CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
- CAC_WAITBACKUP
+ CAC_WAITBACKUP, CAC_REPLICATION_ONLY
} CAC_state;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
new file mode 100644
index 0000000..a071b9a
--- /dev/null
+++ b/src/include/replication/syncrep.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _SYNCREP_H
+#define _SYNCREP_H
+
+#include "access/xlog.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+#define SyncRepRequested() (sync_rep_mode)
+#define StandbyOffersSyncRepService() (sync_rep_service)
+
+/*
+ * There is no reply from standby to primary for async mode, so the reply
+ * message needs one less slot than the maximum number of modes
+ */
+#define NUM_SYNC_REP_WAIT_MODES 1
+
+extern XLogRecPtr ReplyLSN[NUM_SYNC_REP_WAIT_MODES];
+
+/*
+ * Each synchronous rep wait mode has one SyncRepWaitQueue in shared memory.
+ * These queues live in the WAL sender shmem area.
+ */
+typedef struct SyncRepQueue
+{
+ /*
+ * Current location of the head of the queue. Nobody should be waiting
+ * on the queue for an lsn equal to or earlier than this value. Procs
+ * on the queue will always be later than this value, though we don't
+ * record those values here.
+ */
+ XLogRecPtr lsn;
+
+ PGPROC *head;
+ PGPROC *tail;
+
+ slock_t qlock; /* locks shared variables shown above */
+} SyncRepQueue;
+
+/* user-settable parameters for synchronous replication */
+extern bool sync_rep_mode;
+extern int sync_rep_timeout_client;
+extern int sync_rep_timeout_server;
+extern bool sync_rep_service;
+
+extern bool hot_standby_feedback;
+
+/* called by user backend */
+extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* called by wal sender */
+extern void SyncRepReleaseWaiters(bool timeout);
+extern void SyncRepTimeoutExceeded(void);
+
+/* callback at exit */
+extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 1993851..8a7101a 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -40,6 +40,47 @@ typedef struct
} WalDataMessageHeader;
/*
+ * Reply message from standby (message type 'r'). This is wrapped within
+ * a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef struct
+{
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to offer
+ * a valid reply for data that has only been written, not fsynced.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side does not support apply,
+ * or does not choose to apply records, as yet.
+ */
+ XLogRecPtr apply;
+
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side does not support feedback,
+ * or Hot Standby is not yet available.
+ */
+ TransactionId xmin;
+
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} StandbyReplyMessage;
+
+/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
*
* We don't have a good idea of what a good value would be; there's some
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 24ad438..a6afec4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -13,6 +13,8 @@
#define _WALRECEIVER_H
#include "access/xlogdefs.h"
+#include "replication/syncrep.h"
+#include "storage/latch.h"
#include "storage/spin.h"
#include "pgtime.h"
@@ -71,6 +73,11 @@ typedef struct
*/
char conninfo[MAXCONNINFO];
+ /*
+ * Latch used by aux procs to wake up walreceiver when it has work to do.
+ */
+ Latch latch;
+
slock_t mutex; /* locks shared variables shown above */
} WalRcvData;
@@ -92,6 +99,7 @@ extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
/* prototypes for functions in walreceiver.c */
extern void WalReceiverMain(void);
+extern void WalRcvWakeup(void);
/* prototypes for functions in walreceiverfuncs.c */
extern Size WalRcvShmemSize(void);
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 9a196ab..ce85cf2 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -15,6 +15,7 @@
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+#include "replication/syncrep.h"
#include "storage/spin.h"
@@ -35,18 +36,63 @@ typedef struct WalSnd
WalSndState state; /* this walsender's state */
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
- slock_t mutex; /* locks shared variables shown above */
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr apply;
+
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ TransactionId xmin;
/*
* Latch used by backends to wake up this walsender when it has work
* to do.
*/
Latch latch;
+
+ /*
+ * Highest level of sync rep available from this standby.
+ */
+ bool sync_rep_service;
+
+ slock_t mutex; /* locks shared variables shown above */
+
} WalSnd;
+extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Sync rep wait queues with one queue per request type.
+ * We use one queue per request type so that we can maintain the
+ * invariant that the individual queues are sorted on LSN.
+ * This may also help performance when multiple wal senders
+ * offer different sync rep service levels.
+ */
+ SyncRepQueue sync_rep_queue[NUM_SYNC_REP_WAIT_MODES];
+
+ bool sync_rep_service_available;
+
+ slock_t ctlmutex; /* locks shared variables shown above */
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
@@ -60,6 +106,7 @@ extern volatile sig_atomic_t walsender_ready_to_stop;
/* user-settable parameters */
extern int WalSndDelay;
extern int max_wal_senders;
+extern bool allow_standalone_primary;
extern int WalSenderMain(void);
extern void WalSndSignals(void);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 97bdc7b..0d2a78e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -29,6 +29,7 @@ typedef enum
PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */
PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */
PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
+ PMSIGNAL_SYNC_REPLICATION_ACTIVE, /* walsender has completed handshake */
NUM_PMSIGNALS /* Must be last value of enum! */
} PMSignalReason;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 78dbade..27b57c8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,8 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
@@ -115,6 +117,11 @@ struct PGPROC
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+ bool ownLatch; /* do we own the above latch? */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 72e5630..b070340 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1296,7 +1296,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sent_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sync, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sync, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
Robert Haas <robertmhaas@gmail.com> writes:
done in the time available is another thing entirely. I do NOT want
to still be working on the items for this CommitFest in June - that's
about when I'd like to be releasing beta3.
Except that's not how we work here. You want to change that with
respect to the release management process and schedule (or lack
thereof). Tradition and current practice say you need to reach
consensus to be able to bypass compromising.
Good luck with that.
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Mon, 2011-02-07 at 17:59 +0000, Simon Riggs wrote:
On Mon, 2011-02-07 at 12:39 -0500, Robert Haas wrote:
I just spoke to my manager at EnterpriseDB and he cleared my schedule
for the next two days to work on this. So I'll go hack on this now.
I haven't read the patch yet so I don't know for sure how quite I'll
be able to get up to speed on it, so if someone who is more familiar
with this code wants to grab the baton away from me, feel free.
Otherwise, I'll see what I can do with it.Presumably you have a reason for declaring war? I'm sorry for that, I
really am.
Simon,
My impression was that Robert had received a release from current
responsibilities to help you with your patch, not that he was declaring
war or some such thing. I believe we all want SyncRep to be successful.
Sincerely,
JD
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Mon, Feb 7, 2011 at 2:09 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
done in the time available is another thing entirely. I do NOT want
to still be working on the items for this CommitFest in June - that's
about when I'd like to be releasing beta3.Except that's not how we work here. You want to change that with
respect to the release management process and schedule (or lack
thereof). Tradition and current practice say you need to reach
consensus to be able to bypass compromising.Good luck with that.
I'm not trying to bypass compromising, and I don't know what makes you
think otherwise. I am trying to ensure that the CommitFest wraps up
in a timely fashion, which is something we have done consistently for
every CommitFest in the 9.0 and 9.1 cycles to date, including the last
CommitFest of the 9.0 cycle. It is not somehow a deviation from past
community practice to boot patches that can't be finished up in the
time available during the CommitFest. That has been routine practice
for a long time.
I have worked very hard on this CommitFest, both to line up patch
reviewers and to review myself. I want to make sure that every patch
gets a good, thorough review before the CommitFest is over. I think
there is general consensus that this is important and that we will
lose contributors if we don't do it. However, I don't want to prolong
the CommitFest indefinitely in the face of patches that the authors
are not actively working on or can't finish in the time available, or
where there is no consensus that the proposed change is what we want.
I believe that this, too, is a generally accepted principle in our
community, not something I just made up.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I just spoke to my manager at EnterpriseDB and he cleared my schedule
for the next two days to work on this. So I'll go hack on this now.
I haven't read the patch yet so I don't know for sure how quite I'll
be able to get up to speed on it, so if someone who is more familiar
with this code wants to grab the baton away from me, feel free.
Otherwise, I'll see what I can do with it.Presumably you have a reason for declaring war? I'm sorry for that, I
really am.
How is clearing out his whole schedule to help review & fix the patch
declaring war? You have an odd attitude towards assistance, Simon.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Feb 7, 2011 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 12:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.Uh, we did that with 8.4 and it was a disaster. The CommitFest lasted
*five months*. We've been doing schedule-based CommitFests ever since
and it's worked much better.
Rejecting stuff because we haven't gotten round to dealing with it in
such a short period of time is a damn good way to limit the number of
contributions we get. I don't believe we've agreed at any point that
the last commitfest should be the same time length as the others (when
we originally came up with the commitfest idea, it certainly wasn't
expected), and deciding that without giving people advanced notice is
a really good way to piss them off and encourage them to go work on
other things.
If we're going to put a time limit on this - and I think we should -
we should publish a date ASAP, that gives everyone a fair chance to
finish their work - say, 4 weeks.
Then, if we want to make the last commitfest the same length as the
others next year, we can make that decision and document those plans.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 February 2011 18:20, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Here's the latest patch for sync rep.
Here is a rebased version of this patch which applies to head of the
master branch. I haven't tested it yet beyond making sure that it
compiles and passes the regression tests -- but this fixes the bitrot.
"When the primary is started with allow_standalone_primary enabled,
the primary will not allow connections until a standby connects that
also has synchronous_replication enabled. This is a convenience to
ensure that we don't allow connections before write transactions will
return successfully."
Shouldn't this be if allow_standalone_primary is disabled?
Also spotted some indentation, spelling and grammatical issues, which
I've applied to the patch if it's of interest (attached).
--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
Attachments:
syncrep-v9.1.w.doc-fixes.patchapplication/octet-stream; name=syncrep-v9.1.w.doc-fixes.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d2a6445..96331ca 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2006,8 +2006,122 @@ SET ENABLE_SEQSCAN TO OFF;
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
+ <para>
+ You should also consider setting <varname>hot_standby_feedback</>
+ as an alternative to using this parameter.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until the
+ first reply from any standby. Multiple standby servers allow
+ increased availability and possibly increase performance as well.
+ </para>
+ <para>
+ The parameter must be set on both primary and standby.
+ </para>
+ <para>
+ On the primary, this parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ <para>
+ On the standby, the parameter value is taken only at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-allow-standalone-primary" xreflabel="allow_standalone_primary">
+ <term><varname>allow_standalone_primary</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>allow_standalone_primary</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If <varname>allow_standalone_primary</> is set, then the server
+ can operate normally whether or not replication is active. If
+ a client requests <varname>synchronous_replication</> and it is
+ not available, they will use asynchronous replication instead.
+ </para>
+ <para>
+ If <varname>allow_standalone_primary</> is not set, then the server
+ will prevent normal client connections until a standby connects that
+ has <varname>synchronous_replication_feedback</> enabled. Once
+ clients connect, if they request <varname>synchronous_replication</>
+ and it is no longer available they will wait for
+ <varname>replication_timeout_client</>.
+ </para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-replication-timeout-client" xreflabel="replication_timeout_client">
+ <term><varname>replication_timeout_client</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_client</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and a synchronous standby is currently available
+ then the commit will wait for up to <varname>replication_timeout_client</>
+ seconds before it returns a <quote>success</>. The commit will wait
+ forever for a confirmation when <varname>replication_timeout_client</>
+ is set to -1.
+ </para>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and yet no synchronous standby is available when we commit, then the
+ setting of <varname>allow_standalone_primary</> determines whether
+ or not we wait.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-replication-timeout-server" xreflabel="replication_timeout_server">
+ <term><varname>replication_timeout_server</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_server</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the primary server does not receive a reply from a standby server
+ within <varname>replication_timeout_server</> seconds then the
+ primary will terminate the replication connection.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
@@ -2098,6 +2212,42 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
+ <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>hot_standby_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether or not a hot standby will send feedback to the primary
+ about queries currently executing on the standby. This parameter can
+ be used to eliminate query cancels caused by cleanup records, though
+ it can cause database bloat on the primary for some workloads.
+ The default value is <literal>off</literal>.
+ This parameter can only be set at server start. It only has effect
+ if <varname>hot_standby</> is enabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replication-feedback" xreflabel="synchronous_replication_feedback">
+ <term><varname>synchronous_replication_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether the standby will provide reply messages to
+ allow synchronous replication on the primary.
+ Reasons for doing this might be that the standby is physically
+ co-located with the primary and so would be a bad choice as a
+ future primary server, or the standby might be a test server.
+ The default value is <literal>on</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
</sect1>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 94d5ae8..02a8e79 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -738,13 +738,12 @@ archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
</para>
<para>
- Streaming replication is asynchronous, so there is still a small delay
+ There is a small replication delay
between committing a transaction in the primary and for the changes to
become visible in the standby. The delay is however much smaller than with
file-based log shipping, typically under one second assuming the standby
is powerful enough to keep up with the load. With streaming replication,
- <varname>archive_timeout</> is not required to reduce the data loss
- window.
+ <varname>archive_timeout</> is not required.
</para>
<para>
@@ -879,6 +878,234 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover. That could be zero, or more, we do not know for certain
+ either way, when using asynchronous replication.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to at least one remote
+ standby server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ Synchronous replication works in the following way. When requested,
+ the commit of a write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability if the
+ sysadmin is cautious about the placement and management of the two servers.
+ Waiting for confirmation increases the user's confidence that the changes
+ will not be lost in the event of server crashes but it also necessarily
+ increases the response time for the requesting transaction. The minimum
+ wait time is the roundtrip time between primary to standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only final top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ Synchronous replication will be active if appropriate options are
+ enabled on both the primary and at least one standby server. If
+ options are not correctly set on both servers, the primary will use
+ use asynchronous replication by default.
+ </para>
+
+ <para>
+ On the primary server we need to set
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+ and on the standby server we need to set
+
+<programlisting>
+synchronous_replication_feedback = on
+</programlisting>
+
+ On the primary, <varname>synchronous_replication</> can be set
+ for particular users or databases, or dynamically by application
+ programs. On the standby, <varname>synchronous_replication_feedback</>
+ can only be set at server start.
+ </para>
+
+ <para>
+ If more than one standby server
+ specifies <varname>synchronous_replication_feedback</>, then whichever
+ standby replies first will release waiting commits.
+ Turning this setting off for a standby allows the administrator to
+ exclude certain standby servers from releasing waiting transactions.
+ This is useful if not all standby servers are designated as potential
+ future primary servers, such as if a standby were co-located
+ with the primary, so that a disaster would cause both servers to be lost.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ The easiest and safest method of gaining High Availability using
+ synchronous replication is to configure at least two standby servers.
+ To understand why, we need to examine what can happen when you lose all
+ standby servers.
+ </para>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until at
+ least one standby responds. The response may never occur if the last,
+ or only, standby should crash or the network drops. What should we do in
+ that situation?
+ </para>
+
+ <para>
+ Sitting and waiting will typically cause operational problems
+ because it is an effective outage of the primary server should all
+ sessions end up waiting. In contrast, allowing the primary server to
+ continue processing write transactions in the absence of a standby
+ puts those latest data changes at risk. So in this situation there
+ is a direct choice between database availability and the potential
+ durability of the data it contains. How we handle this situation
+ is controlled by <varname>allow_standalone_primary</>. The default
+ setting is <literal>on</>, allowing processing to continue, though
+ there is no recommended setting. Choosing the best setting for
+ <varname>allow_standalone_primary</> is a difficult decision and best
+ left to those with combined business responsibility for both data and
+ applications. The difficulty of this choice is the reason why we
+ recommend that you reduce the possibility of this situation occurring
+ by using multiple standby servers.
+ </para>
+
+ <para>
+ A user will stop waiting once the <varname>replication_timeout_client</>
+ has been reached for their specific session. Users are not waiting for
+ a specific standby to reply, they are waiting for a reply from any
+ standby, so the unavailability of any one standby is not significant
+ to a user. It is possible for user sessions to hit timeout even though
+ standbys are communicating normally. In that case, the setting of
+ <varname>replication_timeout</> is probably too low.
+ </para>
+
+ <para>
+ The standby sends regular status messages to the primary. If no status
+ messages have been received for <varname>replication_timeout_server</>
+ seconds the primary server will assume the connection is dead and
+ terminate it.
+ </para>
+
+ <para>
+ When the primary is started with <varname>allow_standalone_primary</>
+ enabled, the primary will not allow connections until a standby connects
+ that also has <varname>synchronous_replication</> enabled. This is a
+ convenience to ensure that we don't allow connections before write
+ transactions will return successfully.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it may not be properly
+ synchronized. The standby is only able to become a synchronous standby
+ once it has become synchronized, or "caught up" with the the primary.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shutdown, then the catch-up period
+ will increase according to the length of time the standby has been
+ down. You are advised to make sure <varname>allow_standalone_primary</>
+ is not enabled during the initial catch-up period.
+ </para>
+
+ <para>
+ If the primary crashes while commits are waiting for acknowledgement,
+ those transactions will be marked fully committed if the primary
+ database recovers, no matter how <varname>allow_standalone_primary</>
+ is set. There is no way to be certain that all standbys have received
+ all outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby. Hence this mechanism is technically
+ "semi synchronous" rather than "fully synchronous" replication. Note
+ that replication may still not be fully synchronous even if we wait
+ for all standby servers, though this would reduce availability, as
+ described previously.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1393,11 +1620,18 @@ if (!triggered)
These conflicts are <emphasis>hard conflicts</> in the sense that queries
might need to be cancelled and, in some cases, sessions disconnected to resolve them.
The user is provided with several ways to handle these
- conflicts. Conflict cases include:
+ conflicts. Conflict cases in order of likely frequency are:
<itemizedlist>
<listitem>
<para>
+ Application of a vacuum cleanup record from WAL conflicts with
+ standby transactions whose snapshots can still <quote>see</> any of
+ the rows to be removed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Access Exclusive locks taken on the primary server, including both
explicit <command>LOCK</> commands and various <acronym>DDL</>
actions, conflict with table accesses in standby queries.
@@ -1417,14 +1651,8 @@ if (!triggered)
</listitem>
<listitem>
<para>
- Application of a vacuum cleanup record from WAL conflicts with
- standby transactions whose snapshots can still <quote>see</> any of
- the rows to be removed.
- </para>
- </listitem>
- <listitem>
- <para>
- Application of a vacuum cleanup record from WAL conflicts with
+ Buffer pin deadlock caused by
+ application of a vacuum cleanup record from WAL conflicts with
queries accessing the target page on the standby, whether or not
the data to be removed is visible.
</para>
@@ -1539,17 +1767,16 @@ if (!triggered)
<para>
Remedial possibilities exist if the number of standby-query cancellations
- is found to be unacceptable. The first option is to connect to the
- primary server and keep a query active for as long as needed to
- run queries on the standby. This prevents <command>VACUUM</> from removing
- recently-dead rows and so cleanup conflicts do not occur.
- This could be done using <xref linkend="dblink"> and
- <function>pg_sleep()</>, or via other mechanisms. If you do this, you
+ is found to be unacceptable. Typically the best option is to enable
+ <varname>hot_standby_feedback</>. This prevents <command>VACUUM</> from
+ removing recently-dead rows and so cleanup conflicts do not occur.
+ If you do this, you
should note that this will delay cleanup of dead rows on the primary,
which may result in undesirable table bloat. However, the cleanup
situation will be no worse than if the standby queries were running
- directly on the primary server, and you are still getting the benefit of
- off-loading execution onto the standby.
+ directly on the primary server. You are still getting the benefit
+ of off-loading execution onto the standby and the query may complete
+ faster than it would have done on the primary server.
<varname>max_standby_archive_delay</> must be kept large in this case,
because delayed WAL files might already contain entries that conflict with
the desired standby queries.
@@ -1563,7 +1790,8 @@ if (!triggered)
a high <varname>max_standby_streaming_delay</>. However it is
difficult to guarantee any specific execution-time window with this
approach, since <varname>vacuum_defer_cleanup_age</> is measured in
- transactions executed on the primary server.
+ transactions executed on the primary server. As of version 9.1, this
+ second option is much less likely to be valuable.
</para>
<para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4fee9c3..e4607ac 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -56,6 +56,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/procarray.h"
#include "storage/sinvaladt.h"
@@ -2027,6 +2028,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1e31e07..18e9ce1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -53,6 +54,7 @@
#include "utils/snapmgr.h"
#include "pg_trace.h"
+extern void WalRcvWakeup(void); /* we are only caller, so include directly */
/*
* User-tweakable parameters
@@ -1051,7 +1053,7 @@ RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
- if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
+ if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
@@ -1121,6 +1123,14 @@ RecordTransactionCommit(void)
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
@@ -4512,6 +4522,14 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
*/
if (XactCompletionForceSyncCommit(xlrec))
XLogFlush(lsn);
+
+ /*
+ * If this standby is offering sync_rep_service then signal WALReceiver,
+ * in case it needs to send a reply just for this commit on an
+ * otherwise quiet server.
+ */
+ if (sync_rep_service)
+ WalRcvWakeup();
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 25c7e06..4b29199 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -157,6 +158,11 @@ static XLogRecPtr LastRec;
* known, need to check the shared state".
*/
static bool LocalRecoveryInProgress = true;
+/*
+ * Local copy of SharedHotStandbyActive variable. False actually means "not
+ * known, need to check the shared state".
+ */
+static bool LocalHotStandbyActive = false;
/*
* Local state for XLogInsertAllowed():
@@ -402,6 +408,12 @@ typedef struct XLogCtlData
bool SharedRecoveryInProgress;
/*
+ * SharedHotStandbyActive indicates if we're still in crash or archive
+ * recovery. Protected by info_lck.
+ */
+ bool SharedHotStandbyActive;
+
+ /*
* recoveryWakeupLatch is used to wake up the startup process to
* continue WAL replay, if it is waiting for WAL to arrive or failover
* trigger file to appear.
@@ -4893,6 +4905,7 @@ XLOGShmemInit(void)
*/
XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
XLogCtl->SharedRecoveryInProgress = true;
+ XLogCtl->SharedHotStandbyActive = false;
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
@@ -5233,6 +5246,12 @@ readRecoveryCommandFile(void)
(errmsg("recovery command file \"%s\" specified neither primary_conninfo nor restore_command",
RECOVERY_COMMAND_FILE),
errhint("The database server will regularly poll the pg_xlog subdirectory to check for files placed there.")));
+
+ if (PrimaryConnInfo == NULL && sync_rep_service)
+ ereport(WARNING,
+ (errmsg("recovery command file \"%s\" specified synchronous_replication_service yet streaming was not requested",
+ RECOVERY_COMMAND_FILE),
+ errhint("Specify primary_conninfo to allow synchronous replication.")));
}
else
{
@@ -6074,6 +6093,13 @@ StartupXLOG(void)
StandbyRecoverPreparedTransactions(false);
}
}
+ else
+ {
+ /*
+ * No need to calculate feedback if we're not in Hot Standby.
+ */
+ hot_standby_feedback = false;
+ }
/* Initialize resource managers */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
@@ -6568,8 +6594,6 @@ StartupXLOG(void)
static void
CheckRecoveryConsistency(void)
{
- static bool backendsAllowed = false;
-
/*
* Have we passed our safe starting point?
*/
@@ -6589,11 +6613,19 @@ CheckRecoveryConsistency(void)
* enabling connections.
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
- !backendsAllowed &&
+ !LocalHotStandbyActive &&
reachedMinRecoveryPoint &&
IsUnderPostmaster)
{
- backendsAllowed = true;
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->SharedHotStandbyActive = true;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ LocalHotStandbyActive = true;
+
SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
}
}
@@ -6641,6 +6673,38 @@ RecoveryInProgress(void)
}
/*
+ * Is HotStandby active yet? This is only important in special backends
+ * since normal backends won't ever be able to connect until this returns
+ * true.
+ *
+ * Unlike testing standbyState, this works in any process that's connected to
+ * shared memory.
+ */
+bool
+HotStandbyActive(void)
+{
+ /*
+ * We check shared state each time only until Hot Standby is active. We
+ * can't de-activate Hot Standby, so there's no need to keep checking after
+ * the shared variable has once been seen true.
+ */
+ if (LocalHotStandbyActive)
+ return true;
+ else
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ /* spinlock is essential on machines with weak memory ordering! */
+ SpinLockAcquire(&xlogctl->info_lck);
+ LocalHotStandbyActive = xlogctl->SharedHotStandbyActive;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return LocalHotStandbyActive;
+ }
+}
+
+/*
* Is this process allowed to insert new WAL records?
*
* Ordinarily this is essentially equivalent to !RecoveryInProgress().
@@ -9029,6 +9093,25 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
}
/*
+ * Get latest redo apply position.
+ *
+ * Exported to allow WALReceiver to read the pointer directly.
+ */
+XLogRecPtr
+GetXLogReplayRecPtr(void)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ XLogRecPtr recptr;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ recptr = xlogctl->recoveryLastRecPtr;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return recptr;
+}
+
+/*
* Report the last WAL replay location (same format as pg_start_backup etc)
*
* This is useful for determining how much of WAL is visible to read-only
@@ -9037,14 +9120,10 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
Datum
pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
{
- /* use volatile pointer to prevent code rearrangement */
- volatile XLogCtlData *xlogctl = XLogCtl;
XLogRecPtr recptr;
char location[MAXFNAMELEN];
- SpinLockAcquire(&xlogctl->info_lck);
- recptr = xlogctl->recoveryLastRecPtr;
- SpinLockRelease(&xlogctl->info_lck);
+ recptr = GetXLogReplayRecPtr();
if (recptr.xlogid == 0 && recptr.xrecoff == 0)
PG_RETURN_NULL();
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 718e996..506e908 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -502,7 +502,11 @@ CREATE VIEW pg_stat_replication AS
S.client_port,
S.backend_start,
W.state,
- W.sent_location
+ W.sync,
+ W.sent_location,
+ W.write_location,
+ W.flush_location,
+ W.apply_location
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8f77d1b..1577875 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -275,6 +275,7 @@ typedef enum
PM_STARTUP, /* waiting for startup subprocess */
PM_RECOVERY, /* in archive recovery mode */
PM_HOT_STANDBY, /* in hot standby mode */
+ PM_WAIT_FOR_REPLICATION, /* waiting for sync replication to become active */
PM_RUN, /* normal "database is alive" state */
PM_WAIT_BACKUP, /* waiting for online backup mode to end */
PM_WAIT_READONLY, /* waiting for read only backends to exit */
@@ -735,6 +736,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\" or \"hot_standby\"")));
+ if (!allow_standalone_primary && max_wal_senders == 0)
+ ereport(ERROR,
+ (errmsg("WAL streaming (max_wal_senders > 0) is required if allow_standalone_primary = off")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1845,6 +1849,12 @@ retry1:
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is in recovery mode")));
break;
+ case CAC_REPLICATION_ONLY:
+ if (!am_walsender)
+ ereport(FATAL,
+ (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ errmsg("the database system is waiting for replication to start")));
+ break;
case CAC_TOOMANY:
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
@@ -1942,7 +1952,9 @@ canAcceptConnections(void)
*/
if (pmState != PM_RUN)
{
- if (pmState == PM_WAIT_BACKUP)
+ if (pmState == PM_WAIT_FOR_REPLICATION)
+ result = CAC_REPLICATION_ONLY; /* allow replication only */
+ else if (pmState == PM_WAIT_BACKUP)
result = CAC_WAITBACKUP; /* allow superusers only */
else if (Shutdown > NoShutdown)
return CAC_SHUTDOWN; /* shutdown is pending */
@@ -2396,8 +2408,13 @@ reaper(SIGNAL_ARGS)
* Startup succeeded, commence normal operations
*/
FatalError = false;
- ReachedNormalRunning = true;
- pmState = PM_RUN;
+ if (allow_standalone_primary)
+ {
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+ else
+ pmState = PM_WAIT_FOR_REPLICATION;
/*
* Crank up the background writer, if we didn't do that already
@@ -3233,8 +3250,8 @@ BackendStartup(Port *port)
/* Pass down canAcceptConnections state */
port->canAcceptConnections = canAcceptConnections();
bn->dead_end = (port->canAcceptConnections != CAC_OK &&
- port->canAcceptConnections != CAC_WAITBACKUP);
-
+ port->canAcceptConnections != CAC_WAITBACKUP &&
+ port->canAcceptConnections != CAC_REPLICATION_ONLY);
/*
* Unless it's a dead_end child, assign it a child slot number
*/
@@ -4284,6 +4301,16 @@ sigusr1_handler(SIGNAL_ARGS)
WalReceiverPID = StartWalReceiver();
}
+ if (CheckPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE) &&
+ pmState == PM_WAIT_FOR_REPLICATION)
+ {
+ /* Allow connections now that a synchronous replication standby
+ * has successfully connected and is active.
+ */
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+
PG_SETMASK(&UnBlockSig);
errno = save_errno;
@@ -4534,6 +4561,7 @@ static void
StartAutovacuumWorker(void)
{
Backend *bn;
+ CAC_state cac = CAC_OK;
/*
* If not in condition to run a process, don't try, but handle it like a
@@ -4542,7 +4570,8 @@ StartAutovacuumWorker(void)
* we have to check to avoid race-condition problems during DB state
* changes.
*/
- if (canAcceptConnections() == CAC_OK)
+ cac = canAcceptConnections();
+ if (cac == CAC_OK || cac == CAC_REPLICATION_ONLY)
{
bn = (Backend *) malloc(sizeof(Backend));
if (bn)
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 42c6eaf..3fe490e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
- repl_gram.o
+ repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 9c2e0d8..7387224 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -1,5 +1,27 @@
src/backend/replication/README
+Overview
+--------
+
+The WALSender sends WAL data and receives replies. The WALReceiver
+receives WAL data and sends replies.
+
+If there is no more WAL data to send then WALSender goes quiet,
+apart from checking for replies. If there is no more WAL data
+to receive then WALReceiver keeps sending replies until all the data
+received has been applied, then it too goes quiet. When all is quiet
+WALReceiver sends regular replies so that WALSender knows the link
+is still working - we don't want to wait until a transaction
+arrives before we try to determine the health of the connection.
+
+WALReceiver sends one reply per message received. If nothing is
+received it sends one reply every time apply pointer advances,
+with a minimum of one reply each cycletime.
+
+For synchronous replication, all decisions about whether to wait
+and how long to wait are taken on the primary. The standby has no
+state information about what is happening on the primary.
+
Walreceiver - libpqwalreceiver API
----------------------------------
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
new file mode 100644
index 0000000..12a3825
--- /dev/null
+++ b/src/backend/replication/syncrep.c
@@ -0,0 +1,641 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the standby, or the wait hits timeout.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary is aware of which
+ * standby servers offer a synchronisation service. The standby is
+ * completely unaware of the durability requirements of transactions
+ * on the primary, reducing the complexity of the code and streamlining
+ * both standby operations and network bandwidth because there is no
+ * requirement to ship per-transaction state information.
+ *
+ * The bookeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then it follows exactly one rigid definition of
+ * synchronous replication as laid out by the various parameters. If we
+ * change the definition of replication, we'll need to scan through all
+ * waiting backends to see if we should now release them.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching the through all waiters each time we receive a reply.
+ *
+ * Starting sync replication is a two stage process. First, the standby
+ * must have caught up with the primary; that may take some time. Next,
+ * we must receive a reply from the standby before we change state so
+ * that sync rep is fully active and commits can wait on us.
+ *
+ * XXX Changing state to a sync rep service while we are running allows
+ * us to enable sync replication via SIGHUP on the standby at a later
+ * time, without restart, if we need to do that. Though you can't turn
+ * it off without disconnecting.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/guc_tables.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+
+/* User-settable parameters for sync rep */
+bool sync_rep_mode = false; /* Only set in user backends */
+int sync_rep_timeout_client = 120; /* Only set in user backends */
+int sync_rep_timeout_server = 30; /* Only set in user backends */
+bool sync_rep_service = false; /* Never set in user backends */
+bool hot_standby_feedback = true;
+
+/*
+ * Queuing code is written to allow later extension to multiple
+ * queues. Currently, we use just one queue (==FSYNC).
+ *
+ * XXX We later expect to have RECV, FSYNC and APPLY modes.
+ */
+#define SYNC_REP_NOT_ON_QUEUE -1
+#define SYNC_REP_FSYNC 0
+#define IsOnSyncRepQueue() (current_queue > SYNC_REP_NOT_ON_QUEUE)
+/*
+ * Queue identifier of the queue on which user backend currently waits.
+ */
+static int current_queue = SYNC_REP_NOT_ON_QUEUE;
+
+static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid);
+static void SyncRepRemoveFromQueue(void);
+static void SyncRepAddToQueue(int qid);
+static bool SyncRepServiceAvailable(void);
+static long SyncRepGetWaitTimeout(void);
+
+static void SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn);
+
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+/*
+ * Wait for synchronous replication, if requested by user.
+ */
+extern void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if user has requested async replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (max_wal_senders == 0 || !sync_rep_mode)
+ return;
+
+ Assert(sync_rep_mode);
+
+ if (allow_standalone_primary)
+ {
+ bool avail_sync_mode;
+
+ /*
+ * Check that the service level we want is available.
+ * If not, downgrade the service level to async.
+ */
+ avail_sync_mode = SyncRepServiceAvailable();
+
+ /*
+ * Perform the wait here, then drop through and exit.
+ */
+ if (avail_sync_mode)
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+ else
+ {
+ /*
+ * Wait only on the service level requested,
+ * whether or not it is currently available.
+ * Sounds weird, but this mode exists to protect
+ * against changes that will only occur on primary.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+}
+
+/*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+static void
+SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[0]);
+ TimestampTz now = GetCurrentTransactionStopTimestamp();
+ long timeout = SyncRepGetWaitTimeout(); /* seconds */
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+
+ /*
+ * No need to wait for autovacuums. If the standby does go away and
+ * we wait for it to return we may as well do some usefulwork locally.
+ * This is critical since we may need to perform emergency vacuuming
+ * and cannot wait for standby to return.
+ */
+ if (IsAutoVacuumWorkerProcess())
+ return;
+
+ ereport(DEBUG2,
+ (errmsg("synchronous replication waiting for %X/%X starting at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTransactionStopTimestamp()))));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ /*
+ * First time through, add ourselves to the appropriate queue.
+ */
+ if (!IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ SpinLockRelease(&queue->qlock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepAddToQueue(qid);
+ SpinLockRelease(&queue->qlock);
+ current_queue = qid; /* Remember which queue we're on */
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 21 + 1);
+ memcpy(new_status, old_status, len);
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting" */
+ }
+ else
+ {
+ bool release = false;
+ bool timeout = false;
+
+ SpinLockAcquire(&queue->qlock);
+
+ /*
+ * Check the LSN on our queue and if its moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ now,
+ timeout))
+ {
+ release = true;
+ timeout = true;
+ }
+
+ if (release)
+ {
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ /*
+ * Our response to the timeout is to simply post a NOTICE and
+ * then return to the user. The commit has happened, we just
+ * haven't been able to verify it has been replicated to the
+ * level requested.
+ *
+ * XXX We could check here to see if our LSN has been sent to
+ * another standby that offers a lower level of service. That
+ * could be true if we had, for example, requested 'apply'
+ * with two standbys, one at 'apply' and one at 'recv' and the
+ * apply standby has just gone down. Something for the weekend.
+ */
+ if (timeout)
+ ereport(NOTICE,
+ (errmsg("synchronous replication timeout at %s",
+ timestamptz_to_str(now))));
+ else
+ ereport(DEBUG2,
+ (errmsg("synchronous replication wait complete at %s",
+ timestamptz_to_str(now))));
+
+ /* XXX Do we need to unset the latch? */
+ return;
+ }
+
+ SpinLockRelease(&queue->qlock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, timeout);
+ now = GetCurrentTimestamp();
+ }
+}
+
+/*
+ * Remove myself from sync rep wait queue.
+ *
+ * Assume on queue at start; will not be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ *
+ * XXX Implements design pattern "Reinvent Wheel", think about changing
+ */
+void
+SyncRepRemoveFromQueue(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[current_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+
+ Assert(IsOnSyncRepQueue());
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "removing myself from queue %d", current_queue);
+#endif
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ if (proc == MyProc)
+ {
+ elog(LOG, "proc %d lsn %X/%X is MyProc",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ else
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ numprocs++;
+ }
+
+ proc = queue->head;
+
+ if (proc == MyProc)
+ {
+ if (MyProc->lwWaitLink == NULL)
+ {
+ /*
+ * We were the only waiter on the queue. Reset head and tail.
+ */
+ Assert(queue->tail == MyProc);
+ queue->head = NULL;
+ queue->tail = NULL;
+ }
+ else
+ /*
+ * Move head to next proc on the queue.
+ */
+ queue->head = MyProc->lwWaitLink;
+ }
+ else
+ {
+ while (proc->lwWaitLink != NULL)
+ {
+ /* Are we the next proc in our traversal of the queue? */
+ if (proc->lwWaitLink == MyProc)
+ {
+ /*
+ * Remove ourselves from middle of queue.
+ * No need to touch head or tail.
+ */
+ proc->lwWaitLink = MyProc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL)
+ elog(WARNING, "could not locate ourselves on wait queue");
+ proc = proc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL) /* At tail */
+ {
+ Assert(proc == MyProc);
+ /* Remove ourselves from tail of queue */
+ Assert(queue->tail == MyProc);
+ queue->tail = proc;
+ proc->lwWaitLink = NULL;
+ }
+ }
+ MyProc->lwWaitLink = NULL;
+ current_queue = SYNC_REP_NOT_ON_QUEUE;
+}
+
+/*
+ * Add myself to sync rep wait queue.
+ *
+ * Assume not on queue at start; will be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ */
+static void
+SyncRepAddToQueue(int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+ PGPROC *tail = queue->tail;
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "adding myself to queue %d", qid);
+#endif
+
+ /*
+ * Add myself to tail of wait queue.
+ */
+ if (tail == NULL)
+ {
+ queue->head = MyProc;
+ queue->tail = MyProc;
+ }
+ else
+ {
+ /*
+ * XXX extra code needed here to maintain sorted invariant.
+ * Our approach should be same as racing car - slow in, fast out.
+ */
+ Assert(tail->lwWaitLink == NULL);
+ tail->lwWaitLink = MyProc;
+ }
+ queue->tail = MyProc;
+
+ /*
+ * This used to be an Assert, but it keeps failing... why?
+ */
+ MyProc->lwWaitLink = NULL; /* to be sure */
+}
+
+/*
+ * Dynamically decide the sync rep wait mode. It may seem a trifle
+ * wasteful to do this for every transaction but we need to do this
+ * so we can cope sensibly with standby disconnections. It's OK to
+ * spend a few cycles here anyway, since while we're doing this the
+ * WALSender will be sending the data we want to wait for, so this
+ * is dead time and the user has requested to wait anyway.
+ */
+static bool
+SyncRepServiceAvailable(void)
+{
+ bool result = false;
+
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ result = WalSndCtl->sync_rep_service_available;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+
+ return result;
+}
+
+/*
+ * Allows more complex decision making about what the wait time should be.
+ */
+static long
+SyncRepGetWaitTimeout(void)
+{
+ if (sync_rep_timeout_client <= 0)
+ return -1L;
+
+ return 1000000L * sync_rep_timeout_client;
+}
+
+void
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+/*
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+
+ if (IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+ }
+*/
+
+ if (MyProc != NULL && MyProc->ownLatch)
+ {
+ DisownLatch(&MyProc->waitLatch);
+ MyProc->ownLatch = false;
+ }
+}
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+/*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+void
+SyncRepReleaseWaiters(bool timeout)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ int mode;
+
+ /*
+ * If we are now streaming, and haven't yet enabled the sync rep service
+ * do so now. We don't enable sync rep service during a base backup since
+ * during that action we aren't sending WAL at all, so there cannot be
+ * any meaningful replies. We don't enable sync rep service while we
+ * are still in catchup mode either, since clients might experience an
+ * extended wait (perhaps hours) if they waited at that point.
+ *
+ * Note that we do release waiters, even if they aren't enabled yet.
+ * That sounds strange, but we may have dropped the connection and
+ * reconnected, so there may still be clients waiting for a response
+ * from when we were connected previously.
+ *
+ * If we already have a sync rep server connected, don't enable
+ * this server as well.
+ *
+ * XXX expect to be able to support multiple sync standbys in future.
+ */
+ if (!MyWalSnd->sync_rep_service &&
+ MyWalSnd->state == WALSNDSTATE_STREAMING &&
+ !SyncRepServiceAvailable())
+ {
+ ereport(LOG,
+ (errmsg("enabling synchronous replication service for standby")));
+
+ /*
+ * Update state for this WAL sender.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sync_rep_service = true;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
+ /*
+ * We have at least one standby, so we're open for business.
+ */
+ {
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ WalSndCtl->sync_rep_service_available = true;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+ }
+
+ /*
+ * Let postmaster know we can allow connections, if the user
+ * requested waiting until sync rep was active before starting.
+ * We send this unconditionally to avoid more complexity in
+ * postmaster code.
+ */
+ if (IsUnderPostmaster)
+ SendPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE);
+ }
+
+ /*
+ * No point trying to release waiters while doing a base backup
+ */
+ if (MyWalSnd->state == WALSNDSTATE_BACKUP)
+ return;
+
+#ifdef SYNCREP_DEBUG
+ elog(LOG, "releasing waiters up to flush = %X/%X",
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+
+
+ /*
+ * Only maintain LSNs of queues for which we advertise a service.
+ * This is important to ensure that we only wakeup users when a
+ * preferred standby has reached the required LSN.
+ *
+ * Since sycnhronous_replication_mode is currently a boolean, we either
+ * offer all modes, or none.
+ */
+ for (mode = 0; mode < NUM_SYNC_REP_WAIT_MODES; mode++)
+ {
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[mode]);
+
+ /*
+ * Lock the queue. Not really necessary with just one sync standby
+ * but it makes clear what needs to happen.
+ */
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLT(queue->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ queue->lsn = MyWalSnd->flush;
+ SyncRepWakeFromQueue(mode, MyWalSnd->flush);
+ }
+ SpinLockRelease(&queue->qlock);
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "q%d queue = %X/%X flush = %X/%X", mode,
+ queue->lsn.xlogid, queue->lsn.xrecoff,
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+ }
+}
+
+/*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold spinlock on queue.
+ */
+static void
+SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[wait_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+ int totalprocs = 0;
+
+ if (proc == NULL)
+ return;
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+
+ if (XLByteLE(proc->waitLSN, lsn))
+ {
+ numprocs++;
+ SetLatch(&proc->waitLatch);
+ }
+ totalprocs++;
+ }
+ elog(DEBUG2, "released %d procs out of %d waiting procs", numprocs, totalprocs);
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "released %d procs up to %X/%X", numprocs, lsn.xlogid, lsn.xrecoff);
+#endif
+}
+
+void
+SyncRepTimeoutExceeded(void)
+{
+ SyncRepReleaseWaiters(true);
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7005307..18b5c45 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -38,6 +38,7 @@
#include <signal.h>
#include <unistd.h>
+#include "access/transam.h"
#include "access/xlog_internal.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
@@ -45,6 +46,7 @@
#include "replication/walreceiver.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -84,9 +86,11 @@ static volatile sig_atomic_t got_SIGTERM = false;
*/
static struct
{
- XLogRecPtr Write; /* last byte + 1 written out in the standby */
- XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
-} LogstreamResult;
+ XLogRecPtr Write; /* last byte + 1 written out in the standby */
+ XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
+} LogstreamResult;
+
+static char *reply_message;
/*
* About SIGTERM handling:
@@ -114,6 +118,7 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(void);
+static void XLogWalRcvSendReply(void);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -204,6 +209,8 @@ WalReceiverMain(void)
/* Advertise our PID so that the startup process can kill us */
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_RUNNING;
+ elog(DEBUG2, "WALreceiver starting");
+ OwnLatch(&WalRcv->latch); /* Run before signals enabled, since they can wakeup latch */
/* Fetch information required to start streaming */
strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
@@ -265,12 +272,19 @@ WalReceiverMain(void)
walrcv_connect(conninfo, startpoint);
DisableWalRcvImmediateExit();
+ /*
+ * Allocate buffer that will be used for each output message. We do this
+ * just once to reduce palloc overhead.
+ */
+ reply_message = palloc(sizeof(StandbyReplyMessage));
+
/* Loop until end-of-streaming or error */
for (;;)
{
unsigned char type;
char *buf;
int len;
+ bool received_all = false;
/*
* Emergency bailout if postmaster has died. This is to avoid the
@@ -296,21 +310,44 @@ WalReceiverMain(void)
ProcessConfigFile(PGC_SIGHUP);
}
- /* Wait a while for data to arrive */
- if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
+ ResetLatch(&WalRcv->latch);
+
+ if (walrcv_receive(0, &type, &buf, &len))
{
- /* Accept the received data, and process it */
+ received_all = false;
XLogWalRcvProcessMsg(type, buf, len);
+ }
+ else
+ received_all = true;
- /* Receive any more data we can without sleeping */
- while (walrcv_receive(0, &type, &buf, &len))
- XLogWalRcvProcessMsg(type, buf, len);
+ XLogWalRcvSendReply();
+ if (received_all && !got_SIGHUP && !got_SIGTERM)
+ {
/*
- * If we've written some records, flush them to disk and let the
- * startup process know about them.
+ * Flush, then reply.
+ *
+ * XXX We really need the WALWriter active as well
*/
XLogWalRcvFlush();
+ XLogWalRcvSendReply();
+
+ /*
+ * Sleep for up to 500 ms, the fixed keepalive delay.
+ *
+ * We will be woken if new data is received from primary
+ * or if a commit is applied. This is sub-optimal in the
+ * case where a group of commits arrive, then it all goes
+ * quiet, but its not worth the extra code to handle both
+ * that and the simple case of a single commit.
+ *
+ * Note that we do not need to wake up when the Startup
+ * process has applied the last outstanding record. That
+ * is interesting iff that is a commit record.
+ */
+ pg_usleep(1000000L); /* slow down loop for debugging */
+// WaitLatchOrSocket(&WalRcv->latch, MyProcPort->sock,
+// 500000L);
}
}
}
@@ -334,6 +371,8 @@ WalRcvDie(int code, Datum arg)
walrcv->pid = 0;
SpinLockRelease(&walrcv->mutex);
+ DisownLatch(&WalRcv->latch);
+
/* Terminate the connection gracefully. */
if (walrcv_disconnect != NULL)
walrcv_disconnect();
@@ -344,6 +383,7 @@ static void
WalRcvSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ WalRcvWakeup();
}
/* SIGTERM: set flag for main loop, or shutdown immediately if safe */
@@ -351,6 +391,7 @@ static void
WalRcvShutdownHandler(SIGNAL_ARGS)
{
got_SIGTERM = true;
+ WalRcvWakeup();
/* Don't joggle the elbow of proc_exit */
if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
@@ -548,3 +589,58 @@ XLogWalRcvFlush(void)
}
}
}
+
+/*
+ * Send reply message to primary. Returns false if message send failed.
+ *
+ * Our reply consists solely of the current state of the standby. Standby
+ * doesn't make any attempt to remember requests made by transactions on
+ * the primary.
+ */
+static void
+XLogWalRcvSendReply(void)
+{
+ StandbyReplyMessage reply;
+
+ if (!sync_rep_service && !hot_standby_feedback)
+ return;
+
+ /*
+ * Set sub-protocol message type for a StandbyReplyMessage.
+ */
+ if (sync_rep_service)
+ {
+ reply.write = LogstreamResult.Write;
+ reply.flush = LogstreamResult.Flush;
+ reply.apply = GetXLogReplayRecPtr();
+ }
+
+ if (hot_standby_feedback && HotStandbyActive())
+ reply.xmin = GetOldestXmin(true, false);
+ else
+ reply.xmin = InvalidTransactionId;
+
+ reply.sendTime = GetCurrentTimestamp();
+
+ memcpy(reply_message, &reply, sizeof(StandbyReplyMessage));
+
+ elog(DEBUG2, "sending write = %X/%X "
+ "flush = %X/%X "
+ "apply = %X/%X "
+ "xmin = %d ",
+ reply.write.xlogid, reply.write.xrecoff,
+ reply.flush.xlogid, reply.flush.xrecoff,
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
+
+ walrcv_send(reply_message, sizeof(StandbyReplyMessage));
+}
+
+/* Wake up the WalRcv
+ * Prototype goes in xact.c since that is only external caller
+ */
+void
+WalRcvWakeup(void)
+{
+ SetLatch(&WalRcv->latch);
+};
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 04c9004..da97528 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -64,6 +64,7 @@ WalRcvShmemInit(void)
MemSet(WalRcv, 0, WalRcvShmemSize());
WalRcv->walRcvState = WALRCV_STOPPED;
SpinLockInit(&WalRcv->mutex);
+ InitSharedLatch(&WalRcv->latch);
}
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 78963c1..d9ff9ed 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -39,6 +39,7 @@
#include "funcapi.h"
#include "access/xlog_internal.h"
+#include "access/transam.h"
#include "catalog/pg_type.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
@@ -63,7 +64,7 @@
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
-static WalSnd *MyWalSnd = NULL;
+WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
@@ -71,6 +72,7 @@ bool am_walsender = false; /* Am I a walsender process ? */
/* User-settable parameters for walsender */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
+bool allow_standalone_primary = true; /* action if no sync standby active */
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
@@ -87,6 +89,9 @@ static uint32 sendOff = 0;
*/
static XLogRecPtr sentPtr = {0, 0};
+static StringInfoData input_message;
+static TimestampTz last_reply_timestamp;
+
/* Flags set by signal handlers for later service in main loop */
static volatile sig_atomic_t got_SIGHUP = false;
volatile sig_atomic_t walsender_shutdown_requested = false;
@@ -106,10 +111,10 @@ static void InitWalSnd(void);
static void WalSndHandshake(void);
static void WalSndKill(int code, Datum arg);
static bool XLogSend(char *msgbuf, bool *caughtup);
-static void CheckClosedConnection(void);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd * cmd);
-
+static void ProcessStandbyReplyMessage(void);
+static void ProcessRepliesIfAny(void);
/* Main entry point for walsender process */
int
@@ -147,6 +152,8 @@ WalSenderMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
PG_SETMASK(&UnBlockSig);
+ elog(DEBUG2, "WALsender starting");
+
/* Tell the standby that walsender is ready for receiving commands */
ReadyForQuery(DestRemote);
@@ -163,6 +170,8 @@ WalSenderMain(void)
SpinLockRelease(&walsnd->mutex);
}
+ elog(DEBUG2, "WALsender handshake complete");
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -173,7 +182,6 @@ WalSenderMain(void)
static void
WalSndHandshake(void)
{
- StringInfoData input_message;
bool replication_started = false;
initStringInfo(&input_message);
@@ -247,6 +255,11 @@ WalSndHandshake(void)
errmsg("invalid standby handshake message type %d", firstchar)));
}
}
+
+ /*
+ * Initialize our timeout checking mechanism.
+ */
+ last_reply_timestamp = GetCurrentTimestamp();
}
/*
@@ -414,9 +427,11 @@ HandleReplicationCommand(const char *cmd_string)
/* break out of the loop */
replication_started = true;
+ WalSndSetState(WALSNDSTATE_CATCHUP);
break;
case T_BaseBackupCmd:
+ WalSndSetState(WALSNDSTATE_BACKUP);
SendBaseBackup((BaseBackupCmd *) cmd_node);
/* Send CommandComplete and ReadyForQuery messages */
@@ -442,7 +457,7 @@ HandleReplicationCommand(const char *cmd_string)
* Check if the remote end has closed the connection.
*/
static void
-CheckClosedConnection(void)
+ProcessRepliesIfAny(void)
{
unsigned char firstchar;
int r;
@@ -466,6 +481,13 @@ CheckClosedConnection(void)
switch (firstchar)
{
/*
+ * 'd' means a standby reply wrapped in a COPY BOTH packet.
+ */
+ case 'd':
+ ProcessStandbyReplyMessage();
+ break;
+
+ /*
* 'X' means that the standby is closing down the socket.
*/
case 'X':
@@ -479,6 +501,64 @@ CheckClosedConnection(void)
}
}
+/*
+ * Receive StandbyReplyMessage. False if message send failed.
+ */
+static void
+ProcessStandbyReplyMessage(void)
+{
+ StandbyReplyMessage reply;
+
+ /*
+ * Read the message contents.
+ */
+ if (pq_getmessage(&input_message, 0))
+ {
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected EOF on standby connection")));
+ proc_exit(0);
+ }
+
+ pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
+
+ elog(DEBUG2, "write = %X/%X "
+ "flush = %X/%X "
+ "apply = %X/%X "
+ "xmin = %d ",
+ reply.write.xlogid, reply.write.xrecoff,
+ reply.flush.xlogid, reply.flush.xrecoff,
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
+
+ /*
+ * Update shared state for this WalSender process
+ * based on reply data from standby.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ if (XLByteLT(walsnd->write, reply.write))
+ walsnd->write = reply.write;
+ if (XLByteLT(walsnd->flush, reply.flush))
+ walsnd->flush = reply.flush;
+ if (XLByteLT(walsnd->apply, reply.apply))
+ walsnd->apply = reply.apply;
+ SpinLockRelease(&walsnd->mutex);
+
+ if (TransactionIdIsValid(reply.xmin) &&
+ TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+ MyProc->xmin = reply.xmin;
+ }
+
+ /*
+ * Release any backends waiting to commit.
+ */
+ SyncRepReleaseWaiters(false);
+}
+
/* Main loop of walsender process */
static int
WalSndLoop(void)
@@ -518,6 +598,7 @@ WalSndLoop(void)
{
if (!XLogSend(output_message, &caughtup))
break;
+ ProcessRepliesIfAny();
if (caughtup)
walsender_shutdown_requested = true;
}
@@ -525,7 +606,11 @@ WalSndLoop(void)
/* Normal exit from the walsender is here */
if (walsender_shutdown_requested)
{
- /* Inform the standby that XLOG streaming was done */
+ ProcessRepliesIfAny();
+
+ /* Inform the standby that XLOG streaming was done
+ * by sending CommandComplete message.
+ */
pq_puttextmessage('C', "COPY 0");
pq_flush();
@@ -533,12 +618,31 @@ WalSndLoop(void)
}
/*
- * If we had sent all accumulated WAL in last round, nap for the
- * configured time before retrying.
+ * If we had sent all accumulated WAL in last round, then we don't
+ * have much to do. We still expect a steady stream of replies from
+ * standby. It is important to note that we don't keep track of
+ * whether or not there are backends waiting here, since that
+ * is potentially very complex state information.
+ *
+ * Also note that there is no delay between sending data and
+ * checking for the replies. We expect replies to take some time
+ * and we are more concerned with overall throughput than absolute
+ * response time to any single request.
*/
if (caughtup)
{
/*
+ * If we were still catching up, change state to streaming.
+ * While in the initial catchup phase, clients waiting for
+ * a response from the standby would wait for a very long
+ * time, so we need to have a one-way state transition to avoid
+ * problems. No need to grab a lock for the check; we are the
+ * only one to ever change the state.
+ */
+ if (MyWalSnd->state < WALSNDSTATE_STREAMING)
+ WalSndSetState(WALSNDSTATE_STREAMING);
+
+ /*
* Even if we wrote all the WAL that was available when we started
* sending, more might have arrived while we were sending this
* batch. We had the latch set while sending, so we have not
@@ -551,6 +655,13 @@ WalSndLoop(void)
break;
if (caughtup && !got_SIGHUP && !walsender_ready_to_stop && !walsender_shutdown_requested)
{
+ long timeout;
+
+ if (sync_rep_timeout_server == -1)
+ timeout = -1L;
+ else
+ timeout = 1000000L * sync_rep_timeout_server;
+
/*
* XXX: We don't really need the periodic wakeups anymore,
* WaitLatchOrSocket should reliably wake up as soon as
@@ -558,12 +669,15 @@ WalSndLoop(void)
*/
/* Sleep */
- WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
- WalSndDelay * 1000L);
+ if (WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
+ timeout) == 0)
+ {
+ ereport(LOG,
+ (errmsg("streaming replication timeout after %d s",
+ sync_rep_timeout_server)));
+ break;
+ }
}
-
- /* Check if the connection was closed */
- CheckClosedConnection();
}
else
{
@@ -572,12 +686,11 @@ WalSndLoop(void)
break;
}
- /* Update our state to indicate if we're behind or not */
- WalSndSetState(caughtup ? WALSNDSTATE_STREAMING : WALSNDSTATE_CATCHUP);
+ ProcessRepliesIfAny();
}
/*
- * Get here on send failure. Clean up and exit.
+ * Get here on send failure or timeout. Clean up and exit.
*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -808,9 +921,9 @@ XLogSend(char *msgbuf, bool *caughtup)
* Attempt to send all data that's already been written out and fsync'd to
* disk. We cannot go further than what's been written out given the
* current implementation of XLogRead(). And in any case it's unsafe to
- * send WAL that is not securely down to disk on the master: if the master
+ * send WAL that is not securely down to disk on the primary: if the primary
* subsequently crashes and restarts, slaves must not have applied any WAL
- * that gets lost on the master.
+ * that gets lost on the primary.
*/
SendRqstPtr = GetFlushRecPtr();
@@ -888,6 +1001,9 @@ XLogSend(char *msgbuf, bool *caughtup)
msghdr.walEnd = SendRqstPtr;
msghdr.sendTime = GetCurrentTimestamp();
+ elog(DEBUG2, "sent = %X/%X ",
+ startptr.xlogid, startptr.xrecoff);
+
memcpy(msgbuf + 1, &msghdr, sizeof(WalDataMessageHeader));
pq_putmessage('d', msgbuf, 1 + sizeof(WalDataMessageHeader) + nbytes);
@@ -1045,6 +1161,16 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
InitSharedLatch(&walsnd->latch);
}
+
+ /*
+ * Initialise the spinlocks on each sync rep queue
+ */
+ for (i = 0; i < NUM_SYNC_REP_WAIT_MODES; i++)
+ {
+ SyncRepQueue *queue = &WalSndCtl->sync_rep_queue[i];
+
+ SpinLockInit(&queue->qlock);
+ }
}
}
@@ -1104,7 +1230,7 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 3
+#define PG_STAT_GET_WAL_SENDERS_COLS 7
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -1141,9 +1267,13 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
- char sent_location[MAXFNAMELEN];
+ char location[MAXFNAMELEN];
XLogRecPtr sentPtr;
+ XLogRecPtr write;
+ XLogRecPtr flush;
+ XLogRecPtr apply;
WalSndState state;
+ bool sync_rep_service;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -1153,13 +1283,15 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
SpinLockAcquire(&walsnd->mutex);
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ write = walsnd->write;
+ flush = walsnd->flush;
+ apply = walsnd->apply;
+ sync_rep_service = walsnd->sync_rep_service;
SpinLockRelease(&walsnd->mutex);
- snprintf(sent_location, sizeof(sent_location), "%X/%X",
- sentPtr.xlogid, sentPtr.xrecoff);
-
memset(nulls, 0, sizeof(nulls));
values[0] = Int32GetDatum(walsnd->pid);
+
if (!superuser())
{
/*
@@ -1168,11 +1300,37 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
nulls[1] = true;
nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ nulls[6] = true;
}
else
{
values[1] = CStringGetTextDatum(WalSndGetStateString(state));
- values[2] = CStringGetTextDatum(sent_location);
+ values[2] = BoolGetDatum(sync_rep_service);
+
+ snprintf(location, sizeof(location), "%X/%X",
+ sentPtr.xlogid, sentPtr.xrecoff);
+ values[3] = CStringGetTextDatum(location);
+
+ if (write.xlogid == 0 && write.xrecoff == 0)
+ nulls[4] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ write.xlogid, write.xrecoff);
+ values[4] = CStringGetTextDatum(location);
+
+ if (flush.xlogid == 0 && flush.xrecoff == 0)
+ nulls[5] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ flush.xlogid, flush.xrecoff);
+ values[5] = CStringGetTextDatum(location);
+
+ if (apply.xlogid == 0 && apply.xrecoff == 0)
+ nulls[6] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ apply.xlogid, apply.xrecoff);
+ values[6] = CStringGetTextDatum(location);
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index be577bc..7aa7671 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -39,6 +39,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -196,6 +197,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -214,6 +216,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -224,6 +227,7 @@ InitProcGlobal(void)
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&procs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
@@ -326,6 +330,13 @@ InitProcess(void)
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+ MyProc->ownLatch = true;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -365,6 +376,7 @@ InitProcessPhase2(void)
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2c95ef8..7cbcde4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -55,6 +55,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+#include "replication/syncrep.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/standby.h"
@@ -618,6 +619,15 @@ const char *const config_type_names[] =
static struct config_bool ConfigureNamesBool[] =
{
{
+ {"allow_standalone_primary", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Refuse connections on startup and force users to wait forever if synchronous replication has failed."),
+ NULL
+ },
+ &allow_standalone_primary,
+ true, NULL, NULL
+ },
+
+ {
{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of sequential-scan plans."),
NULL
@@ -1260,6 +1270,33 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+
+ {
+ {"synchronous_replication_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a standby to primary for synchronous replication."),
+ NULL
+ },
+ &sync_rep_service,
+ true, NULL, NULL
+ },
+
+ {
+ {"hot_standby_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a hot standby to primary to avoid query conflicts."),
+ NULL
+ },
+ &hot_standby_feedback,
+ false, NULL, NULL
+ },
+
+ {
{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
gettext_noop("Allows modifications of the structure of system tables."),
NULL,
@@ -1455,6 +1492,26 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"replication_timeout_client", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Clients waiting for confirmation will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_client,
+ 120, -1, INT_MAX, NULL, NULL
+ },
+
+ {
+ {"replication_timeout_server", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Replication connection will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_server,
+ 30, -1, INT_MAX, NULL, NULL
+ },
+
+ {
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6c6f9a9..eac4076 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,7 +184,15 @@
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
-# - Streaming Replication -
+# - Replication - User Settings
+
+#synchronous_replication = off # commit waits for reply from standby
+#replication_timeout_client = 120 # -1 means wait forever
+
+# - Streaming Replication - Server Settings
+
+#allow_standalone_primary = on # sync rep parameter
+#replication_timeout_client = 30 # -1 means wait forever
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
@@ -196,6 +204,8 @@
#hot_standby = off # "on" allows queries during recovery
# (change requires restart)
+#hot_standby_feedback = off # info from standby to prevent query conflicts
+#synchronous_replication_feedback = off # allows sync replication
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 122e96b..784b62e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -288,8 +288,10 @@ extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
extern bool RecoveryInProgress(void);
+extern bool HotStandbyActive(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
+extern XLogRecPtr GetXLogReplayRecPtr(void);
extern void UpdateControlFile(void);
extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8b5d4d..b83ed0c 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3075,7 +3075,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,23}" "{i,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25}" "{o,o,o}" "{procpid,state,sent_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,16,25,25,25,25}" "{o,o,o,o,o,o,o}" "{procpid,state,sync,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 4cdb15f..9a00b2c 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -73,7 +73,7 @@ typedef struct
typedef enum CAC_state
{
CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
- CAC_WAITBACKUP
+ CAC_WAITBACKUP, CAC_REPLICATION_ONLY
} CAC_state;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
new file mode 100644
index 0000000..a071b9a
--- /dev/null
+++ b/src/include/replication/syncrep.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _SYNCREP_H
+#define _SYNCREP_H
+
+#include "access/xlog.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+#define SyncRepRequested() (sync_rep_mode)
+#define StandbyOffersSyncRepService() (sync_rep_service)
+
+/*
+ * There is no reply from standby to primary for async mode, so the reply
+ * message needs one less slot than the maximum number of modes
+ */
+#define NUM_SYNC_REP_WAIT_MODES 1
+
+extern XLogRecPtr ReplyLSN[NUM_SYNC_REP_WAIT_MODES];
+
+/*
+ * Each synchronous rep wait mode has one SyncRepWaitQueue in shared memory.
+ * These queues live in the WAL sender shmem area.
+ */
+typedef struct SyncRepQueue
+{
+ /*
+ * Current location of the head of the queue. Nobody should be waiting
+ * on the queue for an lsn equal to or earlier than this value. Procs
+ * on the queue will always be later than this value, though we don't
+ * record those values here.
+ */
+ XLogRecPtr lsn;
+
+ PGPROC *head;
+ PGPROC *tail;
+
+ slock_t qlock; /* locks shared variables shown above */
+} SyncRepQueue;
+
+/* user-settable parameters for synchronous replication */
+extern bool sync_rep_mode;
+extern int sync_rep_timeout_client;
+extern int sync_rep_timeout_server;
+extern bool sync_rep_service;
+
+extern bool hot_standby_feedback;
+
+/* called by user backend */
+extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* called by wal sender */
+extern void SyncRepReleaseWaiters(bool timeout);
+extern void SyncRepTimeoutExceeded(void);
+
+/* callback at exit */
+extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 1993851..8a7101a 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -40,6 +40,47 @@ typedef struct
} WalDataMessageHeader;
/*
+ * Reply message from standby (message type 'r'). This is wrapped within
+ * a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef struct
+{
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to offer
+ * a valid reply for data that has only been written, not fsynced.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side does not support apply,
+ * or does not choose to apply records, as yet.
+ */
+ XLogRecPtr apply;
+
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side does not support feedback,
+ * or Hot Standby is not yet available.
+ */
+ TransactionId xmin;
+
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} StandbyReplyMessage;
+
+/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
*
* We don't have a good idea of what a good value would be; there's some
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 24ad438..a6afec4 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -13,6 +13,8 @@
#define _WALRECEIVER_H
#include "access/xlogdefs.h"
+#include "replication/syncrep.h"
+#include "storage/latch.h"
#include "storage/spin.h"
#include "pgtime.h"
@@ -71,6 +73,11 @@ typedef struct
*/
char conninfo[MAXCONNINFO];
+ /*
+ * Latch used by aux procs to wake up walreceiver when it has work to do.
+ */
+ Latch latch;
+
slock_t mutex; /* locks shared variables shown above */
} WalRcvData;
@@ -92,6 +99,7 @@ extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
/* prototypes for functions in walreceiver.c */
extern void WalReceiverMain(void);
+extern void WalRcvWakeup(void);
/* prototypes for functions in walreceiverfuncs.c */
extern Size WalRcvShmemSize(void);
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 9a196ab..ce85cf2 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -15,6 +15,7 @@
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+#include "replication/syncrep.h"
#include "storage/spin.h"
@@ -35,18 +36,63 @@ typedef struct WalSnd
WalSndState state; /* this walsender's state */
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
- slock_t mutex; /* locks shared variables shown above */
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr apply;
+
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ TransactionId xmin;
/*
* Latch used by backends to wake up this walsender when it has work
* to do.
*/
Latch latch;
+
+ /*
+ * Highest level of sync rep available from this standby.
+ */
+ bool sync_rep_service;
+
+ slock_t mutex; /* locks shared variables shown above */
+
} WalSnd;
+extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Sync rep wait queues with one queue per request type.
+ * We use one queue per request type so that we can maintain the
+ * invariant that the individual queues are sorted on LSN.
+ * This may also help performance when multiple wal senders
+ * offer different sync rep service levels.
+ */
+ SyncRepQueue sync_rep_queue[NUM_SYNC_REP_WAIT_MODES];
+
+ bool sync_rep_service_available;
+
+ slock_t ctlmutex; /* locks shared variables shown above */
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
@@ -60,6 +106,7 @@ extern volatile sig_atomic_t walsender_ready_to_stop;
/* user-settable parameters */
extern int WalSndDelay;
extern int max_wal_senders;
+extern bool allow_standalone_primary;
extern int WalSenderMain(void);
extern void WalSndSignals(void);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 97bdc7b..0d2a78e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -29,6 +29,7 @@ typedef enum
PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */
PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */
PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
+ PMSIGNAL_SYNC_REPLICATION_ACTIVE, /* walsender has completed handshake */
NUM_PMSIGNALS /* Must be last value of enum! */
} PMSignalReason;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 78dbade..27b57c8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,8 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
@@ -115,6 +117,11 @@ struct PGPROC
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+ bool ownLatch; /* do we own the above latch? */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 72e5630..b070340 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1296,7 +1296,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sent_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sync, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sync, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Mon, Feb 7, 2011 at 2:56 PM, Dave Page <dpage@pgadmin.org> wrote:
On Mon, Feb 7, 2011 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 12:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.Uh, we did that with 8.4 and it was a disaster. The CommitFest lasted
*five months*. We've been doing schedule-based CommitFests ever since
and it's worked much better.Rejecting stuff because we haven't gotten round to dealing with it in
such a short period of time is a damn good way to limit the number of
contributions we get. I don't believe we've agreed at any point that
the last commitfest should be the same time length as the others
News to me.
http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Development_Plan
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 7, 2011 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 2:56 PM, Dave Page <dpage@pgadmin.org> wrote:
On Mon, Feb 7, 2011 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 12:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.Uh, we did that with 8.4 and it was a disaster. The CommitFest lasted
*five months*. We've been doing schedule-based CommitFests ever since
and it's worked much better.Rejecting stuff because we haven't gotten round to dealing with it in
such a short period of time is a damn good way to limit the number of
contributions we get. I don't believe we've agreed at any point that
the last commitfest should be the same time length as the othersNews to me.
http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Development_Plan
Yes, and? It doesn't say beta 1 at the after a month of the last
commitfest, which is the milestone which marks the end of development.
It says alpha 4, and possibly more alphas. It's pretty clear that it
is expected that development and polishing will continue past the 20th
February.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
Dave Page <dpage@pgadmin.org> wrote:
Robert Haas <robertmhaas@gmail.com> wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development
cycle was that it kept going till we'd dealt with everything.
Arbitrarily rejecting stuff we haven't dealt with doesn't seem
fair.Uh, we did that with 8.4 and it was a disaster. The CommitFest
lasted *five months*. We've been doing schedule-based
CommitFests ever since and it's worked much better.Rejecting stuff because we haven't gotten round to dealing with
it in such a short period of time is a damn good way to limit the
number of contributions we get. I don't believe we've agreed at
any point that the last commitfest should be the same time length
as the othersNews to me.
http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Development_Plan
I believe that with tighter management of the process, it should be
possible to reduce the average delay between someone writing a
feature and that feature appearing in a production release by about
two months without compromising quality. Getting hypothetical for a
moment, delaying release of 50 features for two months to allow
release of one feature ten months earlier is likely to frustrate a
lot more people than having the train leave the station on time and
putting that one feature into the next release.
My impression was that Robert is trying to find a way to help get
Simon's patch into this release without holding everything up for
it. In my book, that's not a declaration of war; it's community
spirit.
-Kevin
Robert Haas <robertmhaas@gmail.com> writes:
I'm not trying to bypass compromising, and I don't know what makes you
think otherwise. I am trying to ensure that the CommitFest wraps up
Well, I'm too tired to allow myself posting such comments, sorry to have
left the previous mail through. More than one commit fest saw its time
frame extended for 1 or 2 weeks already, I think, all I'm saying is that
this one will certainly not be an exception, and that's for the best.
Be sure I appreciate the efforts you're putting into the mix!
Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On 2/7/11 11:41 AM, Robert Haas wrote:
However, I don't want to prolong
the CommitFest indefinitely in the face of patches that the authors
are not actively working on or can't finish in the time available, or
where there is no consensus that the proposed change is what we want.
I believe that this, too, is a generally accepted principle in our
community, not something I just made up.
+1.
I, for one, would vote against extending beta if Sync Rep isn't ready
yet. There's plenty of other "big features" in 9.1, and Sync Rep will
benefit from having additional development time given the number of
major spec points we only cleared up a few weeks ago.
I think the majority of our users would prefer a 9.1 in May to one that
has Sync Rep and is delivered in September. If they had a choice.
Speaking of which, time to do some reviewing ...
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Feb 7, 2011 at 3:06 PM, Dave Page <dpage@pgadmin.org> wrote:
Rejecting stuff because we haven't gotten round to dealing with it in
such a short period of time is a damn good way to limit the number of
contributions we get. I don't believe we've agreed at any point that
the last commitfest should be the same time length as the othersNews to me.
http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Development_Plan
Yes, and? It doesn't say beta 1 at the after a month of the last
commitfest, which is the milestone which marks the end of development.
It says alpha 4, and possibly more alphas. It's pretty clear that it
is expected that development and polishing will continue past the 20th
February.
You're moving the bar. It DOES say that the CommitFest will end on
February 15th. Now, if we want to have a discussion about changing
that, let's have that discussion (perhaps on a thread where the
subject has something to do with the topic), but we DID talk about
this, it WAS agreed, and it's been sitting there on the wiki for
something like 8 months. Obviously, there will continue to be
polishing after the CommitFest is over, but that's not the same thing
as saying we're going to lengthen the CommitFest itself.
I think we need to step back a few paces here and talk about what
we're trying to accomplish by making some change to the proposed and
agreed CommitFest schedule. If there's a concern that some patches
haven't been thoroughly reviewed at this point, then I think that's a
fair concern, and let's talk about which ones they are and see what we
can do about it. I don't believe that's the case, and it's certainly
not the case for sync rep, which was submitted in an unpolished state
by Simon's own admission, reviewed and discussed, then sat for three
weeks without an update. So perhaps the concern is that sync rep is a
make or break for this release. OK, then fine, let's talk about
whether it's worth slipping the release for that feature. I have no
problem with either of those conversations, and I'm happy to offer my
opinions and listen to the opinions of others, and we can make some
decision.
I think, though, that we need to be explicit about what we're doing,
and why we're doing it. I have been working hard on this CommitFest
for a long time (since approximately a month before it started) at the
cost of development projects I would have liked to have worked on,
because I knew we were going to be overwhelmed with patches. I have
helped as many people as I can with as many patches as I have been
able to. I think that finishing on time (or at least as close to on
time as we can manage) is important to our success as a development
community, just as having good features is. We don't have to agree on
what the best thing to do is, but I would certainly appreciate it if
everyone could at least credit me with acting in good faith.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 2011-02-07 at 12:24 -0800, Josh Berkus wrote:
+1.
I, for one, would vote against extending beta if Sync Rep isn't ready
yet. There's plenty of other "big features" in 9.1, and Sync Rep will
benefit from having additional development time given the number of
major spec points we only cleared up a few weeks ago.I think the majority of our users would prefer a 9.1 in May to one that
has Sync Rep and is delivered in September. If they had a choice.
+1
JD
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Mon, Feb 7, 2011 at 9:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
You're moving the bar. It DOES say that the CommitFest will end on
February 15th. Now, if we want to have a discussion about changing
that, let's have that discussion (perhaps on a thread where the
subject has something to do with the topic), but we DID talk about
this, it WAS agreed, and it's been sitting there on the wiki for
something like 8 months. Obviously, there will continue to be
polishing after the CommitFest is over, but that's not the same thing
as saying we're going to lengthen the CommitFest itself.
I'm not moving the bar - I'm talking practically. Regardless of when
we consider the commitfest itself over, development and commit work of
new features has always continued until beta 1, and that has not
changed as far as I'm aware.
I think, though, that we need to be explicit about what we're doing,
and why we're doing it. I have been working hard on this CommitFest
for a long time (since approximately a month before it started) at the
cost of development projects I would have liked to have worked on,
because I knew we were going to be overwhelmed with patches. I have
helped as many people as I can with as many patches as I have been
able to. I think that finishing on time (or at least as close to on
time as we can manage) is important to our success as a development
community, just as having good features is. We don't have to agree on
what the best thing to do is, but I would certainly appreciate it if
everyone could at least credit me with acting in good faith.
Oh, I have absolutely no doubt you're working in good faith, and
personally I thank you for the hard work you've put in. I just
disagree with your interpretation of the timetable.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 7, 2011 at 3:14 PM, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I'm not trying to bypass compromising, and I don't know what makes you
think otherwise. I am trying to ensure that the CommitFest wraps upWell, I'm too tired to allow myself posting such comments, sorry to have
left the previous mail through.
Thanks, I understand.
More than one commit fest saw its time
frame extended for 1 or 2 weeks already, I think, all I'm saying is that
this one will certainly not be an exception, and that's for the best.
We've actually done really well. The last CommitFest in 9.0 wrapped
up on 2/17 (two days late), and the others were mostly right on time
as well. The CommitFests for 9.1 ended on: 8/15 (on time), 10/26 (9
days late, but there was no activity on the last two of those days, so
say 7 days late), and 12/21 (six days late). As far as I can tell,
the difference primarily has to do with who manages the CommitFests
and how aggressively they follow up on patches that are dropped. The
last CommitFest we have that really ran late was the final CommitFest
of the 8.4 cycle, and it was that event that led me to accept Josh
Berkus's invitation to be a CF manager for the first 9.0 CommitFest.
Because five month CommitFests with the tree frozen are awful and
sucky for everyone except the people who are getting extra time to
finish their patches, and they aren't really that great for those
people either.
As far as I am concerned, everything from now until we've released a
stable beta with no known issues is time that I can't spend doing
development. So I'd like to minimize that time - not by arbitrarily
throwing patches out the window - but by a combination of postponing
patches that are not done and working my ass off to finish as much as
possible.
Be sure I appreciate the efforts you're putting into the mix!
Thanks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 7, 2011 at 3:34 PM, Dave Page <dpage@pgadmin.org> wrote:
On Mon, Feb 7, 2011 at 9:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
You're moving the bar. It DOES say that the CommitFest will end on
February 15th. Now, if we want to have a discussion about changing
that, let's have that discussion (perhaps on a thread where the
subject has something to do with the topic), but we DID talk about
this, it WAS agreed, and it's been sitting there on the wiki for
something like 8 months. Obviously, there will continue to be
polishing after the CommitFest is over, but that's not the same thing
as saying we're going to lengthen the CommitFest itself.I'm not moving the bar - I'm talking practically. Regardless of when
we consider the commitfest itself over, development and commit work of
new features has always continued until beta 1, and that has not
changed as far as I'm aware.
I don't think that's really true. Go back and read the output of 'git
log REL9_0_BETA1'. It's bug fixes, rearrangements of things that were
committed but turned out to be controversial, documentation work,
release note editing, pgindent crap... sure, it wasn't a totally hard
freeze, but it was pretty solid slush. We did a good job not letting
things drag out, and FWIW I think that was a good decision. I don't
remember too many people being unhappy about their patches getting
punted, either. There were one or two, but generally we punted things
that needed major rework or just weren't getting updated in a timely
fashion, and that, combined with a lot of hard work on Tom's part
among others, worked fine.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
dpage@pgadmin.org (Dave Page) writes:
On Mon, Feb 7, 2011 at 6:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 12:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.Uh, we did that with 8.4 and it was a disaster. The CommitFest lasted
*five months*. We've been doing schedule-based CommitFests ever since
and it's worked much better.Rejecting stuff because we haven't gotten round to dealing with it in
such a short period of time is a damn good way to limit the number of
contributions we get. I don't believe we've agreed at any point that
the last commitfest should be the same time length as the others (when
we originally came up with the commitfest idea, it certainly wasn't
expected), and deciding that without giving people advanced notice is
a really good way to piss them off and encourage them to go work on
other things.If we're going to put a time limit on this - and I think we should -
we should publish a date ASAP, that gives everyone a fair chance to
finish their work - say, 4 weeks.Then, if we want to make the last commitfest the same length as the
others next year, we can make that decision and document those plans.
There *is* a problem that there doesn't seem to be enough time to
readily allow development of larger features without people getting
stuck fighting with the release periods. But that's not the problem
taking place here. It was documented, last May, that the final
CommitFest for 9.1 was to complete 2011-02-15, and there did seem to be
agreement on that.
It sure looks to me like there are going to be a bunch of items that,
based on the recognized policies, need to get deferred to 9.2, and the
prospects for Sync Rep getting into 9.1 don't look notably good to me.
Looking at things statistically, the 9.1 commitfests have had the
following numbers of items:
#1 - 2010-09 - 52, of which 26 were committed
#2 - 2010-11 - 43, of which 23 were committed
#3 - 2011-01 - 98, of which 35 have been committed, and 10 are
considered ready to commit.
It may appear unfair to not offer everyone a "fair chance to finish
their work," but it's not as if the date wasn't published Plenty Long
Ago. and well-publicized.
But deferring the end of the CommitFest would be Not Fair to those that
*did* get their proposed changes ready for the preceding Fests. We
cannot evade unfairness.
It's definitely readily arguable that fairness requires that:
- Items not committable by 2011-02-15 be deferred to the 2011-Next fest
There are around 25 items right now that are sitting with [Waiting
for Author] and [Returned with Feedback] statuses. They largely seem
like pretty fair game for "next fest."
- Large items that weren't included in the 2010-11 fest be considered
problematic to try to integrate into 9.1
There sure seem to be some large items in the 2011-01 fest, which I
thought wasn't supposed to be the case.
We shouldn't just impose policy for the sake of imposing policy, but I
do recall Really Long CommitFests being pretty disastrous. And there's
*SO* much outstanding in this particular fest that it's getting past
time for doing some substantial triage so that reviewer attentions may
be directed towards the items most likely to be acceptable for 9.1.
I hate to think that 9.1 won't include Simon's SR material, but that may
have to be.
--
http://www3.sympatico.ca/cbbrowne/slony.html
"It's a pretty rare beginner who isn't clueless. If beginners weren't
clueless, the infamous Unix learning cliff wouldn't be a problem."
-- david parsons
On Mon, Feb 7, 2011 at 9:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 3:34 PM, Dave Page <dpage@pgadmin.org> wrote:
On Mon, Feb 7, 2011 at 9:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
You're moving the bar. It DOES say that the CommitFest will end on
February 15th. Now, if we want to have a discussion about changing
that, let's have that discussion (perhaps on a thread where the
subject has something to do with the topic), but we DID talk about
this, it WAS agreed, and it's been sitting there on the wiki for
something like 8 months. Obviously, there will continue to be
polishing after the CommitFest is over, but that's not the same thing
as saying we're going to lengthen the CommitFest itself.I'm not moving the bar - I'm talking practically. Regardless of when
we consider the commitfest itself over, development and commit work of
new features has always continued until beta 1, and that has not
changed as far as I'm aware.I don't think that's really true. Go back and read the output of 'git
log REL9_0_BETA1'. It's bug fixes, rearrangements of things that were
committed but turned out to be controversial, documentation work,
release note editing, pgindent crap... sure, it wasn't a totally hard
freeze, but it was pretty solid slush. We did a good job not letting
things drag out, and FWIW I think that was a good decision. I don't
remember too many people being unhappy about their patches getting
punted, either. There were one or two, but generally we punted things
that needed major rework or just weren't getting updated in a timely
fashion, and that, combined with a lot of hard work on Tom's part
among others, worked fine.
I guess we disagree on what we consider to be "development" then. Just
looking back to April, I see various committers whacking things around
that look to me like the fine tuning and completion of earlier
patches.
Oh - and just so we're clear... I too want us to get the release out
promptly, I'm just concerned that we don't blindside developers.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 2011-02-07 at 15:25 -0500, Robert Haas wrote:
I would certainly appreciate it if
everyone could at least credit me with acting in good faith.
I think you are, if that helps.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, 2011-02-07 at 11:50 -0800, Josh Berkus wrote:
I just spoke to my manager at EnterpriseDB and he cleared my schedule
for the next two days to work on this. So I'll go hack on this now.
I haven't read the patch yet so I don't know for sure how quite I'll
be able to get up to speed on it, so if someone who is more familiar
with this code wants to grab the baton away from me, feel free.
Otherwise, I'll see what I can do with it.Presumably you have a reason for declaring war? I'm sorry for that, I
really am.How is clearing out his whole schedule to help review & fix the patch
declaring war? You have an odd attitude towards assistance, Simon.
It seems likely that Robert had not read my reply where I said I had
time to work on this project before posting. In that case, I withdraw my
comments and apologise.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Feb 7, 2011 at 5:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2011-02-07 at 11:50 -0800, Josh Berkus wrote:
I just spoke to my manager at EnterpriseDB and he cleared my schedule
for the next two days to work on this. So I'll go hack on this now.
I haven't read the patch yet so I don't know for sure how quite I'll
be able to get up to speed on it, so if someone who is more familiar
with this code wants to grab the baton away from me, feel free.
Otherwise, I'll see what I can do with it.Presumably you have a reason for declaring war? I'm sorry for that, I
really am.How is clearing out his whole schedule to help review & fix the patch
declaring war? You have an odd attitude towards assistance, Simon.It seems likely that Robert had not read my reply where I said I had
time to work on this project before posting. In that case, I withdraw my
comments and apologise.
I did read it, but I still don't think saying I'm going to work on a
patch that's been stalled for weeks - or really months - constitutes
any sort of declaration of war, peace, or anything else. I have just
as much of a right to work on a given feature as you or anyone else
does. Typically, when I work on patches and help get them committed,
the response is "thanks". I'm not so sure what's different in this
case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 7, 2011 at 1:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Here's the latest patch for sync rep.
Here is a rebased version of this patch which applies to head of the
master branch. I haven't tested it yet beyond making sure that it
compiles and passes the regression tests -- but this fixes the bitrot.
As I mentioned yesterday that I would do, I spent some time working on
this. I think that there are somewhere between three and six
independently useful features in this patch, plus a few random changes
to the documentation that I'm not sure whether want or not (e.g.
replacing master by primary in a few places, or the other way around).
One problem with the core synchronous replication technology is that
walreceiver cannot both receive WAL and write WAL at the same time.
It switches back and forth between reading WAL from the network socket
and flushing it to disk. The impact of that is somewhat mitigated in
the current patch because it only implements the "fsync" level of
replication, and chances are that the network read time is small
compared to the fsync time. But it would certainly suck for the
"receive" level we've talked about having in the past, because after
receiving each batch of WAL, the WAL receiver wouldn't be able to send
any more acknowledgments until the fsync completed, and that's bound
to be slow. I'm not really sure how bad it will be in "fsync" mode;
it may be tolerable, but as Simon noted in a comment, in the long run
it'd certainly be nicer to have the WAL writer process running during
recovery.
As a general comment on the quality of the code, I think that the
overall logic is probably sound, but there are an awful lot of
debugging leftovers and inconsistencies between various parts of the
patch. For example, when I initially tested it, *asynchronous*
replication kept breaking between the master and the standby, and I
couldn't figure out why. I finally realized that there was a ten
second pause that had been inserting into the WAL receiver loop as a
debugging tool which was allowing the standby to get far enough behind
that the master was able to recycle WAL segments the standby still
needed. Under ordinary circumstances, I would say that a patch like
this was not mature enough to submit for review, let alone commit.
For that reason, I am pretty doubtful about the chances of getting
this finished for 9.1 without some substantial prolongation of the
schedule.
That having been said, there is at least one part of this patch which
looks to be in pretty good shape and seems independently useful
regardless of what happens to the rest of it, and that is the code
that sends replies from the standby back to the primary. This allows
pg_stat_replication to display the write/flush/apply log positions on
the standby next to the sent position on the primary, which as far as
I am concerned is pure gold. Simon had this set up to happen only
when synchronous replication or XID feedback in use, but I think
people are going to want it even with plain old asynchronous
replication, because it provides a FAR easier way to monitor standby
lag than anything we have today. I've extracted this portion of the
patch, cleaned it up a bit, written docs, and attached it here.
I wasn't too sure how to control the timing of the replies. It's
worth noting that you have to send them pretty frequently for the
distinction between xlog written and xlog flushed to have any value.
What I've done here is made it so that every time we read all
available data on the socket, we send a reply. After flushing, we
send another reply. And then just for the heck of it we send a reply
at least every 10 seconds (configurable), which causes the
last-known-apply position to eventually get updated on the master.
This means the apply position can lag reality by a bit. Simon's
version adds a latch, so that the startup process can poke the WAL
receiver to send a reply when the apply position moves. But this is
substantially more complex and I'm not sure it's worth it. If we were
implementing the "apply" level of synchronized replication, we'd
clearly need that for performance not to stink. But since the patch
is only implementing "fsync" anyway, it doesn't seem necessary for
now.
The only real complaint I can imagine about offering this
functionality all the time is that it uses extra bandwidth. I'm
inclined to think that the ability to shut it off completely is
sufficient answer to that complaint.
<dons asbestos underwear>
Comments?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
wal-receiver-replies.patchapplication/octet-stream; name=wal-receiver-replies.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5a43774..63c6283 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1984,6 +1984,29 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-wal-receiver-status-interval" xreflabel="wal_receiver_status_interval">
+ <term><varname>wal_receiver_status_interval</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>wal_receiver_status_interval</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies the minimum frequency, in seconds, for the WAL receiver
+ process on the standby to send information about replication progress
+ to the primary, where they can be seen using the
+ <literal>pg_stat_replication</literal> view. The standby will report
+ the last transaction log position it has written, the last position it
+ has flushed to disk, and the last position it has applied. Updates are
+ sent each time the write or flush positions changed, or at least as
+ often as specified by this parameter. Thus, the apply position may
+ lag slightly behind the true position. Setting this parameter to zero
+ disables status updates completely. This parameter can only be set in
+ the <filename>postgresql.conf</> file or on the server command line.
+ The default value is 10 seconds.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-vacuum-defer-cleanup-age" xreflabel="vacuum_defer_cleanup_age">
<term><varname>vacuum_defer_cleanup_age</varname> (<type>integer</type>)</term>
<indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index ca83421..f3481bb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -298,8 +298,11 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
<entry><structname>pg_stat_replication</><indexterm><primary>pg_stat_replication</primary></indexterm></entry>
<entry>One row per WAL sender process, showing process <acronym>ID</>,
user OID, user name, application name, client's address and port number,
- time at which the server process began execution, current WAL sender
- state and transaction log location. The columns detailing what exactly
+ time at which the server process began execution, and the current WAL
+ sender state and transaction log location. In addition, the standby
+ reports the last transaction log position it received and wrote, the last
+ position it flushed to disk, and the last position it replayed, and this
+ information is also displayed here. The columns detailing what exactly
the connection is doing are only visible if the user examining the view
is a superuser.
</entry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 25c7e06..3680811 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9029,6 +9029,25 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
}
/*
+ * Get latest redo apply position.
+ *
+ * Exported to allow WALReceiver to read the pointer directly.
+ */
+XLogRecPtr
+GetXLogReplayRecPtr(void)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ XLogRecPtr recptr;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ recptr = xlogctl->recoveryLastRecPtr;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return recptr;
+}
+
+/*
* Report the last WAL replay location (same format as pg_start_backup etc)
*
* This is useful for determining how much of WAL is visible to read-only
@@ -9037,14 +9056,10 @@ pg_last_xlog_receive_location(PG_FUNCTION_ARGS)
Datum
pg_last_xlog_replay_location(PG_FUNCTION_ARGS)
{
- /* use volatile pointer to prevent code rearrangement */
- volatile XLogCtlData *xlogctl = XLogCtl;
XLogRecPtr recptr;
char location[MAXFNAMELEN];
- SpinLockAcquire(&xlogctl->info_lck);
- recptr = xlogctl->recoveryLastRecPtr;
- SpinLockRelease(&xlogctl->info_lck);
+ recptr = GetXLogReplayRecPtr();
if (recptr.xlogid == 0 && recptr.xrecoff == 0)
PG_RETURN_NULL();
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 718e996..40e94ba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -502,7 +502,10 @@ CREATE VIEW pg_stat_replication AS
S.client_port,
S.backend_start,
W.state,
- W.sent_location
+ W.sent_location,
+ W.write_location,
+ W.flush_location,
+ W.apply_location
FROM pg_stat_get_activity(NULL) AS S, pg_authid U,
pg_stat_get_wal_senders() AS W
WHERE S.usesysid = U.oid AND
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7005307..35cd121 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -53,6 +53,7 @@
/* Global variable to indicate if this process is a walreceiver process */
bool am_walreceiver;
+int wal_receiver_status_interval;
/* libpqreceiver hooks to these when loaded */
walrcv_connect_type walrcv_connect = NULL;
@@ -88,6 +89,8 @@ static struct
XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
} LogstreamResult;
+static StandbyReplyMessage reply_message;
+
/*
* About SIGTERM handling:
*
@@ -114,6 +117,7 @@ static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(void);
+static void XLogWalRcvSendReply(void);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -306,12 +310,24 @@ WalReceiverMain(void)
while (walrcv_receive(0, &type, &buf, &len))
XLogWalRcvProcessMsg(type, buf, len);
+ /* Let the master know that we received some data. */
+ XLogWalRcvSendReply();
+
/*
* If we've written some records, flush them to disk and let the
* startup process know about them.
*/
XLogWalRcvFlush();
}
+
+ /*
+ * Send a status update to the master.
+ *
+ * If we received any data this cycle, the flush position will have
+ * advanced; and the apply position may have advanced whether we got
+ * any new data or not.
+ */
+ XLogWalRcvSendReply();
}
}
@@ -548,3 +564,51 @@ XLogWalRcvFlush(void)
}
}
}
+
+/*
+ * Send reply message to primary, indicating our current XLOG positions and
+ * the current time.
+ */
+static void
+XLogWalRcvSendReply(void)
+{
+ TimestampTz now;
+
+ /*
+ * If the user doesn't want status to be reported to the master, be sure
+ * to exit before doing anything at all.
+ */
+ if (wal_receiver_status_interval <= 0)
+ return;
+
+ /* Get current timestamp. */
+ now = GetCurrentTimestamp();
+
+ /*
+ * We can compare the write and flush positions to the last message we sent
+ * without taking any lock, but the apply position requires a spin lock, so
+ * we don't check that unless something else has changed or 10 seconds have
+ * passed. This means that the apply log position will appear, from the
+ * master's point of view, to lag slightly, but since this is only for
+ * reporting purposes and only on idle systems, that's probably OK.
+ */
+ if (XLByteEQ(reply_message.write, LogstreamResult.Write)
+ && XLByteEQ(reply_message.flush, LogstreamResult.Flush)
+ && !TimestampDifferenceExceeds(reply_message.sendTime, now,
+ wal_receiver_status_interval * 1000))
+ return;
+
+ /* Construct a new message. */
+ reply_message.write = LogstreamResult.Write;
+ reply_message.flush = LogstreamResult.Flush;
+ reply_message.apply = GetXLogReplayRecPtr();
+ reply_message.sendTime = now;
+
+ elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
+ reply_message.write.xlogid, reply_message.write.xrecoff,
+ reply_message.flush.xlogid, reply_message.flush.xrecoff,
+ reply_message.apply.xlogid, reply_message.apply.xrecoff);
+
+ /* Send it. */
+ walrcv_send((char *) &reply_message, sizeof(StandbyReplyMessage));
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 78963c1..fcb5a32 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -39,6 +39,7 @@
#include "funcapi.h"
#include "access/xlog_internal.h"
+#include "access/transam.h"
#include "catalog/pg_type.h"
#include "libpq/libpq.h"
#include "libpq/pqformat.h"
@@ -51,6 +52,7 @@
#include "storage/fd.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/proc.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -106,9 +108,10 @@ static void InitWalSnd(void);
static void WalSndHandshake(void);
static void WalSndKill(int code, Datum arg);
static bool XLogSend(char *msgbuf, bool *caughtup);
-static void CheckClosedConnection(void);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd * cmd);
+static void ProcessStandbyReplyMessage(void);
+static void ProcessRepliesIfAny(void);
/* Main entry point for walsender process */
@@ -173,7 +176,7 @@ WalSenderMain(void)
static void
WalSndHandshake(void)
{
- StringInfoData input_message;
+ static StringInfoData input_message;
bool replication_started = false;
initStringInfo(&input_message);
@@ -442,7 +445,7 @@ HandleReplicationCommand(const char *cmd_string)
* Check if the remote end has closed the connection.
*/
static void
-CheckClosedConnection(void)
+ProcessRepliesIfAny(void)
{
unsigned char firstchar;
int r;
@@ -466,6 +469,13 @@ CheckClosedConnection(void)
switch (firstchar)
{
/*
+ * 'd' means a standby reply wrapped in a COPY BOTH packet.
+ */
+ case 'd':
+ ProcessStandbyReplyMessage();
+ break;
+
+ /*
* 'X' means that the standby is closing down the socket.
*/
case 'X':
@@ -479,6 +489,54 @@ CheckClosedConnection(void)
}
}
+/*
+ * Receive StandbyReplyMessage. False if message send failed.
+ */
+static void
+ProcessStandbyReplyMessage(void)
+{
+ static StringInfoData input_message;
+ StandbyReplyMessage reply;
+
+ initStringInfo(&input_message);
+
+ /*
+ * Read the message contents.
+ */
+ if (pq_getmessage(&input_message, 0))
+ {
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected EOF on standby connection")));
+ proc_exit(0);
+ }
+
+ pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
+
+ elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X ",
+ reply.write.xlogid, reply.write.xrecoff,
+ reply.flush.xlogid, reply.flush.xrecoff,
+ reply.apply.xlogid, reply.apply.xrecoff);
+
+ /*
+ * Update shared state for this WalSender process
+ * based on reply data from standby.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ if (XLByteLT(walsnd->write, reply.write))
+ walsnd->write = reply.write;
+ if (XLByteLT(walsnd->flush, reply.flush))
+ walsnd->flush = reply.flush;
+ if (XLByteLT(walsnd->apply, reply.apply))
+ walsnd->apply = reply.apply;
+ SpinLockRelease(&walsnd->mutex);
+ }
+}
+
/* Main loop of walsender process */
static int
WalSndLoop(void)
@@ -518,6 +576,7 @@ WalSndLoop(void)
{
if (!XLogSend(output_message, &caughtup))
break;
+ ProcessRepliesIfAny();
if (caughtup)
walsender_shutdown_requested = true;
}
@@ -561,9 +620,6 @@ WalSndLoop(void)
WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
WalSndDelay * 1000L);
}
-
- /* Check if the connection was closed */
- CheckClosedConnection();
}
else
{
@@ -574,6 +630,7 @@ WalSndLoop(void)
/* Update our state to indicate if we're behind or not */
WalSndSetState(caughtup ? WALSNDSTATE_STREAMING : WALSNDSTATE_CATCHUP);
+ ProcessRepliesIfAny();
}
/*
@@ -1104,7 +1161,7 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 3
+#define PG_STAT_GET_WAL_SENDERS_COLS 6
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -1141,8 +1198,11 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];
- char sent_location[MAXFNAMELEN];
+ char location[MAXFNAMELEN];
XLogRecPtr sentPtr;
+ XLogRecPtr write;
+ XLogRecPtr flush;
+ XLogRecPtr apply;
WalSndState state;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -1153,13 +1213,14 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
SpinLockAcquire(&walsnd->mutex);
sentPtr = walsnd->sentPtr;
state = walsnd->state;
+ write = walsnd->write;
+ flush = walsnd->flush;
+ apply = walsnd->apply;
SpinLockRelease(&walsnd->mutex);
- snprintf(sent_location, sizeof(sent_location), "%X/%X",
- sentPtr.xlogid, sentPtr.xrecoff);
-
memset(nulls, 0, sizeof(nulls));
values[0] = Int32GetDatum(walsnd->pid);
+
if (!superuser())
{
/*
@@ -1168,11 +1229,35 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
nulls[1] = true;
nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
}
else
{
values[1] = CStringGetTextDatum(WalSndGetStateString(state));
- values[2] = CStringGetTextDatum(sent_location);
+
+ snprintf(location, sizeof(location), "%X/%X",
+ sentPtr.xlogid, sentPtr.xrecoff);
+ values[2] = CStringGetTextDatum(location);
+
+ if (write.xlogid == 0 && write.xrecoff == 0)
+ nulls[4] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ write.xlogid, write.xrecoff);
+ values[3] = CStringGetTextDatum(location);
+
+ if (flush.xlogid == 0 && flush.xrecoff == 0)
+ nulls[5] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ flush.xlogid, flush.xrecoff);
+ values[4] = CStringGetTextDatum(location);
+
+ if (apply.xlogid == 0 && apply.xrecoff == 0)
+ nulls[6] = true;
+ snprintf(location, sizeof(location), "%X/%X",
+ apply.xlogid, apply.xrecoff);
+ values[5] = CStringGetTextDatum(location);
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 216236b..5ede280 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -55,6 +55,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
+#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/standby.h"
@@ -1755,6 +1756,16 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"wal_receiver_status_interval", PGC_SIGHUP, WAL_STANDBY_SERVERS,
+ gettext_noop("Sets the maximum interval between WAL receiver status reports to the master."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &wal_receiver_status_interval,
+ 10, 0, INT_MAX/1000, NULL, NULL
+ },
+
+ {
{"checkpoint_segments", PGC_SIGHUP, WAL_CHECKPOINTS,
gettext_noop("Sets the maximum distance in log segments between automatic WAL checkpoints."),
NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index fe80c4d..1b02aa0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -194,6 +194,7 @@
# - Standby Servers -
+#wal_receiver_status_interval = 10s # replies at least this often, 0 disables
#hot_standby = off # "on" allows queries during recovery
# (change requires restart)
#max_standby_archive_delay = 30s # max delay before canceling queries
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 122e96b..352b9a4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -290,6 +290,7 @@ extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
extern bool RecoveryInProgress(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
+extern XLogRecPtr GetXLogReplayRecPtr(void);
extern void UpdateControlFile(void);
extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f8b5d4d..f842a50 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3075,7 +3075,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,23}" "{i,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25}" "{o,o,o}" "{procpid,state,sent_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25}" "{o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 1993851..c69ca9d 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -40,6 +40,40 @@ typedef struct
} WalDataMessageHeader;
/*
+ * Reply message from standby (message type 'r'). This is wrapped within
+ * a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef struct
+{
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to offer
+ * a valid reply for data that has only been written, not fsynced.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side does not choose to offer
+ * a synchronous replication reply service, or is unable to.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side does not support apply,
+ * or does not choose to apply records, as yet.
+ */
+ XLogRecPtr apply;
+
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} StandbyReplyMessage;
+
+/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
*
* We don't have a good idea of what a good value would be; there's some
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 24ad438..aa5bfb7 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -17,6 +17,7 @@
#include "pgtime.h"
extern bool am_walreceiver;
+extern int wal_receiver_status_interval;
/*
* MAXCONNINFO: maximum size of a connection string.
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 9a196ab..abee380 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -35,13 +35,34 @@ typedef struct WalSnd
WalSndState state; /* this walsender's state */
XLogRecPtr sentPtr; /* WAL has been sent up to this point */
- slock_t mutex; /* locks shared variables shown above */
+ /*
+ * The xlog location that has been written to WAL file by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr write;
+
+ /*
+ * The xlog location that has been fsynced onto disk by standby-side.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr flush;
+
+ /*
+ * The xlog location that has been applied by standby Startup process.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ XLogRecPtr apply;
/*
* Latch used by backends to wake up this walsender when it has work
* to do.
*/
Latch latch;
+
+ /*
+ * Locks shared variables shown above.
+ */
+ slock_t mutex;
} WalSnd;
/* There is one WalSndCtl struct for the whole database cluster */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 72e5630..1dbd1e5 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1296,7 +1296,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sent_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Tue, Feb 8, 2011 at 19:53, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2011 at 1:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Here's the latest patch for sync rep.
Here is a rebased version of this patch which applies to head of the
master branch. I haven't tested it yet beyond making sure that it
compiles and passes the regression tests -- but this fixes the bitrot.As I mentioned yesterday that I would do, I spent some time working on
this. I think that there are somewhere between three and six
independently useful features in this patch, plus a few random changes
to the documentation that I'm not sure whether want or not (e.g.
replacing master by primary in a few places, or the other way around).One problem with the core synchronous replication technology is that
walreceiver cannot both receive WAL and write WAL at the same time.
It switches back and forth between reading WAL from the network socket
and flushing it to disk. The impact of that is somewhat mitigated in
the current patch because it only implements the "fsync" level of
replication, and chances are that the network read time is small
compared to the fsync time. But it would certainly suck for the
"receive" level we've talked about having in the past, because after
receiving each batch of WAL, the WAL receiver wouldn't be able to send
any more acknowledgments until the fsync completed, and that's bound
to be slow. I'm not really sure how bad it will be in "fsync" mode;
it may be tolerable, but as Simon noted in a comment, in the long run
it'd certainly be nicer to have the WAL writer process running during
recovery.As a general comment on the quality of the code, I think that the
overall logic is probably sound, but there are an awful lot of
debugging leftovers and inconsistencies between various parts of the
patch. For example, when I initially tested it, *asynchronous*
replication kept breaking between the master and the standby, and I
couldn't figure out why. I finally realized that there was a ten
second pause that had been inserting into the WAL receiver loop as a
debugging tool which was allowing the standby to get far enough behind
that the master was able to recycle WAL segments the standby still
needed. Under ordinary circumstances, I would say that a patch like
this was not mature enough to submit for review, let alone commit.
For that reason, I am pretty doubtful about the chances of getting
this finished for 9.1 without some substantial prolongation of the
schedule.That having been said, there is at least one part of this patch which
looks to be in pretty good shape and seems independently useful
regardless of what happens to the rest of it, and that is the code
that sends replies from the standby back to the primary. This allows
pg_stat_replication to display the write/flush/apply log positions on
the standby next to the sent position on the primary, which as far as
I am concerned is pure gold. Simon had this set up to happen only
when synchronous replication or XID feedback in use, but I think
people are going to want it even with plain old asynchronous
replication, because it provides a FAR easier way to monitor standby
lag than anything we have today. I've extracted this portion of the
patch, cleaned it up a bit, written docs, and attached it here.
+1. I haven't actually looked at the patch, but having this ability
would be *great*.
I also agree with the general idea of trying to break it into smaller
parts - even if they only provide small parts each on it's own. That
also makes it easier to get an overview of exactly how much is left,
to see where to focus.
The only real complaint I can imagine about offering this
functionality all the time is that it uses extra bandwidth. I'm
inclined to think that the ability to shut it off completely is
sufficient answer to that complaint.
Yes, agreed.
I would usually not worry about the bandwidth, really, I'd be more
worried about potentially increasing latency somewhere.
<dons asbestos underwear>
The ones with little rocketships on them?
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Tue, Feb 8, 2011 at 2:34 PM, Magnus Hagander <magnus@hagander.net> wrote:
I would usually not worry about the bandwidth, really, I'd be more
worried about potentially increasing latency somewhere.
The time to read and write the socket doesn't seem like it should be
significant, unless the network buffers fill up.... or I'm missing
something.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 8, 2011 at 2:34 PM, Magnus Hagander <magnus@hagander.net> wrote:
I also agree with the general idea of trying to break it into smaller
parts - even if they only provide small parts each on it's own. That
also makes it easier to get an overview of exactly how much is left,
to see where to focus.
And on that note, here's the rest of the patch back, rebased over what
I posted ~90 minutes ago.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
syncrep-v9.2.patchapplication/octet-stream; name=syncrep-v9.2.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 63c6283..726c9c0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2029,8 +2029,122 @@ SET ENABLE_SEQSCAN TO OFF;
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
+ <para>
+ You should also consider setting <varname>hot_standby_feedback</>
+ as an alternative to using this parameter.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until the
+ first reply from any standby. Multiple standby servers allow
+ increased availability and possibly increase performance as well.
+ </para>
+ <para>
+ The parameter must be set on both primary and standby.
+ </para>
+ <para>
+ On the primary, this parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ <para>
+ On the standby, the parameter value is taken only at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-allow-standalone-primary" xreflabel="allow_standalone_primary">
+ <term><varname>allow_standalone_primary</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>allow_standalone_primary</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If <varname>allow_standalone_primary</> is set, then the server
+ can operate normally whether or not replication is active. If
+ a client requests <varname>synchronous_replication</> and it is
+ not available, they will use asynchornous replication instead.
+ </para>
+ <para>
+ If <varname>allow_standalone_primary</> is not set, then the server
+ will prevent normal client connections until a standby connects that
+ has <varname>synchronous_replication_feedback</> enabled. Once
+ clients connect, if they request <varname>synchronous_replication</>
+ and it is no longer available they will wait for
+ <varname>replication_timeout_client</>.
+ </para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-replication-timeout-client" xreflabel="replication_timeout_client">
+ <term><varname>replication_timeout_client</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_client</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and a synchronous standby is currently available
+ then the commit will wait for up to <varname>replication_timeout_client</>
+ seconds before it returns a <quote>success</>. The commit will wait
+ forever for a confirmation when <varname>replication_timeout_client</>
+ is set to -1.
+ </para>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and yet no synchronous standby is available when we commit, then the
+ setting of <varname>allow_standalone_primary</> determines whether
+ or not we wait.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-replication-timeout-server" xreflabel="replication_timeout_server">
+ <term><varname>replication_timeout_server</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_server</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the primary server does not receive a reply from a standby server
+ within <varname>replication_timeout_server</> seconds then the
+ primary will terminate the replication connection.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
@@ -2121,6 +2235,42 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
+ <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>hot_standby_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether or not a hot standby will send feedback to the primary
+ about queries currently executing on the standby. This parameter can
+ be used to eliminate query cancels caused by cleanup records, though
+ it can cause database bloat on the primary for some workloads.
+ The default value is <literal>off</literal>.
+ This parameter can only be set at server start. It only has effect
+ if <varname>hot_standby</> is enabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replication-feedback" xreflabel="synchronous_replication_feedback">
+ <term><varname>synchronous_replication_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether the standby will provide reply messages to
+ allow synchronous replication on the primary.
+ Reasons for doing this might be that the standby is physically
+ co-located with the primary and so would be a bad choice as a
+ future primary server, or the standby might be a test server.
+ The default value is <literal>on</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
</sect1>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a892969..c006f35 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -738,13 +738,12 @@ archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
</para>
<para>
- Streaming replication is asynchronous, so there is still a small delay
+ There is a small replication delay
between committing a transaction in the primary and for the changes to
become visible in the standby. The delay is however much smaller than with
file-based log shipping, typically under one second assuming the standby
is powerful enough to keep up with the load. With streaming replication,
- <varname>archive_timeout</> is not required to reduce the data loss
- window.
+ <varname>archive_timeout</> is not required.
</para>
<para>
@@ -879,6 +878,236 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover. That could be zero, or more, we do not know for certain
+ either way, when using asynchronous replication.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to at least one remote
+ standby server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ Synchronous replication works in the following way. When requested,
+ the commit of a write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability if the
+ sysadmin is cautious about the placement and management of the two servers.
+ Waiting for confirmation increases the user's confidence that the changes
+ will not be lost in the event of server crashes but it also necessarily
+ increases the response time for the requesting transaction. The minimum
+ wait time is the roundtrip time between primary to standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only final top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ Synchronous replication will be active if appropriate options are
+ enabled on both the primary and at least one standby server. If
+ options are not correctly set on both servers, the primary will use
+ use asynchronous replication by default.
+ </para>
+
+ <para>
+ On the primary server we need to set
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+ and on the standby server we need to set
+
+<programlisting>
+synchronous_replication_feedback = on
+</programlisting>
+
+ On the primary, <varname>synchronous_replication</> can be set
+ for particular users or databases, or dynamically by applications
+ programs. On the standby, <varname>synchronous_replication_feedback</>
+ can only be set at server start.
+ </para>
+
+ <para>
+ If more than one standby server
+ specifies <varname>synchronous_replication_feedback</>, then whichever
+ standby replies first will release waiting commits.
+ Turning this setting off for a standby allows the administrator to
+ exclude certain standby servers from releasing waiting transactions.
+ This is useful if not all standby servers are designated as potential
+ future primary servers, such as if a standby were co-located
+ with the primary, so that a disaster would cause both servers to be lost.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ The easiest and safest method of gaining High Availability using
+ synchronous replication is to configure at least two standby servers.
+ To understand why, we need to examine what can happen when you lose all
+ standby servers.
+ </para>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until at
+ least one standby responds. The response may never occur if the last,
+ or only, standby should crash or the network drops. What should we do in
+ that situation?
+ </para>
+
+ <para>
+ Sitting and waiting will typically cause operational problems
+ because it is an effective outage of the primary server should all
+ sessions end up waiting. In contrast, allowing the primary server to
+ continue processing write transactions in the absence of a standby
+ puts those latest data changes at risk. So in this situation there
+ is a direct choice between database availability and the potential
+ durability of the data it contains. How we handle this situation
+ is controlled by <varname>allow_standalone_primary</>. The default
+ setting is <literal>on</>, allowing processing to continue, though
+ there is no recommended setting. Choosing the best setting for
+ <varname>allow_standalone_primary</> is a difficult decision and best
+ left to those with combined business responsibility for both data and
+ applications. The difficulty of this choice is the reason why we
+ recommend that you reduce the possibility of this situation occurring
+ by using multiple standby servers.
+ </para>
+
+ <para>
+ A user will stop waiting once the <varname>replication_timeout_client</>
+ has been reached for their specific session. Users are not waiting for
+ a specific standby to reply, they are waiting for a reply from any
+ standby, so the unavailability of any one standby is not significant
+ to a user. It is possible for user sessions to hit timeout even though
+ standbys are communicating normally. In that case, the setting of
+ <varname>replication_timeout</> is probably too low.
+ </para>
+
+ <para>
+ The standby sends regular status messages to the primary. If no status
+ messages have been received for <varname>replication_timeout_server</>
+ the primary server will assume the connection is dead and terminate it.
+ </para>
+
+ <para>
+ When the primary is started with <varname>allow_standalone_primary</>
+ enabled, the primary will not allow connections until a standby connects
+ that also has <varname>synchronous_replication</> enabled. This is a
+ convenience to ensure that we don't allow connections before write
+ transactions will return successfully.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it may not be properly
+ synchronized. The standby is only able to become a synchronous standby
+ once it has become synchronized, or "caught up" with the the primary.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shutdown, then the catch-up period
+ will increase according to the length of time the standby has been
+ down. You are advised to make sure <varname>allow_standalone_primary</>
+ is not set during the initial catch-up period.
+ </para>
+
+ <para>
+ If primary crashes while commits are waiting for acknowledgement, those
+ transactions will be marked fully committed if the primary database
+ recovers, no matter how <varname>allow_standalone_primary</> is set.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby. Hence this mechanism is technically
+ "semi synchronous" rather than "fully synchronous" replication. Note
+ that replication still not be fully synchronous even if we wait for
+ all standby servers, though this would reduce availability, as
+ described previously.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1393,11 +1622,18 @@ if (!triggered)
These conflicts are <emphasis>hard conflicts</> in the sense that queries
might need to be cancelled and, in some cases, sessions disconnected to resolve them.
The user is provided with several ways to handle these
- conflicts. Conflict cases include:
+ conflicts. Conflict cases in order of likely frequency are:
<itemizedlist>
<listitem>
<para>
+ Application of a vacuum cleanup record from WAL conflicts with
+ standby transactions whose snapshots can still <quote>see</> any of
+ the rows to be removed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Access Exclusive locks taken on the primary server, including both
explicit <command>LOCK</> commands and various <acronym>DDL</>
actions, conflict with table accesses in standby queries.
@@ -1417,14 +1653,8 @@ if (!triggered)
</listitem>
<listitem>
<para>
- Application of a vacuum cleanup record from WAL conflicts with
- standby transactions whose snapshots can still <quote>see</> any of
- the rows to be removed.
- </para>
- </listitem>
- <listitem>
- <para>
- Application of a vacuum cleanup record from WAL conflicts with
+ Buffer pin deadlock caused by
+ application of a vacuum cleanup record from WAL conflicts with
queries accessing the target page on the standby, whether or not
the data to be removed is visible.
</para>
@@ -1539,17 +1769,16 @@ if (!triggered)
<para>
Remedial possibilities exist if the number of standby-query cancellations
- is found to be unacceptable. The first option is to connect to the
- primary server and keep a query active for as long as needed to
- run queries on the standby. This prevents <command>VACUUM</> from removing
- recently-dead rows and so cleanup conflicts do not occur.
- This could be done using <xref linkend="dblink"> and
- <function>pg_sleep()</>, or via other mechanisms. If you do this, you
+ is found to be unacceptable. Typically the best option is to enable
+ <varname>hot_standby_feedback</>. This prevents <command>VACUUM</> from
+ removing recently-dead rows and so cleanup conflicts do not occur.
+ If you do this, you
should note that this will delay cleanup of dead rows on the primary,
which may result in undesirable table bloat. However, the cleanup
situation will be no worse than if the standby queries were running
- directly on the primary server, and you are still getting the benefit of
- off-loading execution onto the standby.
+ directly on the primary server. You are still getting the benefit
+ of off-loading execution onto the standby and the query may complete
+ faster than it would have done on the primary server.
<varname>max_standby_archive_delay</> must be kept large in this case,
because delayed WAL files might already contain entries that conflict with
the desired standby queries.
@@ -1563,7 +1792,8 @@ if (!triggered)
a high <varname>max_standby_streaming_delay</>. However it is
difficult to guarantee any specific execution-time window with this
approach, since <varname>vacuum_defer_cleanup_age</> is measured in
- transactions executed on the primary server.
+ transactions executed on the primary server. As of version 9.1, this
+ second option is much less likely to valuable.
</para>
<para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 287ad26..eb3cd6f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -56,6 +56,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/predicate.h"
#include "storage/procarray.h"
@@ -2030,6 +2031,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a0170b4..1da42c9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -54,6 +55,7 @@
#include "utils/snapmgr.h"
#include "pg_trace.h"
+extern void WalRcvWakeup(void); /* we are only caller, so include directly */
/*
* User-tweakable parameters
@@ -1055,7 +1057,7 @@ RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
- if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
+ if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
@@ -1125,6 +1127,14 @@ RecordTransactionCommit(void)
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
@@ -4533,6 +4543,14 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
*/
if (XactCompletionForceSyncCommit(xlrec))
XLogFlush(lsn);
+
+ /*
+ * If this standby is offering sync_rep_service then signal WALReceiver,
+ * in case it needs to send a reply just for this commit on an
+ * otherwise quiet server.
+ */
+ if (sync_rep_service)
+ WalRcvWakeup();
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d2432ce..84a802f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -157,6 +158,11 @@ static XLogRecPtr LastRec;
* known, need to check the shared state".
*/
static bool LocalRecoveryInProgress = true;
+/*
+ * Local copy of SharedHotStandbyActive variable. False actually means "not
+ * known, need to check the shared state".
+ */
+static bool LocalHotStandbyActive = false;
/*
* Local state for XLogInsertAllowed():
@@ -405,6 +411,12 @@ typedef struct XLogCtlData
bool SharedRecoveryInProgress;
/*
+ * SharedHotStandbyActive indicates if we're still in crash or archive
+ * recovery. Protected by info_lck.
+ */
+ bool SharedHotStandbyActive;
+
+ /*
* recoveryWakeupLatch is used to wake up the startup process to
* continue WAL replay, if it is waiting for WAL to arrive or failover
* trigger file to appear.
@@ -4915,6 +4927,7 @@ XLOGShmemInit(void)
*/
XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
XLogCtl->SharedRecoveryInProgress = true;
+ XLogCtl->SharedHotStandbyActive = false;
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
@@ -5285,6 +5298,12 @@ readRecoveryCommandFile(void)
(errmsg("recovery command file \"%s\" specified neither primary_conninfo nor restore_command",
RECOVERY_COMMAND_FILE),
errhint("The database server will regularly poll the pg_xlog subdirectory to check for files placed there.")));
+
+ if (PrimaryConnInfo == NULL && sync_rep_service)
+ ereport(WARNING,
+ (errmsg("recovery command file \"%s\" specified synchronous_replication_service yet streaming was not requested",
+ RECOVERY_COMMAND_FILE),
+ errhint("Specify primary_conninfo to allow synchronous replication.")));
}
else
{
@@ -6159,6 +6178,13 @@ StartupXLOG(void)
if (XLByteLT(ControlFile->minRecoveryPoint, checkPoint.redo))
ControlFile->minRecoveryPoint = checkPoint.redo;
}
+ else
+ {
+ /*
+ * No need to calculate feedback if we're not in Hot Standby.
+ */
+ hot_standby_feedback = false;
+ }
/*
* set backupStartupPoint if we're starting archive recovery from a
@@ -6778,8 +6804,6 @@ StartupXLOG(void)
static void
CheckRecoveryConsistency(void)
{
- static bool backendsAllowed = false;
-
/*
* Have we passed our safe starting point?
*/
@@ -6799,11 +6823,19 @@ CheckRecoveryConsistency(void)
* enabling connections.
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
- !backendsAllowed &&
+ !LocalHotStandbyActive &&
reachedMinRecoveryPoint &&
IsUnderPostmaster)
{
- backendsAllowed = true;
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->SharedHotStandbyActive = true;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ LocalHotStandbyActive = true;
+
SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
}
}
@@ -6851,6 +6883,38 @@ RecoveryInProgress(void)
}
/*
+ * Is HotStandby active yet? This is only important in special backends
+ * since normal backends won't ever be able to connect until this returns
+ * true.
+ *
+ * Unlike testing standbyState, this works in any process that's connected to
+ * shared memory.
+ */
+bool
+HotStandbyActive(void)
+{
+ /*
+ * We check shared state each time only until Hot Standby is active. We
+ * can't de-activate Hot Standby, so there's no need to keep checking after
+ * the shared variable has once been seen true.
+ */
+ if (LocalHotStandbyActive)
+ return true;
+ else
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ /* spinlock is essential on machines with weak memory ordering! */
+ SpinLockAcquire(&xlogctl->info_lck);
+ LocalHotStandbyActive = xlogctl->SharedHotStandbyActive;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return LocalHotStandbyActive;
+ }
+}
+
+/*
* Is this process allowed to insert new WAL records?
*
* Ordinarily this is essentially equivalent to !RecoveryInProgress().
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 40e94ba..506e908 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -502,6 +502,7 @@ CREATE VIEW pg_stat_replication AS
S.client_port,
S.backend_start,
W.state,
+ W.sync,
W.sent_location,
W.write_location,
W.flush_location,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8f77d1b..1577875 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -275,6 +275,7 @@ typedef enum
PM_STARTUP, /* waiting for startup subprocess */
PM_RECOVERY, /* in archive recovery mode */
PM_HOT_STANDBY, /* in hot standby mode */
+ PM_WAIT_FOR_REPLICATION, /* waiting for sync replication to become active */
PM_RUN, /* normal "database is alive" state */
PM_WAIT_BACKUP, /* waiting for online backup mode to end */
PM_WAIT_READONLY, /* waiting for read only backends to exit */
@@ -735,6 +736,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\" or \"hot_standby\"")));
+ if (!allow_standalone_primary && max_wal_senders == 0)
+ ereport(ERROR,
+ (errmsg("WAL streaming (max_wal_senders > 0) is required if allow_standalone_primary = off")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1845,6 +1849,12 @@ retry1:
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is in recovery mode")));
break;
+ case CAC_REPLICATION_ONLY:
+ if (!am_walsender)
+ ereport(FATAL,
+ (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ errmsg("the database system is waiting for replication to start")));
+ break;
case CAC_TOOMANY:
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
@@ -1942,7 +1952,9 @@ canAcceptConnections(void)
*/
if (pmState != PM_RUN)
{
- if (pmState == PM_WAIT_BACKUP)
+ if (pmState == PM_WAIT_FOR_REPLICATION)
+ result = CAC_REPLICATION_ONLY; /* allow replication only */
+ else if (pmState == PM_WAIT_BACKUP)
result = CAC_WAITBACKUP; /* allow superusers only */
else if (Shutdown > NoShutdown)
return CAC_SHUTDOWN; /* shutdown is pending */
@@ -2396,8 +2408,13 @@ reaper(SIGNAL_ARGS)
* Startup succeeded, commence normal operations
*/
FatalError = false;
- ReachedNormalRunning = true;
- pmState = PM_RUN;
+ if (allow_standalone_primary)
+ {
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+ else
+ pmState = PM_WAIT_FOR_REPLICATION;
/*
* Crank up the background writer, if we didn't do that already
@@ -3233,8 +3250,8 @@ BackendStartup(Port *port)
/* Pass down canAcceptConnections state */
port->canAcceptConnections = canAcceptConnections();
bn->dead_end = (port->canAcceptConnections != CAC_OK &&
- port->canAcceptConnections != CAC_WAITBACKUP);
-
+ port->canAcceptConnections != CAC_WAITBACKUP &&
+ port->canAcceptConnections != CAC_REPLICATION_ONLY);
/*
* Unless it's a dead_end child, assign it a child slot number
*/
@@ -4284,6 +4301,16 @@ sigusr1_handler(SIGNAL_ARGS)
WalReceiverPID = StartWalReceiver();
}
+ if (CheckPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE) &&
+ pmState == PM_WAIT_FOR_REPLICATION)
+ {
+ /* Allow connections now that a synchronous replication standby
+ * has successfully connected and is active.
+ */
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+
PG_SETMASK(&UnBlockSig);
errno = save_errno;
@@ -4534,6 +4561,7 @@ static void
StartAutovacuumWorker(void)
{
Backend *bn;
+ CAC_state cac = CAC_OK;
/*
* If not in condition to run a process, don't try, but handle it like a
@@ -4542,7 +4570,8 @@ StartAutovacuumWorker(void)
* we have to check to avoid race-condition problems during DB state
* changes.
*/
- if (canAcceptConnections() == CAC_OK)
+ cac = canAcceptConnections();
+ if (cac == CAC_OK || cac == CAC_REPLICATION_ONLY)
{
bn = (Backend *) malloc(sizeof(Backend));
if (bn)
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 42c6eaf..3fe490e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
- repl_gram.o
+ repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 9c2e0d8..7387224 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -1,5 +1,27 @@
src/backend/replication/README
+Overview
+--------
+
+The WALSender sends WAL data and receives replies. The WALReceiver
+receives WAL data and sends replies.
+
+If there is no more WAL data to send then WALSender goes quiet,
+apart from checking for replies. If there is no more WAL data
+to receive then WALReceiver keeps sending replies until all the data
+received has been applied, then it too goes quiet. When all is quiet
+WALReceiver sends regular replies so that WALSender knows the link
+is still working - we don't want to wait until a transaction
+arrives before we try to determine the health of the connection.
+
+WALReceiver sends one reply per message received. If nothing is
+received it sends one reply every time apply pointer advances,
+with a minimum of one reply each cycletime.
+
+For synchronous replication, all decisions about whether to wait
+and how long to wait are taken on the primary. The standby has no
+state information about what is happening on the primary.
+
Walreceiver - libpqwalreceiver API
----------------------------------
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
new file mode 100644
index 0000000..12a3825
--- /dev/null
+++ b/src/backend/replication/syncrep.c
@@ -0,0 +1,641 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the standby, or the wait hits timeout.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary is aware of which
+ * standby servers offer a synchronisation service. The standby is
+ * completely unaware of the durability requirements of transactions
+ * on the primary, reducing the complexity of the code and streamlining
+ * both standby operations and network bandwidth because there is no
+ * requirement to ship per-transaction state information.
+ *
+ * The bookeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then it follows exactly one rigid definition of
+ * synchronous replication as laid out by the various parameters. If we
+ * change the definition of replication, we'll need to scan through all
+ * waiting backends to see if we should now release them.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching the through all waiters each time we receive a reply.
+ *
+ * Starting sync replication is a two stage process. First, the standby
+ * must have caught up with the primary; that may take some time. Next,
+ * we must receive a reply from the standby before we change state so
+ * that sync rep is fully active and commits can wait on us.
+ *
+ * XXX Changing state to a sync rep service while we are running allows
+ * us to enable sync replication via SIGHUP on the standby at a later
+ * time, without restart, if we need to do that. Though you can't turn
+ * it off without disconnecting.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/guc_tables.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+
+/* User-settable parameters for sync rep */
+bool sync_rep_mode = false; /* Only set in user backends */
+int sync_rep_timeout_client = 120; /* Only set in user backends */
+int sync_rep_timeout_server = 30; /* Only set in user backends */
+bool sync_rep_service = false; /* Never set in user backends */
+bool hot_standby_feedback = true;
+
+/*
+ * Queuing code is written to allow later extension to multiple
+ * queues. Currently, we use just one queue (==FSYNC).
+ *
+ * XXX We later expect to have RECV, FSYNC and APPLY modes.
+ */
+#define SYNC_REP_NOT_ON_QUEUE -1
+#define SYNC_REP_FSYNC 0
+#define IsOnSyncRepQueue() (current_queue > SYNC_REP_NOT_ON_QUEUE)
+/*
+ * Queue identifier of the queue on which user backend currently waits.
+ */
+static int current_queue = SYNC_REP_NOT_ON_QUEUE;
+
+static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid);
+static void SyncRepRemoveFromQueue(void);
+static void SyncRepAddToQueue(int qid);
+static bool SyncRepServiceAvailable(void);
+static long SyncRepGetWaitTimeout(void);
+
+static void SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn);
+
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+/*
+ * Wait for synchronous replication, if requested by user.
+ */
+extern void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if user has requested async replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (max_wal_senders == 0 || !sync_rep_mode)
+ return;
+
+ Assert(sync_rep_mode);
+
+ if (allow_standalone_primary)
+ {
+ bool avail_sync_mode;
+
+ /*
+ * Check that the service level we want is available.
+ * If not, downgrade the service level to async.
+ */
+ avail_sync_mode = SyncRepServiceAvailable();
+
+ /*
+ * Perform the wait here, then drop through and exit.
+ */
+ if (avail_sync_mode)
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+ else
+ {
+ /*
+ * Wait only on the service level requested,
+ * whether or not it is currently available.
+ * Sounds weird, but this mode exists to protect
+ * against changes that will only occur on primary.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+}
+
+/*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+static void
+SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[0]);
+ TimestampTz now = GetCurrentTransactionStopTimestamp();
+ long timeout = SyncRepGetWaitTimeout(); /* seconds */
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+
+ /*
+ * No need to wait for autovacuums. If the standby does go away and
+ * we wait for it to return we may as well do some usefulwork locally.
+ * This is critical since we may need to perform emergency vacuuming
+ * and cannot wait for standby to return.
+ */
+ if (IsAutoVacuumWorkerProcess())
+ return;
+
+ ereport(DEBUG2,
+ (errmsg("synchronous replication waiting for %X/%X starting at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTransactionStopTimestamp()))));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ /*
+ * First time through, add ourselves to the appropriate queue.
+ */
+ if (!IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ SpinLockRelease(&queue->qlock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepAddToQueue(qid);
+ SpinLockRelease(&queue->qlock);
+ current_queue = qid; /* Remember which queue we're on */
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 21 + 1);
+ memcpy(new_status, old_status, len);
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting" */
+ }
+ else
+ {
+ bool release = false;
+ bool timeout = false;
+
+ SpinLockAcquire(&queue->qlock);
+
+ /*
+ * Check the LSN on our queue and if its moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ now,
+ timeout))
+ {
+ release = true;
+ timeout = true;
+ }
+
+ if (release)
+ {
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ /*
+ * Our response to the timeout is to simply post a NOTICE and
+ * then return to the user. The commit has happened, we just
+ * haven't been able to verify it has been replicated to the
+ * level requested.
+ *
+ * XXX We could check here to see if our LSN has been sent to
+ * another standby that offers a lower level of service. That
+ * could be true if we had, for example, requested 'apply'
+ * with two standbys, one at 'apply' and one at 'recv' and the
+ * apply standby has just gone down. Something for the weekend.
+ */
+ if (timeout)
+ ereport(NOTICE,
+ (errmsg("synchronous replication timeout at %s",
+ timestamptz_to_str(now))));
+ else
+ ereport(DEBUG2,
+ (errmsg("synchronous replication wait complete at %s",
+ timestamptz_to_str(now))));
+
+ /* XXX Do we need to unset the latch? */
+ return;
+ }
+
+ SpinLockRelease(&queue->qlock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, timeout);
+ now = GetCurrentTimestamp();
+ }
+}
+
+/*
+ * Remove myself from sync rep wait queue.
+ *
+ * Assume on queue at start; will not be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ *
+ * XXX Implements design pattern "Reinvent Wheel", think about changing
+ */
+void
+SyncRepRemoveFromQueue(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[current_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+
+ Assert(IsOnSyncRepQueue());
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "removing myself from queue %d", current_queue);
+#endif
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ if (proc == MyProc)
+ {
+ elog(LOG, "proc %d lsn %X/%X is MyProc",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ else
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ numprocs++;
+ }
+
+ proc = queue->head;
+
+ if (proc == MyProc)
+ {
+ if (MyProc->lwWaitLink == NULL)
+ {
+ /*
+ * We were the only waiter on the queue. Reset head and tail.
+ */
+ Assert(queue->tail == MyProc);
+ queue->head = NULL;
+ queue->tail = NULL;
+ }
+ else
+ /*
+ * Move head to next proc on the queue.
+ */
+ queue->head = MyProc->lwWaitLink;
+ }
+ else
+ {
+ while (proc->lwWaitLink != NULL)
+ {
+ /* Are we the next proc in our traversal of the queue? */
+ if (proc->lwWaitLink == MyProc)
+ {
+ /*
+ * Remove ourselves from middle of queue.
+ * No need to touch head or tail.
+ */
+ proc->lwWaitLink = MyProc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL)
+ elog(WARNING, "could not locate ourselves on wait queue");
+ proc = proc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL) /* At tail */
+ {
+ Assert(proc == MyProc);
+ /* Remove ourselves from tail of queue */
+ Assert(queue->tail == MyProc);
+ queue->tail = proc;
+ proc->lwWaitLink = NULL;
+ }
+ }
+ MyProc->lwWaitLink = NULL;
+ current_queue = SYNC_REP_NOT_ON_QUEUE;
+}
+
+/*
+ * Add myself to sync rep wait queue.
+ *
+ * Assume not on queue at start; will be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ */
+static void
+SyncRepAddToQueue(int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+ PGPROC *tail = queue->tail;
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "adding myself to queue %d", qid);
+#endif
+
+ /*
+ * Add myself to tail of wait queue.
+ */
+ if (tail == NULL)
+ {
+ queue->head = MyProc;
+ queue->tail = MyProc;
+ }
+ else
+ {
+ /*
+ * XXX extra code needed here to maintain sorted invariant.
+ * Our approach should be same as racing car - slow in, fast out.
+ */
+ Assert(tail->lwWaitLink == NULL);
+ tail->lwWaitLink = MyProc;
+ }
+ queue->tail = MyProc;
+
+ /*
+ * This used to be an Assert, but it keeps failing... why?
+ */
+ MyProc->lwWaitLink = NULL; /* to be sure */
+}
+
+/*
+ * Dynamically decide the sync rep wait mode. It may seem a trifle
+ * wasteful to do this for every transaction but we need to do this
+ * so we can cope sensibly with standby disconnections. It's OK to
+ * spend a few cycles here anyway, since while we're doing this the
+ * WALSender will be sending the data we want to wait for, so this
+ * is dead time and the user has requested to wait anyway.
+ */
+static bool
+SyncRepServiceAvailable(void)
+{
+ bool result = false;
+
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ result = WalSndCtl->sync_rep_service_available;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+
+ return result;
+}
+
+/*
+ * Allows more complex decision making about what the wait time should be.
+ */
+static long
+SyncRepGetWaitTimeout(void)
+{
+ if (sync_rep_timeout_client <= 0)
+ return -1L;
+
+ return 1000000L * sync_rep_timeout_client;
+}
+
+void
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+/*
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+
+ if (IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+ }
+*/
+
+ if (MyProc != NULL && MyProc->ownLatch)
+ {
+ DisownLatch(&MyProc->waitLatch);
+ MyProc->ownLatch = false;
+ }
+}
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+/*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+void
+SyncRepReleaseWaiters(bool timeout)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ int mode;
+
+ /*
+ * If we are now streaming, and haven't yet enabled the sync rep service
+ * do so now. We don't enable sync rep service during a base backup since
+ * during that action we aren't sending WAL at all, so there cannot be
+ * any meaningful replies. We don't enable sync rep service while we
+ * are still in catchup mode either, since clients might experience an
+ * extended wait (perhaps hours) if they waited at that point.
+ *
+ * Note that we do release waiters, even if they aren't enabled yet.
+ * That sounds strange, but we may have dropped the connection and
+ * reconnected, so there may still be clients waiting for a response
+ * from when we were connected previously.
+ *
+ * If we already have a sync rep server connected, don't enable
+ * this server as well.
+ *
+ * XXX expect to be able to support multiple sync standbys in future.
+ */
+ if (!MyWalSnd->sync_rep_service &&
+ MyWalSnd->state == WALSNDSTATE_STREAMING &&
+ !SyncRepServiceAvailable())
+ {
+ ereport(LOG,
+ (errmsg("enabling synchronous replication service for standby")));
+
+ /*
+ * Update state for this WAL sender.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sync_rep_service = true;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
+ /*
+ * We have at least one standby, so we're open for business.
+ */
+ {
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ WalSndCtl->sync_rep_service_available = true;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+ }
+
+ /*
+ * Let postmaster know we can allow connections, if the user
+ * requested waiting until sync rep was active before starting.
+ * We send this unconditionally to avoid more complexity in
+ * postmaster code.
+ */
+ if (IsUnderPostmaster)
+ SendPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE);
+ }
+
+ /*
+ * No point trying to release waiters while doing a base backup
+ */
+ if (MyWalSnd->state == WALSNDSTATE_BACKUP)
+ return;
+
+#ifdef SYNCREP_DEBUG
+ elog(LOG, "releasing waiters up to flush = %X/%X",
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+
+
+ /*
+ * Only maintain LSNs of queues for which we advertise a service.
+ * This is important to ensure that we only wakeup users when a
+ * preferred standby has reached the required LSN.
+ *
+ * Since sycnhronous_replication_mode is currently a boolean, we either
+ * offer all modes, or none.
+ */
+ for (mode = 0; mode < NUM_SYNC_REP_WAIT_MODES; mode++)
+ {
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[mode]);
+
+ /*
+ * Lock the queue. Not really necessary with just one sync standby
+ * but it makes clear what needs to happen.
+ */
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLT(queue->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ queue->lsn = MyWalSnd->flush;
+ SyncRepWakeFromQueue(mode, MyWalSnd->flush);
+ }
+ SpinLockRelease(&queue->qlock);
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "q%d queue = %X/%X flush = %X/%X", mode,
+ queue->lsn.xlogid, queue->lsn.xrecoff,
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+ }
+}
+
+/*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold spinlock on queue.
+ */
+static void
+SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[wait_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+ int totalprocs = 0;
+
+ if (proc == NULL)
+ return;
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+
+ if (XLByteLE(proc->waitLSN, lsn))
+ {
+ numprocs++;
+ SetLatch(&proc->waitLatch);
+ }
+ totalprocs++;
+ }
+ elog(DEBUG2, "released %d procs out of %d waiting procs", numprocs, totalprocs);
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "released %d procs up to %X/%X", numprocs, lsn.xlogid, lsn.xrecoff);
+#endif
+}
+
+void
+SyncRepTimeoutExceeded(void)
+{
+ SyncRepReleaseWaiters(true);
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 35cd121..1e37530 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -38,6 +38,7 @@
#include <signal.h>
#include <unistd.h>
+#include "access/transam.h"
#include "access/xlog_internal.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
@@ -45,6 +46,7 @@
#include "replication/walreceiver.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -85,9 +87,9 @@ static volatile sig_atomic_t got_SIGTERM = false;
*/
static struct
{
- XLogRecPtr Write; /* last byte + 1 written out in the standby */
- XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
-} LogstreamResult;
+ XLogRecPtr Write; /* last byte + 1 written out in the standby */
+ XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
+} LogstreamResult;
static StandbyReplyMessage reply_message;
@@ -208,6 +210,8 @@ WalReceiverMain(void)
/* Advertise our PID so that the startup process can kill us */
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_RUNNING;
+ elog(DEBUG2, "WALreceiver starting");
+ OwnLatch(&WalRcv->latch); /* Run before signals enabled, since they can wakeup latch */
/* Fetch information required to start streaming */
strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
@@ -275,6 +279,7 @@ WalReceiverMain(void)
unsigned char type;
char *buf;
int len;
+ bool received_all = false;
/*
* Emergency bailout if postmaster has died. This is to avoid the
@@ -300,24 +305,44 @@ WalReceiverMain(void)
ProcessConfigFile(PGC_SIGHUP);
}
- /* Wait a while for data to arrive */
- if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
+ ResetLatch(&WalRcv->latch);
+
+ if (walrcv_receive(0, &type, &buf, &len))
{
- /* Accept the received data, and process it */
+ received_all = false;
XLogWalRcvProcessMsg(type, buf, len);
+ }
+ else
+ received_all = true;
- /* Receive any more data we can without sleeping */
- while (walrcv_receive(0, &type, &buf, &len))
- XLogWalRcvProcessMsg(type, buf, len);
+ XLogWalRcvSendReply();
- /* Let the master know that we received some data. */
+ if (received_all && !got_SIGHUP && !got_SIGTERM)
+ {
+ /*
+ * Flush, then reply.
+ *
+ * XXX We really need the WALWriter active as well
+ */
+ XLogWalRcvFlush();
XLogWalRcvSendReply();
/*
- * If we've written some records, flush them to disk and let the
- * startup process know about them.
+ * Sleep for up to 500 ms, the fixed keepalive delay.
+ *
+ * We will be woken if new data is received from primary
+ * or if a commit is applied. This is sub-optimal in the
+ * case where a group of commits arrive, then it all goes
+ * quiet, but its not worth the extra code to handle both
+ * that and the simple case of a single commit.
+ *
+ * Note that we do not need to wake up when the Startup
+ * process has applied the last outstanding record. That
+ * is interesting iff that is a commit record.
*/
- XLogWalRcvFlush();
+ pg_usleep(1000000L); /* slow down loop for debugging */
+// WaitLatchOrSocket(&WalRcv->latch, MyProcPort->sock,
+// 500000L);
}
/*
@@ -350,6 +375,8 @@ WalRcvDie(int code, Datum arg)
walrcv->pid = 0;
SpinLockRelease(&walrcv->mutex);
+ DisownLatch(&WalRcv->latch);
+
/* Terminate the connection gracefully. */
if (walrcv_disconnect != NULL)
walrcv_disconnect();
@@ -360,6 +387,7 @@ static void
WalRcvSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ WalRcvWakeup();
}
/* SIGTERM: set flag for main loop, or shutdown immediately if safe */
@@ -367,6 +395,7 @@ static void
WalRcvShutdownHandler(SIGNAL_ARGS)
{
got_SIGTERM = true;
+ WalRcvWakeup();
/* Don't joggle the elbow of proc_exit */
if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
@@ -603,12 +632,26 @@ XLogWalRcvSendReply(void)
reply_message.flush = LogstreamResult.Flush;
reply_message.apply = GetXLogReplayRecPtr();
reply_message.sendTime = now;
+ if (hot_standby_feedback && HotStandbyActive())
+ reply_message.xmin = GetOldestXmin(true, false);
+ else
+ reply_message.xmin = InvalidTransactionId;
- elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
+ elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X xmin %d",
reply_message.write.xlogid, reply_message.write.xrecoff,
reply_message.flush.xlogid, reply_message.flush.xrecoff,
- reply_message.apply.xlogid, reply_message.apply.xrecoff);
+ reply_message.apply.xlogid, reply_message.apply.xrecoff,
+ reply_message.xmin);
/* Send it. */
walrcv_send((char *) &reply_message, sizeof(StandbyReplyMessage));
}
+
+/* Wake up the WalRcv
+ * Prototype goes in xact.c since that is only external caller
+ */
+void
+WalRcvWakeup(void)
+{
+ SetLatch(&WalRcv->latch);
+};
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 04c9004..da97528 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -64,6 +64,7 @@ WalRcvShmemInit(void)
MemSet(WalRcv, 0, WalRcvShmemSize());
WalRcv->walRcvState = WALRCV_STOPPED;
SpinLockInit(&WalRcv->mutex);
+ InitSharedLatch(&WalRcv->latch);
}
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fcb5a32..987cc90 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,7 +65,7 @@
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
-static WalSnd *MyWalSnd = NULL;
+WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
@@ -73,6 +73,7 @@ bool am_walsender = false; /* Am I a walsender process ? */
/* User-settable parameters for walsender */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
+bool allow_standalone_primary = true; /* action if no sync standby active */
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
@@ -89,6 +90,8 @@ static uint32 sendOff = 0;
*/
static XLogRecPtr sentPtr = {0, 0};
+static TimestampTz last_reply_timestamp;
+
/* Flags set by signal handlers for later service in main loop */
static volatile sig_atomic_t got_SIGHUP = false;
volatile sig_atomic_t walsender_shutdown_requested = false;
@@ -113,7 +116,6 @@ static void StartReplication(StartReplicationCmd * cmd);
static void ProcessStandbyReplyMessage(void);
static void ProcessRepliesIfAny(void);
-
/* Main entry point for walsender process */
int
WalSenderMain(void)
@@ -150,6 +152,8 @@ WalSenderMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
PG_SETMASK(&UnBlockSig);
+ elog(DEBUG2, "WALsender starting");
+
/* Tell the standby that walsender is ready for receiving commands */
ReadyForQuery(DestRemote);
@@ -166,6 +170,8 @@ WalSenderMain(void)
SpinLockRelease(&walsnd->mutex);
}
+ elog(DEBUG2, "WALsender handshake complete");
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -250,6 +256,11 @@ WalSndHandshake(void)
errmsg("invalid standby handshake message type %d", firstchar)));
}
}
+
+ /*
+ * Initialize our timeout checking mechanism.
+ */
+ last_reply_timestamp = GetCurrentTimestamp();
}
/*
@@ -417,9 +428,11 @@ HandleReplicationCommand(const char *cmd_string)
/* break out of the loop */
replication_started = true;
+ WalSndSetState(WALSNDSTATE_CATCHUP);
break;
case T_BaseBackupCmd:
+ WalSndSetState(WALSNDSTATE_BACKUP);
SendBaseBackup((BaseBackupCmd *) cmd_node);
/* Send CommandComplete and ReadyForQuery messages */
@@ -513,10 +526,11 @@ ProcessStandbyReplyMessage(void)
pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
- elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X ",
+ elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X xmin %d",
reply.write.xlogid, reply.write.xrecoff,
reply.flush.xlogid, reply.flush.xrecoff,
- reply.apply.xlogid, reply.apply.xrecoff);
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
/*
* Update shared state for this WalSender process
@@ -533,8 +547,16 @@ ProcessStandbyReplyMessage(void)
walsnd->flush = reply.flush;
if (XLByteLT(walsnd->apply, reply.apply))
walsnd->apply = reply.apply;
+ if (TransactionIdIsValid(reply.xmin) &&
+ TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+ MyProc->xmin = reply.xmin;
SpinLockRelease(&walsnd->mutex);
}
+
+ /*
+ * Release any backends waiting to commit.
+ */
+ SyncRepReleaseWaiters(false);
}
/* Main loop of walsender process */
@@ -584,7 +606,11 @@ WalSndLoop(void)
/* Normal exit from the walsender is here */
if (walsender_shutdown_requested)
{
- /* Inform the standby that XLOG streaming was done */
+ ProcessRepliesIfAny();
+
+ /* Inform the standby that XLOG streaming was done
+ * by sending CommandComplete message.
+ */
pq_puttextmessage('C', "COPY 0");
pq_flush();
@@ -592,12 +618,31 @@ WalSndLoop(void)
}
/*
- * If we had sent all accumulated WAL in last round, nap for the
- * configured time before retrying.
+ * If we had sent all accumulated WAL in last round, then we don't
+ * have much to do. We still expect a steady stream of replies from
+ * standby. It is important to note that we don't keep track of
+ * whether or not there are backends waiting here, since that
+ * is potentially very complex state information.
+ *
+ * Also note that there is no delay between sending data and
+ * checking for the replies. We expect replies to take some time
+ * and we are more concerned with overall throughput than absolute
+ * response time to any single request.
*/
if (caughtup)
{
/*
+ * If we were still catching up, change state to streaming.
+ * While in the initial catchup phase, clients waiting for
+ * a response from the standby would wait for a very long
+ * time, so we need to have a one-way state transition to avoid
+ * problems. No need to grab a lock for the check; we are the
+ * only one to ever change the state.
+ */
+ if (MyWalSnd->state < WALSNDSTATE_STREAMING)
+ WalSndSetState(WALSNDSTATE_STREAMING);
+
+ /*
* Even if we wrote all the WAL that was available when we started
* sending, more might have arrived while we were sending this
* batch. We had the latch set while sending, so we have not
@@ -610,6 +655,13 @@ WalSndLoop(void)
break;
if (caughtup && !got_SIGHUP && !walsender_ready_to_stop && !walsender_shutdown_requested)
{
+ long timeout;
+
+ if (sync_rep_timeout_server == -1)
+ timeout = -1L;
+ else
+ timeout = 1000000L * sync_rep_timeout_server;
+
/*
* XXX: We don't really need the periodic wakeups anymore,
* WaitLatchOrSocket should reliably wake up as soon as
@@ -617,8 +669,14 @@ WalSndLoop(void)
*/
/* Sleep */
- WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
- WalSndDelay * 1000L);
+ if (WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
+ timeout) == 0)
+ {
+ ereport(LOG,
+ (errmsg("streaming replication timeout after %d s",
+ sync_rep_timeout_server)));
+ break;
+ }
}
}
else
@@ -634,7 +692,7 @@ WalSndLoop(void)
}
/*
- * Get here on send failure. Clean up and exit.
+ * Get here on send failure or timeout. Clean up and exit.
*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -865,9 +923,9 @@ XLogSend(char *msgbuf, bool *caughtup)
* Attempt to send all data that's already been written out and fsync'd to
* disk. We cannot go further than what's been written out given the
* current implementation of XLogRead(). And in any case it's unsafe to
- * send WAL that is not securely down to disk on the master: if the master
+ * send WAL that is not securely down to disk on the primary: if the primary
* subsequently crashes and restarts, slaves must not have applied any WAL
- * that gets lost on the master.
+ * that gets lost on the primary.
*/
SendRqstPtr = GetFlushRecPtr();
@@ -945,6 +1003,9 @@ XLogSend(char *msgbuf, bool *caughtup)
msghdr.walEnd = SendRqstPtr;
msghdr.sendTime = GetCurrentTimestamp();
+ elog(DEBUG2, "sent = %X/%X ",
+ startptr.xlogid, startptr.xrecoff);
+
memcpy(msgbuf + 1, &msghdr, sizeof(WalDataMessageHeader));
pq_putmessage('d', msgbuf, 1 + sizeof(WalDataMessageHeader) + nbytes);
@@ -1102,6 +1163,16 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
InitSharedLatch(&walsnd->latch);
}
+
+ /*
+ * Initialise the spinlocks on each sync rep queue
+ */
+ for (i = 0; i < NUM_SYNC_REP_WAIT_MODES; i++)
+ {
+ SyncRepQueue *queue = &WalSndCtl->sync_rep_queue[i];
+
+ SpinLockInit(&queue->qlock);
+ }
}
}
@@ -1161,7 +1232,7 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 6
+#define PG_STAT_GET_WAL_SENDERS_COLS 7
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -1204,6 +1275,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
XLogRecPtr flush;
XLogRecPtr apply;
WalSndState state;
+ bool sync_rep_service;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -1216,6 +1288,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
+ sync_rep_service = walsnd->sync_rep_service;
SpinLockRelease(&walsnd->mutex);
memset(nulls, 0, sizeof(nulls));
@@ -1232,32 +1305,34 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
nulls[3] = true;
nulls[4] = true;
nulls[5] = true;
+ nulls[6] = true;
}
else
{
values[1] = CStringGetTextDatum(WalSndGetStateString(state));
+ values[2] = BoolGetDatum(sync_rep_service);
snprintf(location, sizeof(location), "%X/%X",
sentPtr.xlogid, sentPtr.xrecoff);
- values[2] = CStringGetTextDatum(location);
+ values[3] = CStringGetTextDatum(location);
if (write.xlogid == 0 && write.xrecoff == 0)
nulls[4] = true;
snprintf(location, sizeof(location), "%X/%X",
write.xlogid, write.xrecoff);
- values[3] = CStringGetTextDatum(location);
+ values[4] = CStringGetTextDatum(location);
if (flush.xlogid == 0 && flush.xrecoff == 0)
nulls[5] = true;
snprintf(location, sizeof(location), "%X/%X",
flush.xlogid, flush.xrecoff);
- values[4] = CStringGetTextDatum(location);
+ values[5] = CStringGetTextDatum(location);
if (apply.xlogid == 0 && apply.xrecoff == 0)
nulls[6] = true;
snprintf(location, sizeof(location), "%X/%X",
apply.xlogid, apply.xrecoff);
- values[5] = CStringGetTextDatum(location);
+ values[6] = CStringGetTextDatum(location);
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index be577bc..7aa7671 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -39,6 +39,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -196,6 +197,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -214,6 +216,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -224,6 +227,7 @@ InitProcGlobal(void)
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&procs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
@@ -326,6 +330,13 @@ InitProcess(void)
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+ MyProc->ownLatch = true;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -365,6 +376,7 @@ InitProcessPhase2(void)
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5ede280..c3c9b98 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -56,6 +56,7 @@
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
+#include "replication/syncrep.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/standby.h"
@@ -620,6 +621,15 @@ const char *const config_type_names[] =
static struct config_bool ConfigureNamesBool[] =
{
{
+ {"allow_standalone_primary", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Refuse connections on startup and force users to wait forever if synchronous replication has failed."),
+ NULL
+ },
+ &allow_standalone_primary,
+ true, NULL, NULL
+ },
+
+ {
{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of sequential-scan plans."),
NULL
@@ -1279,6 +1289,33 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+
+ {
+ {"synchronous_replication_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a standby to primary for synchronous replication."),
+ NULL
+ },
+ &sync_rep_service,
+ true, NULL, NULL
+ },
+
+ {
+ {"hot_standby_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a hot standby to primary to avoid query conflicts."),
+ NULL
+ },
+ &hot_standby_feedback,
+ false, NULL, NULL
+ },
+
+ {
{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
gettext_noop("Allows modifications of the structure of system tables."),
NULL,
@@ -1474,6 +1511,26 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"replication_timeout_client", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Clients waiting for confirmation will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_client,
+ 120, -1, INT_MAX, NULL, NULL
+ },
+
+ {
+ {"replication_timeout_server", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Replication connection will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_server,
+ 30, -1, INT_MAX, NULL, NULL
+ },
+
+ {
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1b02aa0..04cba08 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,7 +184,15 @@
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
-# - Streaming Replication -
+# - Replication - User Settings
+
+#synchronous_replication = off # commit waits for reply from standby
+#replication_timeout_client = 120 # -1 means wait forever
+
+# - Streaming Replication - Server Settings
+
+#allow_standalone_primary = on # sync rep parameter
+#replication_timeout_client = 30 # -1 means wait forever
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
@@ -197,6 +205,8 @@
#wal_receiver_status_interval = 10s # replies at least this often, 0 disables
#hot_standby = off # "on" allows queries during recovery
# (change requires restart)
+#hot_standby_feedback = off # info from standby to prevent query conflicts
+#synchronous_replication_feedback = off # allows sync replication
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 99754e2..361af19 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -289,6 +289,7 @@ extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
extern bool RecoveryInProgress(void);
+extern bool HotStandbyActive(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
extern XLogRecPtr GetXLogReplayRecPtr(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 77ac369..624e640 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3075,7 +3075,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,23}" "{i,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25}" "{o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,16,25,25,25,25}" "{o,o,o,o,o,o,o}" "{procpid,state,sync,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 4cdb15f..9a00b2c 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -73,7 +73,7 @@ typedef struct
typedef enum CAC_state
{
CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
- CAC_WAITBACKUP
+ CAC_WAITBACKUP, CAC_REPLICATION_ONLY
} CAC_state;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
new file mode 100644
index 0000000..a071b9a
--- /dev/null
+++ b/src/include/replication/syncrep.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _SYNCREP_H
+#define _SYNCREP_H
+
+#include "access/xlog.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+#define SyncRepRequested() (sync_rep_mode)
+#define StandbyOffersSyncRepService() (sync_rep_service)
+
+/*
+ * There is no reply from standby to primary for async mode, so the reply
+ * message needs one less slot than the maximum number of modes
+ */
+#define NUM_SYNC_REP_WAIT_MODES 1
+
+extern XLogRecPtr ReplyLSN[NUM_SYNC_REP_WAIT_MODES];
+
+/*
+ * Each synchronous rep wait mode has one SyncRepWaitQueue in shared memory.
+ * These queues live in the WAL sender shmem area.
+ */
+typedef struct SyncRepQueue
+{
+ /*
+ * Current location of the head of the queue. Nobody should be waiting
+ * on the queue for an lsn equal to or earlier than this value. Procs
+ * on the queue will always be later than this value, though we don't
+ * record those values here.
+ */
+ XLogRecPtr lsn;
+
+ PGPROC *head;
+ PGPROC *tail;
+
+ slock_t qlock; /* locks shared variables shown above */
+} SyncRepQueue;
+
+/* user-settable parameters for synchronous replication */
+extern bool sync_rep_mode;
+extern int sync_rep_timeout_client;
+extern int sync_rep_timeout_server;
+extern bool sync_rep_service;
+
+extern bool hot_standby_feedback;
+
+/* called by user backend */
+extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* called by wal sender */
+extern void SyncRepReleaseWaiters(bool timeout);
+extern void SyncRepTimeoutExceeded(void);
+
+/* callback at exit */
+extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index c69ca9d..8a7101a 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -69,6 +69,13 @@ typedef struct
*/
XLogRecPtr apply;
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side does not support feedback,
+ * or Hot Standby is not yet available.
+ */
+ TransactionId xmin;
+
/* Sender's system clock at the time of transmission */
TimestampTz sendTime;
} StandbyReplyMessage;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index aa5bfb7..f57df6a 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -13,6 +13,8 @@
#define _WALRECEIVER_H
#include "access/xlogdefs.h"
+#include "replication/syncrep.h"
+#include "storage/latch.h"
#include "storage/spin.h"
#include "pgtime.h"
@@ -72,6 +74,11 @@ typedef struct
*/
char conninfo[MAXCONNINFO];
+ /*
+ * Latch used by aux procs to wake up walreceiver when it has work to do.
+ */
+ Latch latch;
+
slock_t mutex; /* locks shared variables shown above */
} WalRcvData;
@@ -93,6 +100,7 @@ extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
/* prototypes for functions in walreceiver.c */
extern void WalReceiverMain(void);
+extern void WalRcvWakeup(void);
/* prototypes for functions in walreceiverfuncs.c */
extern Size WalRcvShmemSize(void);
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index abee380..7f2e4d9 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -15,6 +15,7 @@
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+#include "replication/syncrep.h"
#include "storage/spin.h"
@@ -54,20 +55,46 @@ typedef struct WalSnd
XLogRecPtr apply;
/*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ TransactionId xmin;
+
+ /*
* Latch used by backends to wake up this walsender when it has work
* to do.
*/
Latch latch;
/*
+ * Highest level of sync rep available from this standby.
+ */
+ bool sync_rep_service;
+
+ /*
* Locks shared variables shown above.
*/
- slock_t mutex;
+ slock_t mutex;
} WalSnd;
+extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Sync rep wait queues with one queue per request type.
+ * We use one queue per request type so that we can maintain the
+ * invariant that the individual queues are sorted on LSN.
+ * This may also help performance when multiple wal senders
+ * offer different sync rep service levels.
+ */
+ SyncRepQueue sync_rep_queue[NUM_SYNC_REP_WAIT_MODES];
+
+ bool sync_rep_service_available;
+
+ slock_t ctlmutex; /* locks shared variables shown above */
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
@@ -81,6 +108,7 @@ extern volatile sig_atomic_t walsender_ready_to_stop;
/* user-settable parameters */
extern int WalSndDelay;
extern int max_wal_senders;
+extern bool allow_standalone_primary;
extern int WalSenderMain(void);
extern void WalSndSignals(void);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 97bdc7b..0d2a78e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -29,6 +29,7 @@ typedef enum
PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */
PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */
PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
+ PMSIGNAL_SYNC_REPLICATION_ACTIVE, /* walsender has completed handshake */
NUM_PMSIGNALS /* Must be last value of enum! */
} PMSignalReason;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 78dbade..27b57c8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,8 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
@@ -115,6 +117,11 @@ struct PGPROC
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+ bool ownLatch; /* do we own the above latch? */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 1dbd1e5..b070340 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1296,7 +1296,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sync, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sync, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Tue, 2011-02-08 at 13:53 -0500, Robert Haas wrote:
That having been said, there is at least one part of this patch which
looks to be in pretty good shape and seems independently useful
regardless of what happens to the rest of it, and that is the code
that sends replies from the standby back to the primary. This allows
pg_stat_replication to display the write/flush/apply log positions on
the standby next to the sent position on the primary, which as far as
I am concerned is pure gold. Simon had this set up to happen only
when synchronous replication or XID feedback in use, but I think
people are going to want it even with plain old asynchronous
replication, because it provides a FAR easier way to monitor standby
lag than anything we have today. I've extracted this portion of the
patch, cleaned it up a bit, written docs, and attached it here.
Score! +1
JD
--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Wed, Feb 9, 2011 at 3:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
That having been said, there is at least one part of this patch which
looks to be in pretty good shape and seems independently useful
regardless of what happens to the rest of it, and that is the code
that sends replies from the standby back to the primary. This allows
pg_stat_replication to display the write/flush/apply log positions on
the standby next to the sent position on the primary, which as far as
I am concerned is pure gold. Simon had this set up to happen only
when synchronous replication or XID feedback in use, but I think
people are going to want it even with plain old asynchronous
replication, because it provides a FAR easier way to monitor standby
lag than anything we have today. I've extracted this portion of the
patch, cleaned it up a bit, written docs, and attached it here.
What about also sending back the timestamp of the last applied
transaction? That's more user-friendly than the apply location
when we calculate the lag of replication, I think.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On mån, 2011-02-07 at 12:55 -0500, Robert Haas wrote:
On Mon, Feb 7, 2011 at 12:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.Uh, we did that with 8.4 and it was a disaster. The CommitFest lasted
*five months*. We've been doing schedule-based CommitFests ever since
and it's worked much better.
The previous three commit fests contained about 50 patches each and
lasted one month each. The current commit fest contains about 100
patches, so it shouldn't be surprising that it will take about 2 months
to get through it.
Moreover, under the current process, it is apparent that reviewing is
the bottleneck. More code gets written than gets reviewed. By
insisting on the current schedule, we would just push the growing review
backlog ahead of ourselves. The solution (at least short-term, while
maintaining the process) has to be to increase the resources (in
practice: time) dedicated to reviewing relative to coding.
On 02/09/2011 07:53 AM, Peter Eisentraut wrote:
On mån, 2011-02-07 at 12:55 -0500, Robert Haas wrote:
On Mon, Feb 7, 2011 at 12:43 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
Robert Haas<robertmhaas@gmail.com> writes:
... Well, the current CommitFest ends in one week, ...
Really? I thought the idea for the last CF of a development cycle was
that it kept going till we'd dealt with everything. Arbitrarily
rejecting stuff we haven't dealt with doesn't seem fair.Uh, we did that with 8.4 and it was a disaster. The CommitFest lasted
*five months*. We've been doing schedule-based CommitFests ever since
and it's worked much better.The previous three commit fests contained about 50 patches each and
lasted one month each. The current commit fest contains about 100
patches, so it shouldn't be surprising that it will take about 2 months
to get through it.Moreover, under the current process, it is apparent that reviewing is
the bottleneck. More code gets written than gets reviewed. By
insisting on the current schedule, we would just push the growing review
backlog ahead of ourselves. The solution (at least short-term, while
maintaining the process) has to be to increase the resources (in
practice: time) dedicated to reviewing relative to coding.
Personally I think it's not unreasonable to extend the final commitfest
of the release some. It doesn't need to be a huge amount longer,
certainly not five months, but a couple of weeks to a month might be fair.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 02/09/2011 07:53 AM, Peter Eisentraut wrote:
The previous three commit fests contained about 50 patches each and
lasted one month each. The current commit fest contains about 100
patches, so it shouldn't be surprising that it will take about 2 months
to get through it.
Personally I think it's not unreasonable to extend the final commitfest
of the release some. It doesn't need to be a huge amount longer,
certainly not five months, but a couple of weeks to a month might be fair.
Yeah. IIRC, in our first cycle using the CF process, we expected the
last CF to take longer than others. I am not sure where the idea came
from that we'd be able to finish this one in a month.
I do accept the fact that we mustn't let it drag on indefinitely.
But two months instead of one isn't indefinite, and it seems more
realistic given the amount of work to be done.
regards, tom lane
On Wed, Feb 9, 2011 at 9:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 02/09/2011 07:53 AM, Peter Eisentraut wrote:
The previous three commit fests contained about 50 patches each and
lasted one month each. The current commit fest contains about 100
patches, so it shouldn't be surprising that it will take about 2 months
to get through it.Personally I think it's not unreasonable to extend the final commitfest
of the release some. It doesn't need to be a huge amount longer,
certainly not five months, but a couple of weeks to a month might be fair.Yeah. IIRC, in our first cycle using the CF process, we expected the
last CF to take longer than others. I am not sure where the idea came
from that we'd be able to finish this one in a month.
It came from the fact that we did it last time.
I do accept the fact that we mustn't let it drag on indefinitely.
But two months instead of one isn't indefinite, and it seems more
realistic given the amount of work to be done.
The work will expand to fill the time available.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 9, 2011 at 7:53 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
Moreover, under the current process, it is apparent that reviewing is
the bottleneck. More code gets written than gets reviewed. By
insisting on the current schedule, we would just push the growing review
backlog ahead of ourselves. The solution (at least short-term, while
maintaining the process) has to be to increase the resources (in
practice: time) dedicated to reviewing relative to coding.
Yep. People who submit patches must also review patches if they want
their own stuff reviewed.
It sounds to me like what's being proposed is that I should spend
another month working on other people's patches, while they work on
their own patches. I can't get excited about that. The situation
with reviewing has gotten totally out of hand. I review and commit
more patches as part of each CommitFest than anyone except Tom, and I
think there have been some CommitFests where I did more patches than
he did (though he still wins by a mile if you factor in patch
complexity). But on the flip side, I can't always get a reviewer for
my own patches, or sometimes I get a perfunctory review that someone
spent ten minutes on. Huh?
So I heartily approve of the suggestion that we need to devote more
energy to reviewing, if it means "more reviewing by the people who are
not me". And allow me to suggest that that energy get put in NOW,
rather than a month from now. Most of the patches that still need
review are not that complicated. At least half of them could probably
be meaningfully reviewed in an hour or two. Then the author could
post an update tomorrow. Then the reviewer could spend another 30
minutes and mark them ready for committer. Next!
There are certainly some patches in this CommitFest that need more
attention than that, and that probably need the attention of a senior
community member. Jeff's range types patch and Alvaro's key lock
patch are two of those. And I would be willing to do that, except
that I'm already listed as a reviewer for FOURTEEN PATCHES this
CommitFest, plus I committed some others that someone else reviewed
and am also functioning as CommitFest manager. The problem isn't so
much the amount of calendar time that's required to get through 100
patches as the many people either submit half-baked code and assume
that they or someone else will fix it later, or else they submit code
but don't do an amount of review work equal to the amount of review
work they generate.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Feb 9, 2011, at 9:20 AM, Robert Haas wrote:
There are certainly some patches in this CommitFest that need more
attention than that, and that probably need the attention of a senior
community member. Jeff's range types patch and Alvaro's key lock
patch are two of those. And I would be willing to do that, except
that I'm already listed as a reviewer for FOURTEEN PATCHES this
CommitFest, plus I committed some others that someone else reviewed
and am also functioning as CommitFest manager. The problem isn't so
much the amount of calendar time that's required to get through 100
patches as the many people either submit half-baked code and assume
that they or someone else will fix it later, or else they submit code
but don't do an amount of review work equal to the amount of review
work they generate.
Frankly, I think you should surrender some of those 14 and cajole some other folks to take on more.
Best,
David
On Wed, Feb 9, 2011 at 1:09 PM, David E. Wheeler <david@kineticode.com> wrote:
Frankly, I think you should surrender some of those 14 and cajole some other folks to take on more.
Happily... only trouble is, I suck at cajoling. Even my begging is
distinctly sub-par.
Pleeeeeeeease?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Feb 9, 2011, at 10:29 AM, Robert Haas wrote:
Frankly, I think you should surrender some of those 14 and cajole some other folks to take on more.
Happily... only trouble is, I suck at cajoling. Even my begging is
distinctly sub-par.Pleeeeeeeease?
Try this:
“Listen up, bitches! I'm tired of Tom and me having to do all the work. All of you who submitted patches need to review some other patches! If you haven't submitted a review for someone else's patch by commitfest end, your patches will be marked "returned."”
Then maybe cuff Jeff or Alvaro or someone, to show you mean business.
HTH,
David
* Robert Haas (robertmhaas@gmail.com) wrote:
On Wed, Feb 9, 2011 at 1:09 PM, David E. Wheeler <david@kineticode.com> wrote:
Frankly, I think you should surrender some of those 14 and cajole some other folks to take on more.
Happily... only trouble is, I suck at cajoling. Even my begging is
distinctly sub-par.Pleeeeeeeease?
Erm, I've been through the commitfest app a couple of different times,
but have ignored things which are marked 'Needs Reivew' when there's a
reviewer listed...
If there are patches where you're marked as the reviewer but you don't
have time to review them or want help, take your name off as a reviewer
for them and/or speak up and explicitly ask for help. I'm not going to
start reviewing something if I think someone else is already working on
it..
Thanks,
Stephen
On Wed, Feb 9, 2011 at 1:32 PM, David E. Wheeler <david@kineticode.com> wrote:
On Feb 9, 2011, at 10:29 AM, Robert Haas wrote:
Frankly, I think you should surrender some of those 14 and cajole some other folks to take on more.
Happily... only trouble is, I suck at cajoling. Even my begging is
distinctly sub-par.Pleeeeeeeease?
Try this:
“Listen up, bitches! I'm tired of Tom and me having to do all the work. All of you who submitted patches need to review some other patches! If you haven't submitted a review for someone else's patch by commitfest end, your patches will be marked "returned."”
Then maybe cuff Jeff or Alvaro or someone, to show you mean business.
That tends not to get a lot of community support, and it isn't my
intention anyway. We actually do not need to impose a draconian rule;
we just need everyone to put in a little extra effort to get us over
the hump.
But speaking of that, I just so happen to notice you haven't signed up
to review any patches this CF. How about grabbing one or two?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Feb 9, 2011, at 10:56 AM, Robert Haas wrote:
“Listen up, bitches! I'm tired of Tom and me having to do all the work. All of you who submitted patches need to review some other patches! If you haven't submitted a review for someone else's patch by commitfest end, your patches will be marked "returned."”
Then maybe cuff Jeff or Alvaro or someone, to show you mean business.
That tends not to get a lot of community support, and it isn't my
intention anyway. We actually do not need to impose a draconian rule;
we just need everyone to put in a little extra effort to get us over
the hump.
Agreed. Let me remove my tongue from my cheek.
But speaking of that, I just so happen to notice you haven't signed up
to review any patches this CF. How about grabbing one or tw
ha ha! Alas, I'm completely overcommitted at this point. Been having a hard time making time for PGXN. I've been tracking the extension stuff closely, though, as you can imagine.
Looking at the patches without reviewers anyway, frankly none look like the sorts of things I have the expertise to test in any but the most superficial way. Are there more that should have the reviewer removed? If there were one I could give a couple of hours to and speak with some knowledge, I could fix up some time next week.
Best,
David
On Wed, Feb 9, 2011 at 2:01 PM, David E. Wheeler <david@kineticode.com> wrote:
ha ha! Alas, I'm completely overcommitted at this point. Been having a hard time making time for PGXN. I've been tracking the extension stuff closely, though, as you can imagine.
It's a common problem, and of course none of us are in a position to
dictate how other people spend their time. But the issue on the table
is whether we want PostgreSQL 9.1 to be released in 2011. If yes,
then without making any statements about what any particular person
has to or must do, we collectively need to step it up a notch or two.
Looking at the patches without reviewers anyway, frankly none look like the sorts of things I have the expertise to test in any but the most superficial way. Are there more that should have the reviewer removed? If there were one I could give a couple of hours to and speak with some knowledge, I could fix up some time next week.
I just sent a note on some that seem like they could use more looking
at, but there may be other ones too. Now is not the time to hold back
because you think someone else might be working on it. Most of the
time, the fact that a patch has a reviewer means that they either
intended to or actually did review it at some point in time, but not
that they are necessarily working on it right this minute, and
certainly not that other input isn't welcome. This is especially true
towards the end of the CommitFest or when the thread hasn't had
anything new posted to it for several days.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 9, 2011 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Feb 8, 2011 at 2:34 PM, Magnus Hagander <magnus@hagander.net> wrote:
I also agree with the general idea of trying to break it into smaller
parts - even if they only provide small parts each on it's own. That
also makes it easier to get an overview of exactly how much is left,
to see where to focus.And on that note, here's the rest of the patch back, rebased over what
I posted ~90 minutes ago.
Though I haven't read the patch enough yet, I have one review comment.
While walsender uses the non-blocking I/O function (i.e.,
pq_getbyte_if_available)
for the receive, it uses the blocking one (i.e., pq_flush, etc) for the send.
So, sync_rep_timeout_server would not work well when the walsender
gets blocked in sending WAL. This is one the problems which I struggled
with when I created the SyncRep patch before. I think that we need to
introduce the non-blocking send function for the replication timeout.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 08.02.2011 20:53, Robert Haas wrote:
That having been said, there is at least one part of this patch which
looks to be in pretty good shape and seems independently useful
regardless of what happens to the rest of it, and that is the code
that sends replies from the standby back to the primary. This allows
pg_stat_replication to display the write/flush/apply log positions on
the standby next to the sent position on the primary, which as far as
I am concerned is pure gold. Simon had this set up to happen only
when synchronous replication or XID feedback in use, but I think
people are going to want it even with plain old asynchronous
replication, because it provides a FAR easier way to monitor standby
lag than anything we have today. I've extracted this portion of the
patch, cleaned it up a bit, written docs, and attached it here.
Thanks!
I wasn't too sure how to control the timing of the replies. It's
worth noting that you have to send them pretty frequently for the
distinction between xlog written and xlog flushed to have any value.
What I've done here is made it so that every time we read all
available data on the socket, we send a reply. After flushing, we
send another reply. And then just for the heck of it we send a reply
at least every 10 seconds (configurable), which causes the
last-known-apply position to eventually get updated on the master.
This means the apply position can lag reality by a bit.
Seems reasonable. As the patch stands, however, the standby doesn't send
any status updates if its busy receiving, writing, and flushing the
incoming WAL. That would happen if you have a fast network, and slow
disk, and the standby is catching up, e.g after restoring a base backup.
I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it
also sends a status update every time the WAL is flushed. If the
walreceiver is busy receiving and flushing, that would happen once per
WAL segment, which seems sensible.
The comment above StandbyReplyMessage said that its message type is 'r'.
However, no message type was actually sent for the replies. A message
type byte seems like a good idea, for the sake of extensibility, so I
made the code match that comment. I also added documentation of this new
message type in the manual section about the streaming replication protocol.
I committed the patch with those changes, and some minor comment tweaks
and other kibitzing.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Feb 8, 2011 at 3:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Feb 8, 2011 at 2:34 PM, Magnus Hagander <magnus@hagander.net> wrote:
I also agree with the general idea of trying to break it into smaller
parts - even if they only provide small parts each on it's own. That
also makes it easier to get an overview of exactly how much is left,
to see where to focus.And on that note, here's the rest of the patch back, rebased over what
I posted ~90 minutes ago.
Another rebase.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
syncrep-v9.3.patchapplication/octet-stream; name=syncrep-v9.3.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 63c6283..726c9c0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2029,8 +2029,122 @@ SET ENABLE_SEQSCAN TO OFF;
This parameter can only be set in the <filename>postgresql.conf</>
file or on the server command line.
</para>
+ <para>
+ You should also consider setting <varname>hot_standby_feedback</>
+ as an alternative to using this parameter.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
+ <sect2 id="runtime-config-sync-rep">
+ <title>Synchronous Replication</title>
+
+ <para>
+ These settings control the behavior of the built-in
+ <firstterm>synchronous replication</> feature.
+ These parameters would be set on the primary server that is
+ to send replication data to one or more standby servers.
+ </para>
+
+ <variablelist>
+ <varlistentry id="guc-synchronous-replication" xreflabel="synchronous_replication">
+ <term><varname>synchronous_replication</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether transaction commit will wait for WAL records
+ to be replicated before the command returns a <quote>success</>
+ indication to the client. The default setting is <literal>off</>.
+ When <literal>on</>, there will be a delay while the client waits
+ for confirmation of successful replication. That delay will
+ increase depending upon the physical distance and network activity
+ between primary and standby. The commit wait will last until the
+ first reply from any standby. Multiple standby servers allow
+ increased availability and possibly increase performance as well.
+ </para>
+ <para>
+ The parameter must be set on both primary and standby.
+ </para>
+ <para>
+ On the primary, this parameter can be changed at any time; the
+ behavior for any one transaction is determined by the setting in
+ effect when it commits. It is therefore possible, and useful, to have
+ some transactions replicate synchronously and others asynchronously.
+ For example, to make a single multistatement transaction commit
+ asynchronously when the default is synchronous replication, issue
+ <command>SET LOCAL synchronous_replication TO OFF</> within the
+ transaction.
+ </para>
+ <para>
+ On the standby, the parameter value is taken only at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-allow-standalone-primary" xreflabel="allow_standalone_primary">
+ <term><varname>allow_standalone_primary</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>allow_standalone_primary</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If <varname>allow_standalone_primary</> is set, then the server
+ can operate normally whether or not replication is active. If
+ a client requests <varname>synchronous_replication</> and it is
+ not available, they will use asynchornous replication instead.
+ </para>
+ <para>
+ If <varname>allow_standalone_primary</> is not set, then the server
+ will prevent normal client connections until a standby connects that
+ has <varname>synchronous_replication_feedback</> enabled. Once
+ clients connect, if they request <varname>synchronous_replication</>
+ and it is no longer available they will wait for
+ <varname>replication_timeout_client</>.
+ </para>
</listitem>
</varlistentry>
+
+ <varlistentry id="guc-replication-timeout-client" xreflabel="replication_timeout_client">
+ <term><varname>replication_timeout_client</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_client</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and a synchronous standby is currently available
+ then the commit will wait for up to <varname>replication_timeout_client</>
+ seconds before it returns a <quote>success</>. The commit will wait
+ forever for a confirmation when <varname>replication_timeout_client</>
+ is set to -1.
+ </para>
+ <para>
+ If the client has <varname>synchronous_replication</varname> set,
+ and yet no synchronous standby is available when we commit, then the
+ setting of <varname>allow_standalone_primary</> determines whether
+ or not we wait.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-replication-timeout-server" xreflabel="replication_timeout_server">
+ <term><varname>replication_timeout_server</varname> (<type>integer</type>)</term>
+ <indexterm>
+ <primary><varname>replication_timeout_server</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ If the primary server does not receive a reply from a standby server
+ within <varname>replication_timeout_server</> seconds then the
+ primary will terminate the replication connection.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
@@ -2121,6 +2235,42 @@ SET ENABLE_SEQSCAN TO OFF;
</listitem>
</varlistentry>
+ <varlistentry id="guc-hot-standby-feedback" xreflabel="hot_standby">
+ <term><varname>hot_standby_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>hot_standby_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether or not a hot standby will send feedback to the primary
+ about queries currently executing on the standby. This parameter can
+ be used to eliminate query cancels caused by cleanup records, though
+ it can cause database bloat on the primary for some workloads.
+ The default value is <literal>off</literal>.
+ This parameter can only be set at server start. It only has effect
+ if <varname>hot_standby</> is enabled.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-synchronous-replication-feedback" xreflabel="synchronous_replication_feedback">
+ <term><varname>synchronous_replication_feedback</varname> (<type>boolean</type>)</term>
+ <indexterm>
+ <primary><varname>synchronous_replication_feedback</> configuration parameter</primary>
+ </indexterm>
+ <listitem>
+ <para>
+ Specifies whether the standby will provide reply messages to
+ allow synchronous replication on the primary.
+ Reasons for doing this might be that the standby is physically
+ co-located with the primary and so would be a bad choice as a
+ future primary server, or the standby might be a test server.
+ The default value is <literal>on</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect2>
</sect1>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a892969..c006f35 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -738,13 +738,12 @@ archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
</para>
<para>
- Streaming replication is asynchronous, so there is still a small delay
+ There is a small replication delay
between committing a transaction in the primary and for the changes to
become visible in the standby. The delay is however much smaller than with
file-based log shipping, typically under one second assuming the standby
is powerful enough to keep up with the load. With streaming replication,
- <varname>archive_timeout</> is not required to reduce the data loss
- window.
+ <varname>archive_timeout</> is not required.
</para>
<para>
@@ -879,6 +878,236 @@ primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</sect3>
</sect2>
+ <sect2 id="synchronous-replication">
+ <title>Synchronous Replication</title>
+
+ <indexterm zone="high-availability">
+ <primary>Synchronous Replication</primary>
+ </indexterm>
+
+ <para>
+ <productname>PostgreSQL</> streaming replication is asynchronous by
+ default. If the primary server
+ crashes then some transactions that were committed may not have been
+ replicated to the standby server, causing data loss. The amount
+ of data loss is proportional to the replication delay at the time of
+ failover. That could be zero, or more, we do not know for certain
+ either way, when using asynchronous replication.
+ </para>
+
+ <para>
+ Synchronous replication offers the ability to confirm that all changes
+ made by a transaction have been transferred to at least one remote
+ standby server. This extends the standard level of durability
+ offered by a transaction commit. This level of protection is referred
+ to as 2-safe replication in computer science theory.
+ </para>
+
+ <para>
+ Synchronous replication works in the following way. When requested,
+ the commit of a write transaction will wait until confirmation is
+ received that the commit has been written to the transaction log on disk
+ of both the primary and standby server. The only possibility that data
+ can be lost is if both the primary and the standby suffer crashes at the
+ same time. This can provide a much higher level of durability if the
+ sysadmin is cautious about the placement and management of the two servers.
+ Waiting for confirmation increases the user's confidence that the changes
+ will not be lost in the event of server crashes but it also necessarily
+ increases the response time for the requesting transaction. The minimum
+ wait time is the roundtrip time between primary to standby.
+ </para>
+
+ <para>
+ Read only transactions and transaction rollbacks need not wait for
+ replies from standby servers. Subtransaction commits do not wait for
+ responses from standby servers, only final top-level commits. Long
+ running actions such as data loading or index building do not wait
+ until the very final commit message.
+ </para>
+
+ <sect3 id="synchronous-replication-config">
+ <title>Basic Configuration</title>
+
+ <para>
+ Synchronous replication will be active if appropriate options are
+ enabled on both the primary and at least one standby server. If
+ options are not correctly set on both servers, the primary will use
+ use asynchronous replication by default.
+ </para>
+
+ <para>
+ On the primary server we need to set
+
+<programlisting>
+synchronous_replication = on
+</programlisting>
+
+ and on the standby server we need to set
+
+<programlisting>
+synchronous_replication_feedback = on
+</programlisting>
+
+ On the primary, <varname>synchronous_replication</> can be set
+ for particular users or databases, or dynamically by applications
+ programs. On the standby, <varname>synchronous_replication_feedback</>
+ can only be set at server start.
+ </para>
+
+ <para>
+ If more than one standby server
+ specifies <varname>synchronous_replication_feedback</>, then whichever
+ standby replies first will release waiting commits.
+ Turning this setting off for a standby allows the administrator to
+ exclude certain standby servers from releasing waiting transactions.
+ This is useful if not all standby servers are designated as potential
+ future primary servers, such as if a standby were co-located
+ with the primary, so that a disaster would cause both servers to be lost.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-performance">
+ <title>Planning for Performance</title>
+
+ <para>
+ Synchronous replication usually requires carefully planned and placed
+ standby servers to ensure applications perform acceptably. Waiting
+ doesn't utilise system resources, but transaction locks continue to be
+ held until the transfer is confirmed. As a result, incautious use of
+ synchronous replication will reduce performance for database
+ applications because of increased response times and higher contention.
+ </para>
+
+ <para>
+ <productname>PostgreSQL</> allows the application developer
+ to specify the durability level required via replication. This can be
+ specified for the system overall, though it can also be specified for
+ specific users or connections, or even individual transactions.
+ </para>
+
+ <para>
+ For example, an application workload might consist of:
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ <para>
+ With synchronous replication options specified at the application level
+ (on the primary) we can offer sync rep for the most important changes,
+ without slowing down the bulk of the total workload. Application level
+ options are an important and practical tool for allowing the benefits of
+ synchronous replication for high performance applications.
+ </para>
+
+ <para>
+ You should consider that the network bandwidth must be higher than
+ the rate of generation of WAL data.
+ 10% of changes are important customer details, while
+ 90% of changes are less important data that the business can more
+ easily survive if it is lost, such as chat messages between users.
+ </para>
+
+ </sect3>
+
+ <sect3 id="synchronous-replication-ha">
+ <title>Planning for High Availability</title>
+
+ <para>
+ The easiest and safest method of gaining High Availability using
+ synchronous replication is to configure at least two standby servers.
+ To understand why, we need to examine what can happen when you lose all
+ standby servers.
+ </para>
+
+ <para>
+ Commits made when synchronous_replication is set will wait until at
+ least one standby responds. The response may never occur if the last,
+ or only, standby should crash or the network drops. What should we do in
+ that situation?
+ </para>
+
+ <para>
+ Sitting and waiting will typically cause operational problems
+ because it is an effective outage of the primary server should all
+ sessions end up waiting. In contrast, allowing the primary server to
+ continue processing write transactions in the absence of a standby
+ puts those latest data changes at risk. So in this situation there
+ is a direct choice between database availability and the potential
+ durability of the data it contains. How we handle this situation
+ is controlled by <varname>allow_standalone_primary</>. The default
+ setting is <literal>on</>, allowing processing to continue, though
+ there is no recommended setting. Choosing the best setting for
+ <varname>allow_standalone_primary</> is a difficult decision and best
+ left to those with combined business responsibility for both data and
+ applications. The difficulty of this choice is the reason why we
+ recommend that you reduce the possibility of this situation occurring
+ by using multiple standby servers.
+ </para>
+
+ <para>
+ A user will stop waiting once the <varname>replication_timeout_client</>
+ has been reached for their specific session. Users are not waiting for
+ a specific standby to reply, they are waiting for a reply from any
+ standby, so the unavailability of any one standby is not significant
+ to a user. It is possible for user sessions to hit timeout even though
+ standbys are communicating normally. In that case, the setting of
+ <varname>replication_timeout</> is probably too low.
+ </para>
+
+ <para>
+ The standby sends regular status messages to the primary. If no status
+ messages have been received for <varname>replication_timeout_server</>
+ the primary server will assume the connection is dead and terminate it.
+ </para>
+
+ <para>
+ When the primary is started with <varname>allow_standalone_primary</>
+ enabled, the primary will not allow connections until a standby connects
+ that also has <varname>synchronous_replication</> enabled. This is a
+ convenience to ensure that we don't allow connections before write
+ transactions will return successfully.
+ </para>
+
+ <para>
+ When a standby first attaches to the primary, it may not be properly
+ synchronized. The standby is only able to become a synchronous standby
+ once it has become synchronized, or "caught up" with the the primary.
+ The catch-up duration may be long immediately after the standby has
+ been created. If the standby is shutdown, then the catch-up period
+ will increase according to the length of time the standby has been
+ down. You are advised to make sure <varname>allow_standalone_primary</>
+ is not set during the initial catch-up period.
+ </para>
+
+ <para>
+ If primary crashes while commits are waiting for acknowledgement, those
+ transactions will be marked fully committed if the primary database
+ recovers, no matter how <varname>allow_standalone_primary</> is set.
+ There is no way to be certain that all standbys have received all
+ outstanding WAL data at time of the crash of the primary. Some
+ transactions may not show as committed on the standby, even though
+ they show as committed on the primary. The guarantee we offer is that
+ the application will not receive explicit acknowledgement of the
+ successful commit of a transaction until the WAL data is known to be
+ safely received by the standby. Hence this mechanism is technically
+ "semi synchronous" rather than "fully synchronous" replication. Note
+ that replication still not be fully synchronous even if we wait for
+ all standby servers, though this would reduce availability, as
+ described previously.
+ </para>
+
+ <para>
+ If you need to re-create a standby server while transactions are
+ waiting, make sure that the commands to run pg_start_backup() and
+ pg_stop_backup() are run in a session with
+ synchronous_replication = off, otherwise those requests will wait
+ forever for the standby to appear.
+ </para>
+
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="warm-standby-failover">
@@ -1393,11 +1622,18 @@ if (!triggered)
These conflicts are <emphasis>hard conflicts</> in the sense that queries
might need to be cancelled and, in some cases, sessions disconnected to resolve them.
The user is provided with several ways to handle these
- conflicts. Conflict cases include:
+ conflicts. Conflict cases in order of likely frequency are:
<itemizedlist>
<listitem>
<para>
+ Application of a vacuum cleanup record from WAL conflicts with
+ standby transactions whose snapshots can still <quote>see</> any of
+ the rows to be removed.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
Access Exclusive locks taken on the primary server, including both
explicit <command>LOCK</> commands and various <acronym>DDL</>
actions, conflict with table accesses in standby queries.
@@ -1417,14 +1653,8 @@ if (!triggered)
</listitem>
<listitem>
<para>
- Application of a vacuum cleanup record from WAL conflicts with
- standby transactions whose snapshots can still <quote>see</> any of
- the rows to be removed.
- </para>
- </listitem>
- <listitem>
- <para>
- Application of a vacuum cleanup record from WAL conflicts with
+ Buffer pin deadlock caused by
+ application of a vacuum cleanup record from WAL conflicts with
queries accessing the target page on the standby, whether or not
the data to be removed is visible.
</para>
@@ -1539,17 +1769,16 @@ if (!triggered)
<para>
Remedial possibilities exist if the number of standby-query cancellations
- is found to be unacceptable. The first option is to connect to the
- primary server and keep a query active for as long as needed to
- run queries on the standby. This prevents <command>VACUUM</> from removing
- recently-dead rows and so cleanup conflicts do not occur.
- This could be done using <xref linkend="dblink"> and
- <function>pg_sleep()</>, or via other mechanisms. If you do this, you
+ is found to be unacceptable. Typically the best option is to enable
+ <varname>hot_standby_feedback</>. This prevents <command>VACUUM</> from
+ removing recently-dead rows and so cleanup conflicts do not occur.
+ If you do this, you
should note that this will delay cleanup of dead rows on the primary,
which may result in undesirable table bloat. However, the cleanup
situation will be no worse than if the standby queries were running
- directly on the primary server, and you are still getting the benefit of
- off-loading execution onto the standby.
+ directly on the primary server. You are still getting the benefit
+ of off-loading execution onto the standby and the query may complete
+ faster than it would have done on the primary server.
<varname>max_standby_archive_delay</> must be kept large in this case,
because delayed WAL files might already contain entries that conflict with
the desired standby queries.
@@ -1563,7 +1792,8 @@ if (!triggered)
a high <varname>max_standby_streaming_delay</>. However it is
difficult to guarantee any specific execution-time window with this
approach, since <varname>vacuum_defer_cleanup_age</> is measured in
- transactions executed on the primary server.
+ transactions executed on the primary server. As of version 9.1, this
+ second option is much less likely to valuable.
</para>
<para>
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 287ad26..eb3cd6f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -56,6 +56,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/fd.h"
#include "storage/predicate.h"
#include "storage/procarray.h"
@@ -2030,6 +2031,14 @@ RecordTransactionCommitPrepared(TransactionId xid,
MyProc->inCommit = false;
END_CRIT_SECTION();
+
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(recptr);
}
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a0170b4..1da42c9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -37,6 +37,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "replication/walsender.h"
+#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
@@ -54,6 +55,7 @@
#include "utils/snapmgr.h"
#include "pg_trace.h"
+extern void WalRcvWakeup(void); /* we are only caller, so include directly */
/*
* User-tweakable parameters
@@ -1055,7 +1057,7 @@ RecordTransactionCommit(void)
* if all to-be-deleted tables are temporary though, since they are lost
* anyway if we crash.)
*/
- if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0)
+ if ((wrote_xlog && XactSyncCommit) || forceSyncCommit || nrels > 0 || SyncRepRequested())
{
/*
* Synchronous commit case:
@@ -1125,6 +1127,14 @@ RecordTransactionCommit(void)
/* Compute latestXid while we have the child XIDs handy */
latestXid = TransactionIdLatest(xid, nchildren, children);
+ /*
+ * Wait for synchronous replication, if required.
+ *
+ * Note that at this stage we have marked clog, but still show as
+ * running in the procarray and continue to hold locks.
+ */
+ SyncRepWaitForLSN(XactLastRecEnd);
+
/* Reset XactLastRecEnd until the next transaction writes something */
XactLastRecEnd.xrecoff = 0;
@@ -4533,6 +4543,14 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn)
*/
if (XactCompletionForceSyncCommit(xlrec))
XLogFlush(lsn);
+
+ /*
+ * If this standby is offering sync_rep_service then signal WALReceiver,
+ * in case it needs to send a reply just for this commit on an
+ * otherwise quiet server.
+ */
+ if (sync_rep_service)
+ WalRcvWakeup();
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5cb657..3fac09a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
#include "replication/walreceiver.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
@@ -157,6 +158,11 @@ static XLogRecPtr LastRec;
* known, need to check the shared state".
*/
static bool LocalRecoveryInProgress = true;
+/*
+ * Local copy of SharedHotStandbyActive variable. False actually means "not
+ * known, need to check the shared state".
+ */
+static bool LocalHotStandbyActive = false;
/*
* Local state for XLogInsertAllowed():
@@ -405,6 +411,12 @@ typedef struct XLogCtlData
bool SharedRecoveryInProgress;
/*
+ * SharedHotStandbyActive indicates if we're still in crash or archive
+ * recovery. Protected by info_lck.
+ */
+ bool SharedHotStandbyActive;
+
+ /*
* recoveryWakeupLatch is used to wake up the startup process to
* continue WAL replay, if it is waiting for WAL to arrive or failover
* trigger file to appear.
@@ -4915,6 +4927,7 @@ XLOGShmemInit(void)
*/
XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
XLogCtl->SharedRecoveryInProgress = true;
+ XLogCtl->SharedHotStandbyActive = false;
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
@@ -5285,6 +5298,12 @@ readRecoveryCommandFile(void)
(errmsg("recovery command file \"%s\" specified neither primary_conninfo nor restore_command",
RECOVERY_COMMAND_FILE),
errhint("The database server will regularly poll the pg_xlog subdirectory to check for files placed there.")));
+
+ if (PrimaryConnInfo == NULL && sync_rep_service)
+ ereport(WARNING,
+ (errmsg("recovery command file \"%s\" specified synchronous_replication_service yet streaming was not requested",
+ RECOVERY_COMMAND_FILE),
+ errhint("Specify primary_conninfo to allow synchronous replication.")));
}
else
{
@@ -6159,6 +6178,13 @@ StartupXLOG(void)
if (XLByteLT(ControlFile->minRecoveryPoint, checkPoint.redo))
ControlFile->minRecoveryPoint = checkPoint.redo;
}
+ else
+ {
+ /*
+ * No need to calculate feedback if we're not in Hot Standby.
+ */
+ hot_standby_feedback = false;
+ }
/*
* set backupStartupPoint if we're starting archive recovery from a
@@ -6778,8 +6804,6 @@ StartupXLOG(void)
static void
CheckRecoveryConsistency(void)
{
- static bool backendsAllowed = false;
-
/*
* Have we passed our safe starting point?
*/
@@ -6799,11 +6823,19 @@ CheckRecoveryConsistency(void)
* enabling connections.
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
- !backendsAllowed &&
+ !LocalHotStandbyActive &&
reachedMinRecoveryPoint &&
IsUnderPostmaster)
{
- backendsAllowed = true;
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->SharedHotStandbyActive = true;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ LocalHotStandbyActive = true;
+
SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
}
}
@@ -6851,6 +6883,38 @@ RecoveryInProgress(void)
}
/*
+ * Is HotStandby active yet? This is only important in special backends
+ * since normal backends won't ever be able to connect until this returns
+ * true.
+ *
+ * Unlike testing standbyState, this works in any process that's connected to
+ * shared memory.
+ */
+bool
+HotStandbyActive(void)
+{
+ /*
+ * We check shared state each time only until Hot Standby is active. We
+ * can't de-activate Hot Standby, so there's no need to keep checking after
+ * the shared variable has once been seen true.
+ */
+ if (LocalHotStandbyActive)
+ return true;
+ else
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ /* spinlock is essential on machines with weak memory ordering! */
+ SpinLockAcquire(&xlogctl->info_lck);
+ LocalHotStandbyActive = xlogctl->SharedHotStandbyActive;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return LocalHotStandbyActive;
+ }
+}
+
+/*
* Is this process allowed to insert new WAL records?
*
* Ordinarily this is essentially equivalent to !RecoveryInProgress().
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 408e174..5ce1888 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -509,6 +509,7 @@ CREATE VIEW pg_stat_replication AS
S.client_port,
S.backend_start,
W.state,
+ W.sync,
W.sent_location,
W.write_location,
W.flush_location,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8f77d1b..1577875 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -275,6 +275,7 @@ typedef enum
PM_STARTUP, /* waiting for startup subprocess */
PM_RECOVERY, /* in archive recovery mode */
PM_HOT_STANDBY, /* in hot standby mode */
+ PM_WAIT_FOR_REPLICATION, /* waiting for sync replication to become active */
PM_RUN, /* normal "database is alive" state */
PM_WAIT_BACKUP, /* waiting for online backup mode to end */
PM_WAIT_READONLY, /* waiting for read only backends to exit */
@@ -735,6 +736,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\" or \"hot_standby\"")));
+ if (!allow_standalone_primary && max_wal_senders == 0)
+ ereport(ERROR,
+ (errmsg("WAL streaming (max_wal_senders > 0) is required if allow_standalone_primary = off")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1845,6 +1849,12 @@ retry1:
(errcode(ERRCODE_CANNOT_CONNECT_NOW),
errmsg("the database system is in recovery mode")));
break;
+ case CAC_REPLICATION_ONLY:
+ if (!am_walsender)
+ ereport(FATAL,
+ (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ errmsg("the database system is waiting for replication to start")));
+ break;
case CAC_TOOMANY:
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
@@ -1942,7 +1952,9 @@ canAcceptConnections(void)
*/
if (pmState != PM_RUN)
{
- if (pmState == PM_WAIT_BACKUP)
+ if (pmState == PM_WAIT_FOR_REPLICATION)
+ result = CAC_REPLICATION_ONLY; /* allow replication only */
+ else if (pmState == PM_WAIT_BACKUP)
result = CAC_WAITBACKUP; /* allow superusers only */
else if (Shutdown > NoShutdown)
return CAC_SHUTDOWN; /* shutdown is pending */
@@ -2396,8 +2408,13 @@ reaper(SIGNAL_ARGS)
* Startup succeeded, commence normal operations
*/
FatalError = false;
- ReachedNormalRunning = true;
- pmState = PM_RUN;
+ if (allow_standalone_primary)
+ {
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+ else
+ pmState = PM_WAIT_FOR_REPLICATION;
/*
* Crank up the background writer, if we didn't do that already
@@ -3233,8 +3250,8 @@ BackendStartup(Port *port)
/* Pass down canAcceptConnections state */
port->canAcceptConnections = canAcceptConnections();
bn->dead_end = (port->canAcceptConnections != CAC_OK &&
- port->canAcceptConnections != CAC_WAITBACKUP);
-
+ port->canAcceptConnections != CAC_WAITBACKUP &&
+ port->canAcceptConnections != CAC_REPLICATION_ONLY);
/*
* Unless it's a dead_end child, assign it a child slot number
*/
@@ -4284,6 +4301,16 @@ sigusr1_handler(SIGNAL_ARGS)
WalReceiverPID = StartWalReceiver();
}
+ if (CheckPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE) &&
+ pmState == PM_WAIT_FOR_REPLICATION)
+ {
+ /* Allow connections now that a synchronous replication standby
+ * has successfully connected and is active.
+ */
+ ReachedNormalRunning = true;
+ pmState = PM_RUN;
+ }
+
PG_SETMASK(&UnBlockSig);
errno = save_errno;
@@ -4534,6 +4561,7 @@ static void
StartAutovacuumWorker(void)
{
Backend *bn;
+ CAC_state cac = CAC_OK;
/*
* If not in condition to run a process, don't try, but handle it like a
@@ -4542,7 +4570,8 @@ StartAutovacuumWorker(void)
* we have to check to avoid race-condition problems during DB state
* changes.
*/
- if (canAcceptConnections() == CAC_OK)
+ cac = canAcceptConnections();
+ if (cac == CAC_OK || cac == CAC_REPLICATION_ONLY)
{
bn = (Backend *) malloc(sizeof(Backend));
if (bn)
diff --git a/src/backend/replication/Makefile b/src/backend/replication/Makefile
index 42c6eaf..3fe490e 100644
--- a/src/backend/replication/Makefile
+++ b/src/backend/replication/Makefile
@@ -13,7 +13,7 @@ top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
OBJS = walsender.o walreceiverfuncs.o walreceiver.o basebackup.o \
- repl_gram.o
+ repl_gram.o syncrep.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 9c2e0d8..7387224 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -1,5 +1,27 @@
src/backend/replication/README
+Overview
+--------
+
+The WALSender sends WAL data and receives replies. The WALReceiver
+receives WAL data and sends replies.
+
+If there is no more WAL data to send then WALSender goes quiet,
+apart from checking for replies. If there is no more WAL data
+to receive then WALReceiver keeps sending replies until all the data
+received has been applied, then it too goes quiet. When all is quiet
+WALReceiver sends regular replies so that WALSender knows the link
+is still working - we don't want to wait until a transaction
+arrives before we try to determine the health of the connection.
+
+WALReceiver sends one reply per message received. If nothing is
+received it sends one reply every time apply pointer advances,
+with a minimum of one reply each cycletime.
+
+For synchronous replication, all decisions about whether to wait
+and how long to wait are taken on the primary. The standby has no
+state information about what is happening on the primary.
+
Walreceiver - libpqwalreceiver API
----------------------------------
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
new file mode 100644
index 0000000..12a3825
--- /dev/null
+++ b/src/backend/replication/syncrep.c
@@ -0,0 +1,641 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.c
+ *
+ * Synchronous replication is new as of PostgreSQL 9.1.
+ *
+ * If requested, transaction commits wait until their commit LSN is
+ * acknowledged by the standby, or the wait hits timeout.
+ *
+ * This module contains the code for waiting and release of backends.
+ * All code in this module executes on the primary. The core streaming
+ * replication transport remains within WALreceiver/WALsender modules.
+ *
+ * The essence of this design is that it isolates all logic about
+ * waiting/releasing onto the primary. The primary is aware of which
+ * standby servers offer a synchronisation service. The standby is
+ * completely unaware of the durability requirements of transactions
+ * on the primary, reducing the complexity of the code and streamlining
+ * both standby operations and network bandwidth because there is no
+ * requirement to ship per-transaction state information.
+ *
+ * The bookeeping approach we take is that a commit is either synchronous
+ * or not synchronous (async). If it is async, we just fastpath out of
+ * here. If it is sync, then it follows exactly one rigid definition of
+ * synchronous replication as laid out by the various parameters. If we
+ * change the definition of replication, we'll need to scan through all
+ * waiting backends to see if we should now release them.
+ *
+ * The best performing way to manage the waiting backends is to have a
+ * single ordered queue of waiting backends, so that we can avoid
+ * searching the through all waiters each time we receive a reply.
+ *
+ * Starting sync replication is a two stage process. First, the standby
+ * must have caught up with the primary; that may take some time. Next,
+ * we must receive a reply from the standby before we change state so
+ * that sync rep is fully active and commits can wait on us.
+ *
+ * XXX Changing state to a sync rep service while we are running allows
+ * us to enable sync replication via SIGHUP on the standby at a later
+ * time, without restart, if we need to do that. Though you can't turn
+ * it off without disconnecting.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/xlog_internal.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
+#include "replication/walsender.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/guc_tables.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+
+
+/* User-settable parameters for sync rep */
+bool sync_rep_mode = false; /* Only set in user backends */
+int sync_rep_timeout_client = 120; /* Only set in user backends */
+int sync_rep_timeout_server = 30; /* Only set in user backends */
+bool sync_rep_service = false; /* Never set in user backends */
+bool hot_standby_feedback = true;
+
+/*
+ * Queuing code is written to allow later extension to multiple
+ * queues. Currently, we use just one queue (==FSYNC).
+ *
+ * XXX We later expect to have RECV, FSYNC and APPLY modes.
+ */
+#define SYNC_REP_NOT_ON_QUEUE -1
+#define SYNC_REP_FSYNC 0
+#define IsOnSyncRepQueue() (current_queue > SYNC_REP_NOT_ON_QUEUE)
+/*
+ * Queue identifier of the queue on which user backend currently waits.
+ */
+static int current_queue = SYNC_REP_NOT_ON_QUEUE;
+
+static void SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid);
+static void SyncRepRemoveFromQueue(void);
+static void SyncRepAddToQueue(int qid);
+static bool SyncRepServiceAvailable(void);
+static long SyncRepGetWaitTimeout(void);
+
+static void SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn);
+
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for normal user backends
+ * ===========================================================
+ */
+
+/*
+ * Wait for synchronous replication, if requested by user.
+ */
+extern void
+SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
+{
+ /*
+ * Fast exit if user has requested async replication, or
+ * streaming replication is inactive in this server.
+ */
+ if (max_wal_senders == 0 || !sync_rep_mode)
+ return;
+
+ Assert(sync_rep_mode);
+
+ if (allow_standalone_primary)
+ {
+ bool avail_sync_mode;
+
+ /*
+ * Check that the service level we want is available.
+ * If not, downgrade the service level to async.
+ */
+ avail_sync_mode = SyncRepServiceAvailable();
+
+ /*
+ * Perform the wait here, then drop through and exit.
+ */
+ if (avail_sync_mode)
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+ else
+ {
+ /*
+ * Wait only on the service level requested,
+ * whether or not it is currently available.
+ * Sounds weird, but this mode exists to protect
+ * against changes that will only occur on primary.
+ */
+ SyncRepWaitOnQueue(XactCommitLSN, 0);
+ }
+}
+
+/*
+ * Wait for specified LSN to be confirmed at the requested level
+ * of durability. Each proc has its own wait latch, so we perform
+ * a normal latch check/wait loop here.
+ */
+static void
+SyncRepWaitOnQueue(XLogRecPtr XactCommitLSN, int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[0]);
+ TimestampTz now = GetCurrentTransactionStopTimestamp();
+ long timeout = SyncRepGetWaitTimeout(); /* seconds */
+ char *new_status = NULL;
+ const char *old_status;
+ int len;
+
+ /*
+ * No need to wait for autovacuums. If the standby does go away and
+ * we wait for it to return we may as well do some usefulwork locally.
+ * This is critical since we may need to perform emergency vacuuming
+ * and cannot wait for standby to return.
+ */
+ if (IsAutoVacuumWorkerProcess())
+ return;
+
+ ereport(DEBUG2,
+ (errmsg("synchronous replication waiting for %X/%X starting at %s",
+ XactCommitLSN.xlogid,
+ XactCommitLSN.xrecoff,
+ timestamptz_to_str(GetCurrentTransactionStopTimestamp()))));
+
+ for (;;)
+ {
+ ResetLatch(&MyProc->waitLatch);
+
+ /*
+ * First time through, add ourselves to the appropriate queue.
+ */
+ if (!IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ {
+ /* No need to wait */
+ SpinLockRelease(&queue->qlock);
+ return;
+ }
+
+ /*
+ * Set our waitLSN so WALSender will know when to wake us.
+ * We set this before we add ourselves to queue, so that
+ * any proc on the queue can be examined freely without
+ * taking a lock on each process in the queue.
+ */
+ MyProc->waitLSN = XactCommitLSN;
+ SyncRepAddToQueue(qid);
+ SpinLockRelease(&queue->qlock);
+ current_queue = qid; /* Remember which queue we're on */
+
+ /*
+ * Alter ps display to show waiting for sync rep.
+ */
+ old_status = get_ps_display(&len);
+ new_status = (char *) palloc(len + 21 + 1);
+ memcpy(new_status, old_status, len);
+ strcpy(new_status + len, " waiting for sync rep");
+ set_ps_display(new_status, false);
+ new_status[len] = '\0'; /* truncate off " waiting" */
+ }
+ else
+ {
+ bool release = false;
+ bool timeout = false;
+
+ SpinLockAcquire(&queue->qlock);
+
+ /*
+ * Check the LSN on our queue and if its moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ now,
+ timeout))
+ {
+ release = true;
+ timeout = true;
+ }
+
+ if (release)
+ {
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+
+ if (new_status)
+ {
+ /* Reset ps display */
+ set_ps_display(new_status, false);
+ pfree(new_status);
+ }
+
+ /*
+ * Our response to the timeout is to simply post a NOTICE and
+ * then return to the user. The commit has happened, we just
+ * haven't been able to verify it has been replicated to the
+ * level requested.
+ *
+ * XXX We could check here to see if our LSN has been sent to
+ * another standby that offers a lower level of service. That
+ * could be true if we had, for example, requested 'apply'
+ * with two standbys, one at 'apply' and one at 'recv' and the
+ * apply standby has just gone down. Something for the weekend.
+ */
+ if (timeout)
+ ereport(NOTICE,
+ (errmsg("synchronous replication timeout at %s",
+ timestamptz_to_str(now))));
+ else
+ ereport(DEBUG2,
+ (errmsg("synchronous replication wait complete at %s",
+ timestamptz_to_str(now))));
+
+ /* XXX Do we need to unset the latch? */
+ return;
+ }
+
+ SpinLockRelease(&queue->qlock);
+ }
+
+ WaitLatch(&MyProc->waitLatch, timeout);
+ now = GetCurrentTimestamp();
+ }
+}
+
+/*
+ * Remove myself from sync rep wait queue.
+ *
+ * Assume on queue at start; will not be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ *
+ * XXX Implements design pattern "Reinvent Wheel", think about changing
+ */
+void
+SyncRepRemoveFromQueue(void)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[current_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+
+ Assert(IsOnSyncRepQueue());
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "removing myself from queue %d", current_queue);
+#endif
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ if (proc == MyProc)
+ {
+ elog(LOG, "proc %d lsn %X/%X is MyProc",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ else
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+ }
+ numprocs++;
+ }
+
+ proc = queue->head;
+
+ if (proc == MyProc)
+ {
+ if (MyProc->lwWaitLink == NULL)
+ {
+ /*
+ * We were the only waiter on the queue. Reset head and tail.
+ */
+ Assert(queue->tail == MyProc);
+ queue->head = NULL;
+ queue->tail = NULL;
+ }
+ else
+ /*
+ * Move head to next proc on the queue.
+ */
+ queue->head = MyProc->lwWaitLink;
+ }
+ else
+ {
+ while (proc->lwWaitLink != NULL)
+ {
+ /* Are we the next proc in our traversal of the queue? */
+ if (proc->lwWaitLink == MyProc)
+ {
+ /*
+ * Remove ourselves from middle of queue.
+ * No need to touch head or tail.
+ */
+ proc->lwWaitLink = MyProc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL)
+ elog(WARNING, "could not locate ourselves on wait queue");
+ proc = proc->lwWaitLink;
+ }
+
+ if (proc->lwWaitLink == NULL) /* At tail */
+ {
+ Assert(proc == MyProc);
+ /* Remove ourselves from tail of queue */
+ Assert(queue->tail == MyProc);
+ queue->tail = proc;
+ proc->lwWaitLink = NULL;
+ }
+ }
+ MyProc->lwWaitLink = NULL;
+ current_queue = SYNC_REP_NOT_ON_QUEUE;
+}
+
+/*
+ * Add myself to sync rep wait queue.
+ *
+ * Assume not on queue at start; will be on queue at end.
+ * Queue is already locked at start and remains locked on exit.
+ */
+static void
+SyncRepAddToQueue(int qid)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+ PGPROC *tail = queue->tail;
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG3, "adding myself to queue %d", qid);
+#endif
+
+ /*
+ * Add myself to tail of wait queue.
+ */
+ if (tail == NULL)
+ {
+ queue->head = MyProc;
+ queue->tail = MyProc;
+ }
+ else
+ {
+ /*
+ * XXX extra code needed here to maintain sorted invariant.
+ * Our approach should be same as racing car - slow in, fast out.
+ */
+ Assert(tail->lwWaitLink == NULL);
+ tail->lwWaitLink = MyProc;
+ }
+ queue->tail = MyProc;
+
+ /*
+ * This used to be an Assert, but it keeps failing... why?
+ */
+ MyProc->lwWaitLink = NULL; /* to be sure */
+}
+
+/*
+ * Dynamically decide the sync rep wait mode. It may seem a trifle
+ * wasteful to do this for every transaction but we need to do this
+ * so we can cope sensibly with standby disconnections. It's OK to
+ * spend a few cycles here anyway, since while we're doing this the
+ * WALSender will be sending the data we want to wait for, so this
+ * is dead time and the user has requested to wait anyway.
+ */
+static bool
+SyncRepServiceAvailable(void)
+{
+ bool result = false;
+
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ result = WalSndCtl->sync_rep_service_available;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+
+ return result;
+}
+
+/*
+ * Allows more complex decision making about what the wait time should be.
+ */
+static long
+SyncRepGetWaitTimeout(void)
+{
+ if (sync_rep_timeout_client <= 0)
+ return -1L;
+
+ return 1000000L * sync_rep_timeout_client;
+}
+
+void
+SyncRepCleanupAtProcExit(int code, Datum arg)
+{
+/*
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[qid]);
+
+ if (IsOnSyncRepQueue())
+ {
+ SpinLockAcquire(&queue->qlock);
+ SyncRepRemoveFromQueue();
+ SpinLockRelease(&queue->qlock);
+ }
+*/
+
+ if (MyProc != NULL && MyProc->ownLatch)
+ {
+ DisownLatch(&MyProc->waitLatch);
+ MyProc->ownLatch = false;
+ }
+}
+
+/*
+ * ===========================================================
+ * Synchronous Replication functions for wal sender processes
+ * ===========================================================
+ */
+
+/*
+ * Update the LSNs on each queue based upon our latest state. This
+ * implements a simple policy of first-valid-standby-releases-waiter.
+ *
+ * Other policies are possible, which would change what we do here and what
+ * perhaps also which information we store as well.
+ */
+void
+SyncRepReleaseWaiters(bool timeout)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ int mode;
+
+ /*
+ * If we are now streaming, and haven't yet enabled the sync rep service
+ * do so now. We don't enable sync rep service during a base backup since
+ * during that action we aren't sending WAL at all, so there cannot be
+ * any meaningful replies. We don't enable sync rep service while we
+ * are still in catchup mode either, since clients might experience an
+ * extended wait (perhaps hours) if they waited at that point.
+ *
+ * Note that we do release waiters, even if they aren't enabled yet.
+ * That sounds strange, but we may have dropped the connection and
+ * reconnected, so there may still be clients waiting for a response
+ * from when we were connected previously.
+ *
+ * If we already have a sync rep server connected, don't enable
+ * this server as well.
+ *
+ * XXX expect to be able to support multiple sync standbys in future.
+ */
+ if (!MyWalSnd->sync_rep_service &&
+ MyWalSnd->state == WALSNDSTATE_STREAMING &&
+ !SyncRepServiceAvailable())
+ {
+ ereport(LOG,
+ (errmsg("enabling synchronous replication service for standby")));
+
+ /*
+ * Update state for this WAL sender.
+ */
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalSnd *walsnd = MyWalSnd;
+
+ SpinLockAcquire(&walsnd->mutex);
+ walsnd->sync_rep_service = true;
+ SpinLockRelease(&walsnd->mutex);
+ }
+
+ /*
+ * We have at least one standby, so we're open for business.
+ */
+ {
+ SpinLockAcquire(&WalSndCtl->ctlmutex);
+ WalSndCtl->sync_rep_service_available = true;
+ SpinLockRelease(&WalSndCtl->ctlmutex);
+ }
+
+ /*
+ * Let postmaster know we can allow connections, if the user
+ * requested waiting until sync rep was active before starting.
+ * We send this unconditionally to avoid more complexity in
+ * postmaster code.
+ */
+ if (IsUnderPostmaster)
+ SendPostmasterSignal(PMSIGNAL_SYNC_REPLICATION_ACTIVE);
+ }
+
+ /*
+ * No point trying to release waiters while doing a base backup
+ */
+ if (MyWalSnd->state == WALSNDSTATE_BACKUP)
+ return;
+
+#ifdef SYNCREP_DEBUG
+ elog(LOG, "releasing waiters up to flush = %X/%X",
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+
+
+ /*
+ * Only maintain LSNs of queues for which we advertise a service.
+ * This is important to ensure that we only wakeup users when a
+ * preferred standby has reached the required LSN.
+ *
+ * Since sycnhronous_replication_mode is currently a boolean, we either
+ * offer all modes, or none.
+ */
+ for (mode = 0; mode < NUM_SYNC_REP_WAIT_MODES; mode++)
+ {
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[mode]);
+
+ /*
+ * Lock the queue. Not really necessary with just one sync standby
+ * but it makes clear what needs to happen.
+ */
+ SpinLockAcquire(&queue->qlock);
+ if (XLByteLT(queue->lsn, MyWalSnd->flush))
+ {
+ /*
+ * Set the lsn first so that when we wake backends they will
+ * release up to this location.
+ */
+ queue->lsn = MyWalSnd->flush;
+ SyncRepWakeFromQueue(mode, MyWalSnd->flush);
+ }
+ SpinLockRelease(&queue->qlock);
+
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "q%d queue = %X/%X flush = %X/%X", mode,
+ queue->lsn.xlogid, queue->lsn.xrecoff,
+ MyWalSnd->flush.xlogid, MyWalSnd->flush.xrecoff);
+#endif
+ }
+}
+
+/*
+ * Walk queue from head setting the latches of any procs that need
+ * to be woken. We don't modify the queue, we leave that for individual
+ * procs to release themselves.
+ *
+ * Must hold spinlock on queue.
+ */
+static void
+SyncRepWakeFromQueue(int wait_queue, XLogRecPtr lsn)
+{
+ volatile WalSndCtlData *walsndctl = WalSndCtl;
+ volatile SyncRepQueue *queue = &(walsndctl->sync_rep_queue[wait_queue]);
+ PGPROC *proc = queue->head;
+ int numprocs = 0;
+ int totalprocs = 0;
+
+ if (proc == NULL)
+ return;
+
+ for (; proc != NULL; proc = proc->lwWaitLink)
+ {
+ elog(LOG, "proc %d lsn %X/%X",
+ numprocs,
+ proc->waitLSN.xlogid,
+ proc->waitLSN.xrecoff);
+
+ if (XLByteLE(proc->waitLSN, lsn))
+ {
+ numprocs++;
+ SetLatch(&proc->waitLatch);
+ }
+ totalprocs++;
+ }
+ elog(DEBUG2, "released %d procs out of %d waiting procs", numprocs, totalprocs);
+#ifdef SYNCREP_DEBUG
+ elog(DEBUG2, "released %d procs up to %X/%X", numprocs, lsn.xlogid, lsn.xrecoff);
+#endif
+}
+
+void
+SyncRepTimeoutExceeded(void)
+{
+ SyncRepReleaseWaiters(true);
+}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 30e35db..f35ab4a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -38,6 +38,7 @@
#include <signal.h>
#include <unistd.h>
+#include "access/transam.h"
#include "access/xlog_internal.h"
#include "libpq/pqsignal.h"
#include "miscadmin.h"
@@ -45,6 +46,7 @@
#include "replication/walreceiver.h"
#include "storage/ipc.h"
#include "storage/pmsignal.h"
+#include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -87,9 +89,9 @@ static volatile sig_atomic_t got_SIGTERM = false;
*/
static struct
{
- XLogRecPtr Write; /* last byte + 1 written out in the standby */
- XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
-} LogstreamResult;
+ XLogRecPtr Write; /* last byte + 1 written out in the standby */
+ XLogRecPtr Flush; /* last byte + 1 flushed in the standby */
+} LogstreamResult;
static StandbyReplyMessage reply_message;
@@ -210,6 +212,8 @@ WalReceiverMain(void)
/* Advertise our PID so that the startup process can kill us */
walrcv->pid = MyProcPid;
walrcv->walRcvState = WALRCV_RUNNING;
+ elog(DEBUG2, "WALreceiver starting");
+ OwnLatch(&WalRcv->latch); /* Run before signals enabled, since they can wakeup latch */
/* Fetch information required to start streaming */
strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
@@ -277,6 +281,7 @@ WalReceiverMain(void)
unsigned char type;
char *buf;
int len;
+ bool received_all = false;
/*
* Emergency bailout if postmaster has died. This is to avoid the
@@ -302,24 +307,44 @@ WalReceiverMain(void)
ProcessConfigFile(PGC_SIGHUP);
}
- /* Wait a while for data to arrive */
- if (walrcv_receive(NAPTIME_PER_CYCLE, &type, &buf, &len))
+ ResetLatch(&WalRcv->latch);
+
+ if (walrcv_receive(0, &type, &buf, &len))
{
- /* Accept the received data, and process it */
+ received_all = false;
XLogWalRcvProcessMsg(type, buf, len);
+ }
+ else
+ received_all = true;
- /* Receive any more data we can without sleeping */
- while (walrcv_receive(0, &type, &buf, &len))
- XLogWalRcvProcessMsg(type, buf, len);
+ XLogWalRcvSendReply();
- /* Let the master know that we received some data. */
+ if (received_all && !got_SIGHUP && !got_SIGTERM)
+ {
+ /*
+ * Flush, then reply.
+ *
+ * XXX We really need the WALWriter active as well
+ */
+ XLogWalRcvFlush();
XLogWalRcvSendReply();
/*
- * If we've written some records, flush them to disk and let the
- * startup process know about them.
+ * Sleep for up to 500 ms, the fixed keepalive delay.
+ *
+ * We will be woken if new data is received from primary
+ * or if a commit is applied. This is sub-optimal in the
+ * case where a group of commits arrive, then it all goes
+ * quiet, but its not worth the extra code to handle both
+ * that and the simple case of a single commit.
+ *
+ * Note that we do not need to wake up when the Startup
+ * process has applied the last outstanding record. That
+ * is interesting iff that is a commit record.
*/
- XLogWalRcvFlush();
+ pg_usleep(1000000L); /* slow down loop for debugging */
+// WaitLatchOrSocket(&WalRcv->latch, MyProcPort->sock,
+// 500000L);
}
else
{
@@ -351,6 +376,8 @@ WalRcvDie(int code, Datum arg)
walrcv->pid = 0;
SpinLockRelease(&walrcv->mutex);
+ DisownLatch(&WalRcv->latch);
+
/* Terminate the connection gracefully. */
if (walrcv_disconnect != NULL)
walrcv_disconnect();
@@ -361,6 +388,7 @@ static void
WalRcvSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
+ WalRcvWakeup();
}
/* SIGTERM: set flag for main loop, or shutdown immediately if safe */
@@ -368,6 +396,7 @@ static void
WalRcvShutdownHandler(SIGNAL_ARGS)
{
got_SIGTERM = true;
+ WalRcvWakeup();
/* Don't joggle the elbow of proc_exit */
if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
@@ -609,14 +638,28 @@ XLogWalRcvSendReply(void)
reply_message.flush = LogstreamResult.Flush;
reply_message.apply = GetXLogReplayRecPtr();
reply_message.sendTime = now;
+ if (hot_standby_feedback && HotStandbyActive())
+ reply_message.xmin = GetOldestXmin(true, false);
+ else
+ reply_message.xmin = InvalidTransactionId;
- elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
+ elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X xmin %d",
reply_message.write.xlogid, reply_message.write.xrecoff,
reply_message.flush.xlogid, reply_message.flush.xrecoff,
- reply_message.apply.xlogid, reply_message.apply.xrecoff);
+ reply_message.apply.xlogid, reply_message.apply.xrecoff,
+ reply_message.xmin);
/* Prepend with the message type and send it. */
buf[0] = 'r';
memcpy(&buf[1], &reply_message, sizeof(StandbyReplyMessage));
walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
}
+
+/* Wake up the WalRcv
+ * Prototype goes in xact.c since that is only external caller
+ */
+void
+WalRcvWakeup(void)
+{
+ SetLatch(&WalRcv->latch);
+};
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 04c9004..da97528 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -64,6 +64,7 @@ WalRcvShmemInit(void)
MemSet(WalRcv, 0, WalRcvShmemSize());
WalRcv->walRcvState = WALRCV_STOPPED;
SpinLockInit(&WalRcv->mutex);
+ InitSharedLatch(&WalRcv->latch);
}
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3ad95b4..2ace040 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,7 +65,7 @@
WalSndCtlData *WalSndCtl = NULL;
/* My slot in the shared memory array */
-static WalSnd *MyWalSnd = NULL;
+WalSnd *MyWalSnd = NULL;
/* Global state */
bool am_walsender = false; /* Am I a walsender process ? */
@@ -73,6 +73,7 @@ bool am_walsender = false; /* Am I a walsender process ? */
/* User-settable parameters for walsender */
int max_wal_senders = 0; /* the maximum number of concurrent walsenders */
int WalSndDelay = 200; /* max sleep time between some actions */
+bool allow_standalone_primary = true; /* action if no sync standby active */
/*
* These variables are used similarly to openLogFile/Id/Seg/Off,
@@ -89,6 +90,8 @@ static uint32 sendOff = 0;
*/
static XLogRecPtr sentPtr = {0, 0};
+static TimestampTz last_reply_timestamp;
+
/* Flags set by signal handlers for later service in main loop */
static volatile sig_atomic_t got_SIGHUP = false;
volatile sig_atomic_t walsender_shutdown_requested = false;
@@ -113,7 +116,6 @@ static void StartReplication(StartReplicationCmd * cmd);
static void ProcessStandbyReplyMessage(void);
static void ProcessRepliesIfAny(void);
-
/* Main entry point for walsender process */
int
WalSenderMain(void)
@@ -150,6 +152,8 @@ WalSenderMain(void)
/* Unblock signals (they were blocked when the postmaster forked us) */
PG_SETMASK(&UnBlockSig);
+ elog(DEBUG2, "WALsender starting");
+
/* Tell the standby that walsender is ready for receiving commands */
ReadyForQuery(DestRemote);
@@ -166,6 +170,8 @@ WalSenderMain(void)
SpinLockRelease(&walsnd->mutex);
}
+ elog(DEBUG2, "WALsender handshake complete");
+
/* Main loop of walsender */
return WalSndLoop();
}
@@ -250,6 +256,11 @@ WalSndHandshake(void)
errmsg("invalid standby handshake message type %d", firstchar)));
}
}
+
+ /*
+ * Initialize our timeout checking mechanism.
+ */
+ last_reply_timestamp = GetCurrentTimestamp();
}
/*
@@ -417,9 +428,11 @@ HandleReplicationCommand(const char *cmd_string)
/* break out of the loop */
replication_started = true;
+ WalSndSetState(WALSNDSTATE_CATCHUP);
break;
case T_BaseBackupCmd:
+ WalSndSetState(WALSNDSTATE_BACKUP);
SendBaseBackup((BaseBackupCmd *) cmd_node);
/* Send CommandComplete and ReadyForQuery messages */
@@ -524,10 +537,11 @@ ProcessStandbyReplyMessage(void)
pq_copymsgbytes(&input_message, (char *) &reply, sizeof(StandbyReplyMessage));
- elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X ",
+ elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X xmin %d",
reply.write.xlogid, reply.write.xrecoff,
reply.flush.xlogid, reply.flush.xrecoff,
- reply.apply.xlogid, reply.apply.xrecoff);
+ reply.apply.xlogid, reply.apply.xrecoff,
+ reply.xmin);
/*
* Update shared state for this WalSender process
@@ -541,8 +555,16 @@ ProcessStandbyReplyMessage(void)
walsnd->write = reply.write;
walsnd->flush = reply.flush;
walsnd->apply = reply.apply;
+ if (TransactionIdIsValid(reply.xmin) &&
+ TransactionIdPrecedes(MyProc->xmin, reply.xmin))
+ MyProc->xmin = reply.xmin;
SpinLockRelease(&walsnd->mutex);
}
+
+ /*
+ * Release any backends waiting to commit.
+ */
+ SyncRepReleaseWaiters(false);
}
/* Main loop of walsender process */
@@ -592,7 +614,11 @@ WalSndLoop(void)
/* Normal exit from the walsender is here */
if (walsender_shutdown_requested)
{
- /* Inform the standby that XLOG streaming was done */
+ ProcessRepliesIfAny();
+
+ /* Inform the standby that XLOG streaming was done
+ * by sending CommandComplete message.
+ */
pq_puttextmessage('C', "COPY 0");
pq_flush();
@@ -600,12 +626,31 @@ WalSndLoop(void)
}
/*
- * If we had sent all accumulated WAL in last round, nap for the
- * configured time before retrying.
+ * If we had sent all accumulated WAL in last round, then we don't
+ * have much to do. We still expect a steady stream of replies from
+ * standby. It is important to note that we don't keep track of
+ * whether or not there are backends waiting here, since that
+ * is potentially very complex state information.
+ *
+ * Also note that there is no delay between sending data and
+ * checking for the replies. We expect replies to take some time
+ * and we are more concerned with overall throughput than absolute
+ * response time to any single request.
*/
if (caughtup)
{
/*
+ * If we were still catching up, change state to streaming.
+ * While in the initial catchup phase, clients waiting for
+ * a response from the standby would wait for a very long
+ * time, so we need to have a one-way state transition to avoid
+ * problems. No need to grab a lock for the check; we are the
+ * only one to ever change the state.
+ */
+ if (MyWalSnd->state < WALSNDSTATE_STREAMING)
+ WalSndSetState(WALSNDSTATE_STREAMING);
+
+ /*
* Even if we wrote all the WAL that was available when we started
* sending, more might have arrived while we were sending this
* batch. We had the latch set while sending, so we have not
@@ -618,6 +663,13 @@ WalSndLoop(void)
break;
if (caughtup && !got_SIGHUP && !walsender_ready_to_stop && !walsender_shutdown_requested)
{
+ long timeout;
+
+ if (sync_rep_timeout_server == -1)
+ timeout = -1L;
+ else
+ timeout = 1000000L * sync_rep_timeout_server;
+
/*
* XXX: We don't really need the periodic wakeups anymore,
* WaitLatchOrSocket should reliably wake up as soon as
@@ -625,8 +677,14 @@ WalSndLoop(void)
*/
/* Sleep */
- WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
- WalSndDelay * 1000L);
+ if (WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
+ timeout) == 0)
+ {
+ ereport(LOG,
+ (errmsg("streaming replication timeout after %d s",
+ sync_rep_timeout_server)));
+ break;
+ }
}
}
else
@@ -642,7 +700,7 @@ WalSndLoop(void)
}
/*
- * Get here on send failure. Clean up and exit.
+ * Get here on send failure or timeout. Clean up and exit.
*
* Reset whereToSendOutput to prevent ereport from attempting to send any
* more messages to the standby.
@@ -873,9 +931,9 @@ XLogSend(char *msgbuf, bool *caughtup)
* Attempt to send all data that's already been written out and fsync'd to
* disk. We cannot go further than what's been written out given the
* current implementation of XLogRead(). And in any case it's unsafe to
- * send WAL that is not securely down to disk on the master: if the master
+ * send WAL that is not securely down to disk on the primary: if the primary
* subsequently crashes and restarts, slaves must not have applied any WAL
- * that gets lost on the master.
+ * that gets lost on the primary.
*/
SendRqstPtr = GetFlushRecPtr();
@@ -953,6 +1011,9 @@ XLogSend(char *msgbuf, bool *caughtup)
msghdr.walEnd = SendRqstPtr;
msghdr.sendTime = GetCurrentTimestamp();
+ elog(DEBUG2, "sent = %X/%X ",
+ startptr.xlogid, startptr.xrecoff);
+
memcpy(msgbuf + 1, &msghdr, sizeof(WalDataMessageHeader));
pq_putmessage('d', msgbuf, 1 + sizeof(WalDataMessageHeader) + nbytes);
@@ -1110,6 +1171,16 @@ WalSndShmemInit(void)
SpinLockInit(&walsnd->mutex);
InitSharedLatch(&walsnd->latch);
}
+
+ /*
+ * Initialise the spinlocks on each sync rep queue
+ */
+ for (i = 0; i < NUM_SYNC_REP_WAIT_MODES; i++)
+ {
+ SyncRepQueue *queue = &WalSndCtl->sync_rep_queue[i];
+
+ SpinLockInit(&queue->qlock);
+ }
}
}
@@ -1169,7 +1240,7 @@ WalSndGetStateString(WalSndState state)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 6
+#define PG_STAT_GET_WAL_SENDERS_COLS 7
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
TupleDesc tupdesc;
Tuplestorestate *tupstore;
@@ -1212,6 +1283,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
XLogRecPtr flush;
XLogRecPtr apply;
WalSndState state;
+ bool sync_rep_service;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS];
@@ -1224,6 +1296,7 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
write = walsnd->write;
flush = walsnd->flush;
apply = walsnd->apply;
+ sync_rep_service = walsnd->sync_rep_service;
SpinLockRelease(&walsnd->mutex);
memset(nulls, 0, sizeof(nulls));
@@ -1240,32 +1313,34 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
nulls[3] = true;
nulls[4] = true;
nulls[5] = true;
+ nulls[6] = true;
}
else
{
values[1] = CStringGetTextDatum(WalSndGetStateString(state));
+ values[2] = BoolGetDatum(sync_rep_service);
snprintf(location, sizeof(location), "%X/%X",
sentPtr.xlogid, sentPtr.xrecoff);
- values[2] = CStringGetTextDatum(location);
+ values[3] = CStringGetTextDatum(location);
if (write.xlogid == 0 && write.xrecoff == 0)
nulls[3] = true;
snprintf(location, sizeof(location), "%X/%X",
write.xlogid, write.xrecoff);
- values[3] = CStringGetTextDatum(location);
+ values[4] = CStringGetTextDatum(location);
if (flush.xlogid == 0 && flush.xrecoff == 0)
nulls[4] = true;
snprintf(location, sizeof(location), "%X/%X",
flush.xlogid, flush.xrecoff);
- values[4] = CStringGetTextDatum(location);
+ values[5] = CStringGetTextDatum(location);
if (apply.xlogid == 0 && apply.xrecoff == 0)
nulls[5] = true;
snprintf(location, sizeof(location), "%X/%X",
apply.xlogid, apply.xrecoff);
- values[5] = CStringGetTextDatum(location);
+ values[6] = CStringGetTextDatum(location);
}
tuplestore_putvalues(tupstore, tupdesc, values, nulls);
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index be577bc..7aa7671 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -39,6 +39,7 @@
#include "access/xact.h"
#include "miscadmin.h"
#include "postmaster/autovacuum.h"
+#include "replication/syncrep.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
@@ -196,6 +197,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
ProcGlobal->freeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -214,6 +216,7 @@ InitProcGlobal(void)
PGSemaphoreCreate(&(procs[i].sem));
procs[i].links.next = (SHM_QUEUE *) ProcGlobal->autovacFreeProcs;
ProcGlobal->autovacFreeProcs = &procs[i];
+ InitSharedLatch(&procs[i].waitLatch);
}
/*
@@ -224,6 +227,7 @@ InitProcGlobal(void)
{
AuxiliaryProcs[i].pid = 0; /* marks auxiliary proc as not in use */
PGSemaphoreCreate(&(AuxiliaryProcs[i].sem));
+ InitSharedLatch(&procs[i].waitLatch);
}
/* Create ProcStructLock spinlock, too */
@@ -326,6 +330,13 @@ InitProcess(void)
SHMQueueInit(&(MyProc->myProcLocks[i]));
MyProc->recoveryConflictPending = false;
+ /* Initialise the waitLSN for sync rep */
+ MyProc->waitLSN.xlogid = 0;
+ MyProc->waitLSN.xrecoff = 0;
+
+ OwnLatch((Latch *) &MyProc->waitLatch);
+ MyProc->ownLatch = true;
+
/*
* We might be reusing a semaphore that belonged to a failed process. So
* be careful and reinitialize its value here. (This is not strictly
@@ -365,6 +376,7 @@ InitProcessPhase2(void)
/*
* Arrange to clean that up at backend exit.
*/
+ on_shmem_exit(SyncRepCleanupAtProcExit, 0);
on_shmem_exit(RemoveProcFromArray, 0);
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 470183d..8c8e381 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -56,6 +56,7 @@
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
+#include "replication/syncrep.h"
#include "replication/walsender.h"
#include "storage/bufmgr.h"
#include "storage/standby.h"
@@ -620,6 +621,15 @@ const char *const config_type_names[] =
static struct config_bool ConfigureNamesBool[] =
{
{
+ {"allow_standalone_primary", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Refuse connections on startup and force users to wait forever if synchronous replication has failed."),
+ NULL
+ },
+ &allow_standalone_primary,
+ true, NULL, NULL
+ },
+
+ {
{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of sequential-scan plans."),
NULL
@@ -1279,6 +1289,33 @@ static struct config_bool ConfigureNamesBool[] =
},
{
+ {"synchronous_replication", PGC_USERSET, WAL_SETTINGS,
+ gettext_noop("Requests synchronous replication."),
+ NULL
+ },
+ &sync_rep_mode,
+ false, NULL, NULL
+ },
+
+ {
+ {"synchronous_replication_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a standby to primary for synchronous replication."),
+ NULL
+ },
+ &sync_rep_service,
+ true, NULL, NULL
+ },
+
+ {
+ {"hot_standby_feedback", PGC_POSTMASTER, WAL_STANDBY_SERVERS,
+ gettext_noop("Allows feedback from a hot standby to primary to avoid query conflicts."),
+ NULL
+ },
+ &hot_standby_feedback,
+ false, NULL, NULL
+ },
+
+ {
{"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS,
gettext_noop("Allows modifications of the structure of system tables."),
NULL,
@@ -1484,6 +1521,26 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"replication_timeout_client", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Clients waiting for confirmation will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_client,
+ 120, -1, INT_MAX, NULL, NULL
+ },
+
+ {
+ {"replication_timeout_server", PGC_SIGHUP, WAL_SETTINGS,
+ gettext_noop("Replication connection will timeout after this duration."),
+ NULL,
+ GUC_UNIT_S
+ },
+ &sync_rep_timeout_server,
+ 30, -1, INT_MAX, NULL, NULL
+ },
+
+ {
{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
gettext_noop("Sets the maximum number of temporary buffers used by each session."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5d31365..56c544d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -184,7 +184,15 @@
#archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
-# - Streaming Replication -
+# - Replication - User Settings
+
+#synchronous_replication = off # commit waits for reply from standby
+#replication_timeout_client = 120 # -1 means wait forever
+
+# - Streaming Replication - Server Settings
+
+#allow_standalone_primary = on # sync rep parameter
+#replication_timeout_client = 30 # -1 means wait forever
#max_wal_senders = 0 # max number of walsender processes
# (change requires restart)
@@ -196,6 +204,8 @@
#hot_standby = off # "on" allows queries during recovery
# (change requires restart)
+#hot_standby_feedback = off # info from standby to prevent query conflicts
+#synchronous_replication_feedback = off # allows sync replication
#max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1803d5a..0fcbfe8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -289,6 +289,7 @@ extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
extern void issue_xlog_fsync(int fd, uint32 log, uint32 seg);
extern bool RecoveryInProgress(void);
+extern bool HotStandbyActive(void);
extern bool XLogInsertAllowed(void);
extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream);
extern XLogRecPtr GetXLogReplayRecPtr(void);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index cb275b8..30fb3bf 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3075,7 +3075,7 @@ DATA(insert OID = 1936 ( pg_stat_get_backend_idset PGNSP PGUID 12 1 100 0 f f
DESCR("statistics: currently active backend IDs");
DATA(insert OID = 2022 ( pg_stat_get_activity PGNSP PGUID 12 1 100 0 f f f f t s 1 0 2249 "23" "{23,26,23,26,25,25,16,1184,1184,1184,869,23}" "{i,o,o,o,o,o,o,o,o,o,o,o}" "{pid,datid,procpid,usesysid,application_name,current_query,waiting,xact_start,query_start,backend_start,client_addr,client_port}" _null_ pg_stat_get_activity _null_ _null_ _null_ ));
DESCR("statistics: information about currently active backends");
-DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,25,25,25,25}" "{o,o,o,o,o,o}" "{procpid,state,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 ( pg_stat_get_wal_senders PGNSP PGUID 12 1 10 0 f f f f t s 0 0 2249 "" "{23,25,16,25,25,25,25}" "{o,o,o,o,o,o,o}" "{procpid,state,sync,sent_location,write_location,flush_location,apply_location}" _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
DESCR("statistics: information about currently active replication");
DATA(insert OID = 2026 ( pg_backend_pid PGNSP PGUID 12 1 0 0 f f f t f s 0 0 23 "" _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
DESCR("statistics: current backend PID");
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 4cdb15f..9a00b2c 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -73,7 +73,7 @@ typedef struct
typedef enum CAC_state
{
CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY,
- CAC_WAITBACKUP
+ CAC_WAITBACKUP, CAC_REPLICATION_ONLY
} CAC_state;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
new file mode 100644
index 0000000..a071b9a
--- /dev/null
+++ b/src/include/replication/syncrep.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * syncrep.h
+ * Exports from replication/syncrep.c.
+ *
+ * Portions Copyright (c) 2010-2010, PostgreSQL Global Development Group
+ *
+ * $PostgreSQL$
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _SYNCREP_H
+#define _SYNCREP_H
+
+#include "access/xlog.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+
+#define SyncRepRequested() (sync_rep_mode)
+#define StandbyOffersSyncRepService() (sync_rep_service)
+
+/*
+ * There is no reply from standby to primary for async mode, so the reply
+ * message needs one less slot than the maximum number of modes
+ */
+#define NUM_SYNC_REP_WAIT_MODES 1
+
+extern XLogRecPtr ReplyLSN[NUM_SYNC_REP_WAIT_MODES];
+
+/*
+ * Each synchronous rep wait mode has one SyncRepWaitQueue in shared memory.
+ * These queues live in the WAL sender shmem area.
+ */
+typedef struct SyncRepQueue
+{
+ /*
+ * Current location of the head of the queue. Nobody should be waiting
+ * on the queue for an lsn equal to or earlier than this value. Procs
+ * on the queue will always be later than this value, though we don't
+ * record those values here.
+ */
+ XLogRecPtr lsn;
+
+ PGPROC *head;
+ PGPROC *tail;
+
+ slock_t qlock; /* locks shared variables shown above */
+} SyncRepQueue;
+
+/* user-settable parameters for synchronous replication */
+extern bool sync_rep_mode;
+extern int sync_rep_timeout_client;
+extern int sync_rep_timeout_server;
+extern bool sync_rep_service;
+
+extern bool hot_standby_feedback;
+
+/* called by user backend */
+extern void SyncRepWaitForLSN(XLogRecPtr XactCommitLSN);
+
+/* called by wal sender */
+extern void SyncRepReleaseWaiters(bool timeout);
+extern void SyncRepTimeoutExceeded(void);
+
+/* callback at exit */
+extern void SyncRepCleanupAtProcExit(int code, Datum arg);
+
+#endif /* _SYNCREP_H */
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 32c4962..8e4e7d0 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -56,6 +56,13 @@ typedef struct
XLogRecPtr flush;
XLogRecPtr apply;
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side does not support feedback,
+ * or Hot Standby is not yet available.
+ */
+ TransactionId xmin;
+
/* Sender's system clock at the time of transmission */
TimestampTz sendTime;
} StandbyReplyMessage;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index aa5bfb7..f57df6a 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -13,6 +13,8 @@
#define _WALRECEIVER_H
#include "access/xlogdefs.h"
+#include "replication/syncrep.h"
+#include "storage/latch.h"
#include "storage/spin.h"
#include "pgtime.h"
@@ -72,6 +74,11 @@ typedef struct
*/
char conninfo[MAXCONNINFO];
+ /*
+ * Latch used by aux procs to wake up walreceiver when it has work to do.
+ */
+ Latch latch;
+
slock_t mutex; /* locks shared variables shown above */
} WalRcvData;
@@ -93,6 +100,7 @@ extern PGDLLIMPORT walrcv_disconnect_type walrcv_disconnect;
/* prototypes for functions in walreceiver.c */
extern void WalReceiverMain(void);
+extern void WalRcvWakeup(void);
/* prototypes for functions in walreceiverfuncs.c */
extern Size WalRcvShmemSize(void);
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 5843307..b44bdde 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -15,6 +15,7 @@
#include "access/xlog.h"
#include "nodes/nodes.h"
#include "storage/latch.h"
+#include "replication/syncrep.h"
#include "storage/spin.h"
@@ -44,6 +45,17 @@ typedef struct WalSnd
XLogRecPtr flush;
XLogRecPtr apply;
+ /*
+ * The current xmin from the standby, for Hot Standby feedback.
+ * This may be invalid if the standby-side has not offered a value yet.
+ */
+ TransactionId xmin;
+
+ /*
+ * Highest level of sync rep available from this standby.
+ */
+ bool sync_rep_service;
+
/* Protects shared variables shown above. */
slock_t mutex;
@@ -54,9 +66,24 @@ typedef struct WalSnd
Latch latch;
} WalSnd;
+extern WalSnd *MyWalSnd;
+
/* There is one WalSndCtl struct for the whole database cluster */
typedef struct
{
+ /*
+ * Sync rep wait queues with one queue per request type.
+ * We use one queue per request type so that we can maintain the
+ * invariant that the individual queues are sorted on LSN.
+ * This may also help performance when multiple wal senders
+ * offer different sync rep service levels.
+ */
+ SyncRepQueue sync_rep_queue[NUM_SYNC_REP_WAIT_MODES];
+
+ bool sync_rep_service_available;
+
+ slock_t ctlmutex; /* locks shared variables shown above */
+
WalSnd walsnds[1]; /* VARIABLE LENGTH ARRAY */
} WalSndCtlData;
@@ -70,6 +97,7 @@ extern volatile sig_atomic_t walsender_ready_to_stop;
/* user-settable parameters */
extern int WalSndDelay;
extern int max_wal_senders;
+extern bool allow_standalone_primary;
extern int WalSenderMain(void);
extern void WalSndSignals(void);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 97bdc7b..0d2a78e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -29,6 +29,7 @@ typedef enum
PMSIGNAL_START_AUTOVAC_LAUNCHER, /* start an autovacuum launcher */
PMSIGNAL_START_AUTOVAC_WORKER, /* start an autovacuum worker */
PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
+ PMSIGNAL_SYNC_REPLICATION_ACTIVE, /* walsender has completed handshake */
NUM_PMSIGNALS /* Must be last value of enum! */
} PMSignalReason;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 78dbade..27b57c8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -14,6 +14,8 @@
#ifndef _PROC_H_
#define _PROC_H_
+#include "access/xlog.h"
+#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
#include "utils/timestamp.h"
@@ -115,6 +117,11 @@ struct PGPROC
LOCKMASK heldLocks; /* bitmask for lock types already held on this
* lock object by this backend */
+ /* Info to allow us to wait for synchronous replication, if needed. */
+ Latch waitLatch;
+ XLogRecPtr waitLSN; /* waiting for this LSN or higher */
+ bool ownLatch; /* do we own the above latch? */
+
/*
* All PROCLOCK objects for locks held or awaited by this backend are
* linked into one of these lists, according to the partition number of
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c0142c2..12ca1ee 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1297,7 +1297,7 @@ SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schem
pg_stat_bgwriter | SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed, pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req, pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint, pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean, pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean, pg_stat_get_buf_written_backend() AS buffers_backend, pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync, pg_stat_get_buf_alloc() AS buffers_alloc, pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
pg_stat_database | SELECT d.oid AS datid, d.datname, pg_stat_get_db_numbackends(d.oid) AS numbackends, pg_stat_get_db_xact_commit(d.oid) AS xact_commit, pg_stat_get_db_xact_rollback(d.oid) AS xact_rollback, (pg_stat_get_db_blocks_fetched(d.oid) - pg_stat_get_db_blocks_hit(d.oid)) AS blks_read, pg_stat_get_db_blocks_hit(d.oid) AS blks_hit, pg_stat_get_db_tuples_returned(d.oid) AS tup_returned, pg_stat_get_db_tuples_fetched(d.oid) AS tup_fetched, pg_stat_get_db_tuples_inserted(d.oid) AS tup_inserted, pg_stat_get_db_tuples_updated(d.oid) AS tup_updated, pg_stat_get_db_tuples_deleted(d.oid) AS tup_deleted, pg_stat_get_db_conflict_all(d.oid) AS conflicts, pg_stat_get_db_stat_reset_time(d.oid) AS stats_reset FROM pg_database d;
pg_stat_database_conflicts | SELECT d.oid AS datid, d.datname, pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace, pg_stat_get_db_conflict_lock(d.oid) AS confl_lock, pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d;
- pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
+ pg_stat_replication | SELECT s.procpid, s.usesysid, u.rolname AS usename, s.application_name, s.client_addr, s.client_port, s.backend_start, w.state, w.sync, w.sent_location, w.write_location, w.flush_location, w.apply_location FROM pg_stat_get_activity(NULL::integer) s(datid, procpid, usesysid, application_name, current_query, waiting, xact_start, query_start, backend_start, client_addr, client_port), pg_authid u, pg_stat_get_wal_senders() w(procpid, state, sync, sent_location, write_location, flush_location, apply_location) WHERE ((s.usesysid = u.oid) AND (s.procpid = w.procpid));
pg_stat_sys_indexes | SELECT pg_stat_all_indexes.relid, pg_stat_all_indexes.indexrelid, pg_stat_all_indexes.schemaname, pg_stat_all_indexes.relname, pg_stat_all_indexes.indexrelname, pg_stat_all_indexes.idx_scan, pg_stat_all_indexes.idx_tup_read, pg_stat_all_indexes.idx_tup_fetch FROM pg_stat_all_indexes WHERE ((pg_stat_all_indexes.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_indexes.schemaname ~ '^pg_toast'::text));
pg_stat_sys_tables | SELECT pg_stat_all_tables.relid, pg_stat_all_tables.schemaname, pg_stat_all_tables.relname, pg_stat_all_tables.seq_scan, pg_stat_all_tables.seq_tup_read, pg_stat_all_tables.idx_scan, pg_stat_all_tables.idx_tup_fetch, pg_stat_all_tables.n_tup_ins, pg_stat_all_tables.n_tup_upd, pg_stat_all_tables.n_tup_del, pg_stat_all_tables.n_tup_hot_upd, pg_stat_all_tables.n_live_tup, pg_stat_all_tables.n_dead_tup, pg_stat_all_tables.last_vacuum, pg_stat_all_tables.last_autovacuum, pg_stat_all_tables.last_analyze, pg_stat_all_tables.last_autoanalyze, pg_stat_all_tables.vacuum_count, pg_stat_all_tables.autovacuum_count, pg_stat_all_tables.analyze_count, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
pg_stat_user_functions | SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, pg_stat_get_function_calls(p.oid) AS calls, (pg_stat_get_function_time(p.oid) / 1000) AS total_time, (pg_stat_get_function_self_time(p.oid) / 1000) AS self_time FROM (pg_proc p LEFT JOIN pg_namespace n ON ((n.oid = p.pronamespace))) WHERE ((p.prolang <> (12)::oid) AND (pg_stat_get_function_calls(p.oid) IS NOT NULL));
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I committed the patch with those changes, and some minor comment tweaks and
other kibitzing.
+ * 'd' means a standby reply wrapped in a COPY BOTH packet.
+ */
Typo: s/COPY BOTH/CopyData
+ msgtype = pq_getmsgbyte(&input_message);
+ if (msgtype != 'r')
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type %c", msgtype)));
I think that proc_exit(0) needs to be called in error case.
+ static StringInfoData input_message;
+ StandbyReplyMessage reply;
+ char msgtype;
+
+ initStringInfo(&input_message);
Doesn't the repeat of initStringInfo() cause the memory leak?
@@ -518,6 +584,7 @@ WalSndLoop(void)
{
if (!XLogSend(output_message, &caughtup))
break;
+ ProcessRepliesIfAny();
Why is ProcessRepliesIfAny() required there?
We added new columns "write_location", "flush_location" and
"apply_location". So, for the sake of consistency, the column
name "sent_location" should be changed to "send_location"?
Regards,.
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.
This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Feb 14, 2011 at 2:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I committed the patch with those changes, and some minor comment tweaks and
other kibitzing.
I have another comment:
The description of wal_receiver_status_interval is in "18.5.4.
Streaming Replication".
But I think that it should be moved to "18.5.5. Standby Servers" since
it's a parameter
to control the behavior of the standby server rather than that of the master.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Here's the latest patch for sync rep.
I was looking at this code and found something in SyncRepWaitOnQueue
we declare a timeout variable that is a long and another that is a
boolean (this last one in the else part of the "if
(!IsOnSyncRepQueue())"), and then use the boolean one as if it were
the long one
+ else
+ {
+ bool release = false;
+ bool timeout = false;
+
+ SpinLockAcquire(&queue->qlock);
+
+ /*
+ * Check the LSN on our queue and if its moved far enough then
+ * remove us from the queue. First time through this is
+ * unlikely to be far enough, yet is possible. Next time we are
+ * woken we should be more lucky.
+ */
+ if (XLByteLE(XactCommitLSN, queue->lsn))
+ release = true;
+ else if (timeout > 0 &&
+ TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(),
+ now,
+ timeout))
+ {
+ release = true;
+ timeout = true;
+ }
the other two things are on postgresql.conf.sample:
- we have two replication_timeout_client, obviously one of them should
be replication_timeout_server
- synchronous_replication_feedback is off by default, but docs says otherwise
i also have been testing this, until now the only issue i have found
is that if i set allow_standalone_primary to off and there isn't a
standby connected i need to stop the server with -m immediate which is
at least surprising
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte y capacitación de PostgreSQL
On Tue, 2011-02-15 at 01:45 -0500, Jaime Casanova wrote:
On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Here's the latest patch for sync rep.
I was looking at this code and found something in SyncRepWaitOnQueue
we declare a timeout variable that is a long and another that is a
boolean (this last one in the else part of the "if
(!IsOnSyncRepQueue())"), and then use the boolean one as if it were
the long one
OK, thanks.
+ else + { + bool release = false; + bool timeout = false; + + SpinLockAcquire(&queue->qlock); + + /* + * Check the LSN on our queue and if its moved far enough then + * remove us from the queue. First time through this is + * unlikely to be far enough, yet is possible. Next time we are + * woken we should be more lucky. + */ + if (XLByteLE(XactCommitLSN, queue->lsn)) + release = true; + else if (timeout > 0 && + TimestampDifferenceExceeds(GetCurrentTransactionStopTimestamp(), + now, + timeout)) + { + release = true; + timeout = true; + }the other two things are on postgresql.conf.sample:
- we have two replication_timeout_client, obviously one of them should
be replication_timeout_server
- synchronous_replication_feedback is off by default, but docs says otherwisei also have been testing this, until now the only issue i have found
is that if i set allow_standalone_primary to off and there isn't a
standby connected i need to stop the server with -m immediate which is
at least surprising
I think that code is being ripped out, so will check again later.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Feb 14, 2011 at 12:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I committed the patch with those changes, and some minor comment tweaks and
other kibitzing.+ * 'd' means a standby reply wrapped in a COPY BOTH packet. + */Typo: s/COPY BOTH/CopyData
Fixed.
+ msgtype = pq_getmsgbyte(&input_message); + if (msgtype != 'r') + ereport(COMMERROR, + (errcode(ERRCODE_PROTOCOL_VIOLATION), + errmsg("unexpected message type %c", msgtype)));I think that proc_exit(0) needs to be called in error case.
Fixed.
+ static StringInfoData input_message; + StandbyReplyMessage reply; + char msgtype; + + initStringInfo(&input_message);Doesn't the repeat of initStringInfo() cause the memory leak?
Yeah. Fixed, I hope.
@@ -518,6 +584,7 @@ WalSndLoop(void)
{
if (!XLogSend(output_message, &caughtup))
break;
+ ProcessRepliesIfAny();Why is ProcessRepliesIfAny() required there?
I'm not sure if that's 100% necessary, but it seems harmless enough.
We added new columns "write_location", "flush_location" and
"apply_location". So, for the sake of consistency, the column
name "sent_location" should be changed to "send_location"?
I was thinking about stream_location or streaming_location, per
discussion on the other thread.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 15, 2011 at 1:11 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Feb 14, 2011 at 2:08 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I committed the patch with those changes, and some minor comment tweaks and
other kibitzing.I have another comment:
The description of wal_receiver_status_interval is in "18.5.4.
Streaming Replication".
But I think that it should be moved to "18.5.5. Standby Servers" since
it's a parameter
to control the behavior of the standby server rather than that of the master.
Fixed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 14, 2011 at 12:25 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.
Good catch. Is the cleanest solution to pass a boolean parameter to
XLogWalRcvFlush() indicating whether we're in the midst of dying?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 16, 2011 at 2:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 14, 2011 at 12:25 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.Good catch. Is the cleanest solution to pass a boolean parameter to
XLogWalRcvFlush() indicating whether we're in the midst of dying?
Agreed if the comment about why such a boolean parameter is
required is added.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Feb 15, 2011 at 10:13 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Wed, Feb 16, 2011 at 2:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 14, 2011 at 12:25 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.Good catch. Is the cleanest solution to pass a boolean parameter to
XLogWalRcvFlush() indicating whether we're in the midst of dying?Agreed if the comment about why such a boolean parameter is
required is added.
OK, done.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 2011-02-15 at 12:08 -0500, Robert Haas wrote:
On Mon, Feb 14, 2011 at 12:25 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.Good catch. Is the cleanest solution to pass a boolean parameter to
XLogWalRcvFlush() indicating whether we're in the midst of dying?
Surely if you do this then sync rep will fail to respond correctly if
WalReceiver dies.
Why is it OK to write to disk, but not OK to reply?
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On 16.02.2011 17:36, Simon Riggs wrote:
On Tue, 2011-02-15 at 12:08 -0500, Robert Haas wrote:
On Mon, Feb 14, 2011 at 12:25 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.Good catch. Is the cleanest solution to pass a boolean parameter to
XLogWalRcvFlush() indicating whether we're in the midst of dying?Surely if you do this then sync rep will fail to respond correctly if
WalReceiver dies.Why is it OK to write to disk, but not OK to reply?
Because the connection might be dead. A broken connection is a likely
cause of walreceiver death.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, 2011-02-16 at 17:40 +0200, Heikki Linnakangas wrote:
On 16.02.2011 17:36, Simon Riggs wrote:
On Tue, 2011-02-15 at 12:08 -0500, Robert Haas wrote:
On Mon, Feb 14, 2011 at 12:25 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.Good catch. Is the cleanest solution to pass a boolean parameter to
XLogWalRcvFlush() indicating whether we're in the midst of dying?Surely if you do this then sync rep will fail to respond correctly if
WalReceiver dies.Why is it OK to write to disk, but not OK to reply?
Because the connection might be dead. A broken connection is a likely
cause of walreceiver death.
Would it not be possible to check that?
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Feb 16, 2011 at 11:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Wed, 2011-02-16 at 17:40 +0200, Heikki Linnakangas wrote:
On 16.02.2011 17:36, Simon Riggs wrote:
On Tue, 2011-02-15 at 12:08 -0500, Robert Haas wrote:
On Mon, Feb 14, 2011 at 12:25 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
On Fri, Feb 11, 2011 at 4:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:I added a XLogWalRcvSendReply() call into XLogWalRcvFlush() so that it also
sends a status update every time the WAL is flushed. If the walreceiver is
busy receiving and flushing, that would happen once per WAL segment, which
seems sensible.This change can make the callback function "WalRcvDie()" call ereport(ERROR)
via XLogWalRcvFlush(). This looks unsafe.Good catch. Is the cleanest solution to pass a boolean parameter to
XLogWalRcvFlush() indicating whether we're in the midst of dying?Surely if you do this then sync rep will fail to respond correctly if
WalReceiver dies.Why is it OK to write to disk, but not OK to reply?
Because the connection might be dead. A broken connection is a likely
cause of walreceiver death.Would it not be possible to check that?
I'm not actually sure that it matters that much whether we do or not.
ISTM that the WAL receiver is normally going to exit the main loop (in
WalReceiverMain) right here:
/* Process any requests or signals received recently */
ProcessWalRcvInterrupts();
But to get to that point, we either have to be making our first pass
through the loop (in which case nothing interesting has happened yet)
or we have to have just completed an iteration through the loop (in
which case we just sent a reply). I think that the only thing that
can have changed since the last reply is the replay position, which
this version of the sync rep patch doesn't care about anyway. Even if
it did, I'm not sure it'd be worth complicating the die path to
squeeze in one final reply.
Actually, on further reflection, I'm not even sure why we bother with
the fsync. It seems like a useful safeguard but I'm not seeing how we
can get to that point without having fsync'd everything anyway. Am I
missing something?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16.02.2011 19:29, Robert Haas wrote:
Actually, on further reflection, I'm not even sure why we bother with
the fsync. It seems like a useful safeguard but I'm not seeing how we
can get to that point without having fsync'd everything anyway. Am I
missing something?
WalRcvDie() is called on error. For example, if the connection dies
unexpectedly during walrcv_receive().
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Feb 16, 2011 at 12:34 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 16.02.2011 19:29, Robert Haas wrote:
Actually, on further reflection, I'm not even sure why we bother with
the fsync. It seems like a useful safeguard but I'm not seeing how we
can get to that point without having fsync'd everything anyway. Am I
missing something?WalRcvDie() is called on error. For example, if the connection dies
unexpectedly during walrcv_receive().
Ah, OK.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2011-01-21 at 14:45 +0200, Heikki Linnakangas wrote:
* The UI differs from what was agreed on here:
http://archives.postgresql.org/message-id/4D1DCF5A.7070808@enterprisedb.com.
Patch to add server_name parameter, plus mechanism to send info from
standby to master. While doing that, refactor into 3 message types, not
just 1. This addresses Fujii's comment that we may not wish to send
feedback as often as other replies, but doesn't actually alter yet when
the feedback is sent (nor will I do that anytime soon).
Complete but rough hack, for comments, but nothing surprising.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
server_name_for_replication.v1.patchtext/x-patch; charset=UTF-8; name=server_name_for_replication.v1.patchDownload
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ee09468..ff89035 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -95,6 +95,7 @@ static struct
} LogstreamResult;
static StandbyReplyMessage reply_message;
+static StandbyHSFeedbackMessage feedback_message;
/*
* About SIGTERM handling:
@@ -123,6 +124,8 @@ static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
static void XLogWalRcvSendReply(void);
+static void XLogWalRcvSendHSFeedback(void);
+static void XLogWalRcvSendInfo(void);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -303,6 +306,7 @@ WalReceiverMain(void)
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
+ XLogWalRcvSendInfo();
}
/* Wait a while for data to arrive */
@@ -317,6 +321,7 @@ WalReceiverMain(void)
/* Let the master know that we received some data. */
XLogWalRcvSendReply();
+ XLogWalRcvSendHSFeedback();
/*
* If we've written some records, flush them to disk and let the
@@ -331,6 +336,7 @@ WalReceiverMain(void)
* the master anyway, to report any progress in applying WAL.
*/
XLogWalRcvSendReply();
+ XLogWalRcvSendHSFeedback();
}
}
}
@@ -619,40 +625,84 @@ XLogWalRcvSendReply(void)
reply_message.apply = GetXLogReplayRecPtr();
reply_message.sendTime = now;
+ elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X",
+ reply_message.write.xlogid, reply_message.write.xrecoff,
+ reply_message.flush.xlogid, reply_message.flush.xrecoff,
+ reply_message.apply.xlogid, reply_message.apply.xrecoff);
+
+ /* Prepend with the message type and send it. */
+ buf[0] = 'r';
+ memcpy(&buf[1], &reply_message, sizeof(StandbyReplyMessage));
+ walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
+}
+
+/*
+ * Send hot standby feedback message to primary, plus the current time,
+ * in case they don't have a watch.
+ */
+static void
+XLogWalRcvSendHSFeedback(void)
+{
+ char buf[sizeof(StandbyHSFeedbackMessage) + 1];
+ TimestampTz now;
+
+ /*
+ * If the user doesn't want status to be reported to the master, be sure
+ * to exit before doing anything at all.
+ */
+ if (!hot_standby_feedback || !HotStandbyActive())
+ return;
+
+ /* Get current timestamp. */
+ now = GetCurrentTimestamp();
+
/*
* Get the OldestXmin and its associated epoch
*/
- if (hot_standby_feedback && HotStandbyActive())
{
TransactionId nextXid;
uint32 nextEpoch;
- reply_message.xmin = GetOldestXmin(true, false);
+ feedback_message.xmin = GetOldestXmin(true, false);
/*
* Get epoch and adjust if nextXid and oldestXmin are different
* sides of the epoch boundary.
*/
GetNextXidAndEpoch(&nextXid, &nextEpoch);
- if (nextXid < reply_message.xmin)
+ if (nextXid < feedback_message.xmin)
nextEpoch--;
- reply_message.epoch = nextEpoch;
- }
- else
- {
- reply_message.xmin = InvalidTransactionId;
- reply_message.epoch = 0;
+ feedback_message.epoch = nextEpoch;
}
- elog(DEBUG2, "sending write %X/%X flush %X/%X apply %X/%X xmin %u epoch %u",
- reply_message.write.xlogid, reply_message.write.xrecoff,
- reply_message.flush.xlogid, reply_message.flush.xrecoff,
- reply_message.apply.xlogid, reply_message.apply.xrecoff,
- reply_message.xmin,
- reply_message.epoch);
+ elog(DEBUG2, "sending xmin %u epoch %u",
+ feedback_message.xmin,
+ feedback_message.epoch);
/* Prepend with the message type and send it. */
- buf[0] = 'r';
- memcpy(&buf[1], &reply_message, sizeof(StandbyReplyMessage));
- walrcv_send(buf, sizeof(StandbyReplyMessage) + 1);
+ buf[0] = 'h';
+ memcpy(&buf[1], &feedback_message, sizeof(StandbyHSFeedbackMessage));
+ walrcv_send(buf, sizeof(StandbyHSFeedbackMessage) + 1);
+}
+
+/*
+ * Send info message to primary.
+ */
+static void
+XLogWalRcvSendInfo(void)
+{
+ char buf[sizeof(StandbyInfoMessage) + 1];
+ StandbyInfoMessage info_message;
+
+ /* Get current timestamp. */
+ info_message.sendTime = GetCurrentTimestamp();
+ strncpy(info_message.servername, ServerName, strlen(ServerName));
+
+ elog(DEBUG2, "sending servername %s",
+ info_message.servername);
+
+ /* Prepend with the message type and send it. */
+ buf[0] = 'i';
+ memcpy(&buf[1], &info_message, sizeof(StandbyInfoMessage));
+ walrcv_send(buf, sizeof(StandbyInfoMessage) + 1);
}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a6a7a14..e46cd01 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -116,7 +116,10 @@ static void WalSndKill(int code, Datum arg);
static bool XLogSend(char *msgbuf, bool *caughtup);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd * cmd);
+static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
+static void ProcessStandbyHSFeedbackMessage(void);
+static void ProcessStandbyInfoMessage(void);
static void ProcessRepliesIfAny(void);
@@ -456,42 +459,45 @@ ProcessRepliesIfAny(void)
unsigned char firstchar;
int r;
- r = pq_getbyte_if_available(&firstchar);
- if (r < 0)
- {
- /* unexpected error or EOF */
- ereport(COMMERROR,
- (errcode(ERRCODE_PROTOCOL_VIOLATION),
- errmsg("unexpected EOF on standby connection")));
- proc_exit(0);
- }
- if (r == 0)
+ for (;;)
{
- /* no data available without blocking */
- return;
- }
+ r = pq_getbyte_if_available(&firstchar);
+ if (r < 0)
+ {
+ /* unexpected error or EOF */
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected EOF on standby connection")));
+ proc_exit(0);
+ }
+ if (r == 0)
+ {
+ /* no data available without blocking */
+ return;
+ }
- /* Handle the very limited subset of commands expected in this phase */
- switch (firstchar)
- {
- /*
- * 'd' means a standby reply wrapped in a CopyData packet.
- */
- case 'd':
- ProcessStandbyReplyMessage();
- break;
+ /* Handle the very limited subset of commands expected in this phase */
+ switch (firstchar)
+ {
+ /*
+ * 'd' means a standby reply wrapped in a CopyData packet.
+ */
+ case 'd':
+ ProcessStandbyMessage();
+ break;
- /*
- * 'X' means that the standby is closing down the socket.
- */
- case 'X':
- proc_exit(0);
+ /*
+ * 'X' means that the standby is closing down the socket.
+ */
+ case 'X':
+ proc_exit(0);
- default:
- ereport(FATAL,
- (errcode(ERRCODE_PROTOCOL_VIOLATION),
- errmsg("invalid standby closing message type %d",
- firstchar)));
+ default:
+ ereport(FATAL,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("invalid standby closing message type %d",
+ firstchar)));
+ }
}
}
@@ -499,11 +505,9 @@ ProcessRepliesIfAny(void)
* Process a status update message received from standby.
*/
static void
-ProcessStandbyReplyMessage(void)
+ProcessStandbyMessage(void)
{
- StandbyReplyMessage reply;
char msgtype;
- TransactionId newxmin = InvalidTransactionId;
resetStringInfo(&reply_message);
@@ -523,22 +527,43 @@ ProcessStandbyReplyMessage(void)
* one type.
*/
msgtype = pq_getmsgbyte(&reply_message);
- if (msgtype != 'r')
+
+ switch (msgtype)
{
- ereport(COMMERROR,
- (errcode(ERRCODE_PROTOCOL_VIOLATION),
- errmsg("unexpected message type %c", msgtype)));
- proc_exit(0);
+ case 'r':
+ ProcessStandbyReplyMessage();
+ break;
+
+ case 'h':
+ ProcessStandbyHSFeedbackMessage();
+ break;
+
+ case 'i':
+ ProcessStandbyInfoMessage();
+ break;
+
+ default:
+ ereport(COMMERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type %c", msgtype)));
+ proc_exit(0);
}
+}
+
+/*
+ * Regular reply from standby advising of WAL positions on standby server.
+ */
+static void
+ProcessStandbyReplyMessage(void)
+{
+ StandbyReplyMessage reply;
pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyReplyMessage));
- elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X xmin %u epoch %u",
+ elog(DEBUG2, "write %X/%X flush %X/%X apply %X/%X",
reply.write.xlogid, reply.write.xrecoff,
reply.flush.xlogid, reply.flush.xrecoff,
- reply.apply.xlogid, reply.apply.xrecoff,
- reply.xmin,
- reply.epoch);
+ reply.apply.xlogid, reply.apply.xrecoff);
/*
* Update shared state for this WalSender process
@@ -554,6 +579,22 @@ ProcessStandbyReplyMessage(void)
walsnd->apply = reply.apply;
SpinLockRelease(&walsnd->mutex);
}
+}
+
+/*
+ * Hot Standby feedback
+ */
+static void
+ProcessStandbyHSFeedbackMessage(void)
+{
+ StandbyHSFeedbackMessage reply;
+ TransactionId newxmin = InvalidTransactionId;
+
+ pq_copymsgbytes(&reply_message, (char *) &reply, sizeof(StandbyHSFeedbackMessage));
+
+ elog(DEBUG2, "xmin %u epoch %u",
+ reply.xmin,
+ reply.epoch);
/*
* Update the WalSender's proc xmin to allow it to be visible
@@ -619,6 +660,16 @@ ProcessStandbyReplyMessage(void)
}
}
+static void
+ProcessStandbyInfoMessage(void)
+{
+ StandbyInfoMessage info;
+
+ pq_copymsgbytes(&reply_message, (char *) &info, sizeof(StandbyInfoMessage));
+
+ elog(DEBUG2, "server name %s", info.servername);
+}
+
/* Main loop of walsender process */
static int
WalSndLoop(void)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 55cbf75..de6de82 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -405,6 +405,9 @@ int tcp_keepalives_idle;
int tcp_keepalives_interval;
int tcp_keepalives_count;
+char *ServerName = NULL;
+
+
/*
* These variables are all dummies that don't do anything, except in some
* cases provide the value for SHOW to display. The real state is elsewhere
@@ -2365,6 +2368,15 @@ static struct config_string ConfigureNamesString[] =
},
{
+ {"server_name", PGC_POSTMASTER, CLIENT_CONN_STATEMENT,
+ gettext_noop("Allows setting of a unique name for this server."),
+ NULL
+ },
+ &ServerName,
+ "", NULL, NULL
+ },
+
+ {
{"temp_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the tablespace(s) to use for temporary tables and sort files."),
NULL,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6726733..cbe6fb2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -56,6 +56,7 @@
# - Connection Settings -
+#server_name = '' # optional server name for use when clustering servers
#listen_addresses = 'localhost' # what IP address(es) to listen on;
# comma-separated list of addresses;
# defaults to 'localhost', '*' = all
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index aa8cce5..bf6c262 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -218,6 +218,7 @@ extern int CTimeZone;
#define MAXTZLEN 10 /* max TZ name len, not counting tr. null */
+extern char *ServerName;
extern bool enableFsync;
extern bool allowSystemTableMods;
extern PGDLLIMPORT int work_mem;
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index da94b6b..48cb333 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -56,6 +56,18 @@ typedef struct
XLogRecPtr flush;
XLogRecPtr apply;
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} StandbyReplyMessage;
+
+/*
+ * Hot Standby feedback from standby (message type 'h'). This is wrapped within
+ * a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef struct
+{
/*
* The current xmin and epoch from the standby, for Hot Standby feedback.
* This may be invalid if the standby-side does not support feedback,
@@ -64,10 +76,23 @@ typedef struct
TransactionId xmin;
uint32 epoch;
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} StandbyHSFeedbackMessage;
+
+/*
+ * Info message from standby (message type 'i'). This is wrapped within
+ * a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef struct
+{
+ char servername[64];
/* Sender's system clock at the time of transmission */
TimestampTz sendTime;
-} StandbyReplyMessage;
+} StandbyInfoMessage;
/*
* Maximum data payload in a WAL data message. Must be >= XLOG_BLCKSZ.
On Fri, 2011-02-18 at 00:48 +0000, Simon Riggs wrote:
Complete but rough hack, for comments, but nothing surprising.
This is an implicit requirement from our earlier agreed API, so its
blocking further work on Sync Rep.
I'm looking to commit this in about 3-4 hours unless I get comments.
--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services