pg_last_xact_insert_timestamp
On Thu, Sep 8, 2011 at 7:06 AM, Chris Redekop <chris@replicon.com> wrote:
Is there anything available to get the last time a transaction
occurred?....like say "pg_last_xact_timestamp"? In order to accurately
calculate how far behind my slave is I need to do something like
master::pg_last_xact_timestamp() -
slave::pg_last_xact_replay_timestamp()....currently I'm using now() instead
of the pg_last_xact_timestamp() call, but then when the master is not busy
the slave appears to lag behind. I'm considering writing a C module to get
the last modified file time of the xlog, but I'm hoping there is a better
alternative that I haven't found yet....
The above has been posted in pgsql-general. The reason why Chris thinks
a counterpart of pg_last_xact_replay_timestamp() is required sounds
reasonable to me. So I'd like to propose new function
"pg_last_xact_insert_timestamp" as the counterpart, which returns the
timestamp of the last inserted commit or abort WAL record. I'll add the
attached patch into the next CF.
Comments?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
pg_last_xact_insert_timestamp_v1.patchtext/x-patch; charset=US-ASCII; name=pg_last_xact_insert_timestamp_v1.patchDownload
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 13996,14001 **** SELECT set_config('log_statement_stats', 'off', false);
--- 13996,14004 ----
<primary>pg_current_xlog_location</primary>
</indexterm>
<indexterm>
+ <primary>pg_last_xact_insert_timestamp</primary>
+ </indexterm>
+ <indexterm>
<primary>pg_start_backup</primary>
</indexterm>
<indexterm>
***************
*** 14049,14054 **** SELECT set_config('log_statement_stats', 'off', false);
--- 14052,14064 ----
</row>
<row>
<entry>
+ <literal><function>pg_last_xact_insert_timestamp()</function></literal>
+ </entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>Get current transaction log insert time stamp</entry>
+ </row>
+ <row>
+ <entry>
<literal><function>pg_start_backup(<parameter>label</> <type>text</> <optional>, <parameter>fast</> <type>boolean</> </optional>)</function></literal>
</entry>
<entry><type>text</type></entry>
***************
*** 14175,14180 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 14185,14199 ----
</para>
<para>
+ <function>pg_last_xact_insert_timestamp</> displays the time stamp of last inserted
+ transaction. This is the time at which the commit or abort WAL record for that transaction.
+ This is initialized with the time stamp of last transaction replayed during recovery (i.e.,
+ the return value of <function>pg_last_xact_replay_timestamp</>). If no transactions
+ have been replayed during recovery, <function>pg_last_xact_insert_timestamp</>
+ returns NULL until at least one commit or abort WAL record has been inserted.
+ </para>
+
+ <para>
For details about proper usage of these functions, see
<xref linkend="continuous-archiving">.
</para>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 867,872 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 867,881 ----
<command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
<para>
+ You can also calculate the lag in time stamp by comparing the last
+ WAL insert time stamp on the primary with the last WAL replay
+ time stamp on the standby. They can be retrieved using
+ <function>pg_last_xact_insert_timestamp</> on the primary and
+ the <function>pg_last_xact_replay_timestamp</> on the standby,
+ respectively (see <xref linkend="functions-admin-backup-table"> and
+ <xref linkend="functions-recovery-info-table"> for details).
+ </para>
+ <para>
You can retrieve a list of WAL sender processes via the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view. Large differences between
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1041,1046 **** RecordTransactionCommit(void)
--- 1041,1049 ----
rdata[lastrdata].next = NULL;
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
+
+ /* Save timestamp of latest transaction commit record */
+ SetLastInsertXTime(xlrec.xact_time);
}
else
{
***************
*** 1064,1069 **** RecordTransactionCommit(void)
--- 1067,1075 ----
rdata[lastrdata].next = NULL;
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata);
+
+ /* Save timestamp of latest transaction commit record */
+ SetLastInsertXTime(xlrec.xact_time);
}
}
***************
*** 1433,1438 **** RecordTransactionAbort(bool isSubXact)
--- 1439,1447 ----
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT, rdata);
+ /* Save timestamp of latest transaction abort record */
+ SetLastInsertXTime(xlrec.xact_time);
+
/*
* Report the latest async abort LSN, so that the WAL writer knows to
* flush this abort. There's nothing to be gained by delaying this, since
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 391,396 **** typedef struct XLogCtlData
--- 391,397 ----
XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */
uint32 lastRemovedLog; /* latest removed/recycled XLOG segment */
uint32 lastRemovedSeg;
+ TimestampTz lastInsertXTime; /* timestamp of last COMMIT/ABORT record inserted */
/* Protected by WALWriteLock: */
XLogCtlWrite Write;
***************
*** 608,613 **** static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
--- 609,615 ----
static void recoveryPausesHere(void);
static bool RecoveryIsPaused(void);
static void SetRecoveryPause(bool recoveryPause);
+ static TimestampTz GetLastInsertXTime(void);
static void SetLatestXTime(TimestampTz xtime);
static TimestampTz GetLatestXTime(void);
static void CheckRequiredParameterValues(void);
***************
*** 5866,5871 **** pg_is_xlog_replay_paused(PG_FUNCTION_ARGS)
--- 5868,5928 ----
}
/*
+ * Save timestamp of latest inserted commit/abort record.
+ */
+ void
+ SetLastInsertXTime(TimestampTz xtime)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->lastInsertXTime = xtime;
+ SpinLockRelease(&xlogctl->info_lck);
+ }
+
+ /*
+ * Fetch timestamp of latest inserted commit/abort record.
+ */
+ static TimestampTz
+ GetLastInsertXTime(void)
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ TimestampTz xtime;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xtime = xlogctl->lastInsertXTime;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return xtime;
+ }
+
+ /*
+ * Returns timestamp of latest inserted commit/abort record.
+ *
+ * When the server has been started normally, this function returns
+ * NULL until at least one commit/abort record has been inserted.
+ */
+ Datum
+ pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS)
+ {
+ TimestampTz xtime;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
+ xtime = GetLastInsertXTime();
+ if (xtime == 0)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TIMESTAMPTZ(xtime);
+ }
+
+ /*
* Save timestamp of latest processed commit/abort record.
*
* We keep this in XLogCtl, not a simple static variable, so that it can be
***************
*** 6944,6949 **** StartupXLOG(void)
--- 7001,7012 ----
}
/*
+ * Initialize the timestamp of the latest inserted COMMIT/ABORT record
+ * with the latest replayed one.
+ */
+ SetLastInsertXTime(GetLatestXTime());
+
+ /*
* If any of the critical GUCs have changed, log them before we allow
* backends to write WAL.
*/
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 306,311 **** extern void CreateCheckPoint(int flags);
--- 306,312 ----
extern bool CreateRestartPoint(int flags);
extern void XLogPutNextOid(Oid nextOid);
extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void SetLastInsertXTime(TimestampTz xtime);
extern XLogRecPtr GetRedoRecPtr(void);
extern XLogRecPtr GetInsertRecPtr(void);
extern XLogRecPtr GetFlushRecPtr(void);
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 270,275 **** extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
--- 270,276 ----
extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
+ extern Datum pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2869,2874 **** DATA(insert OID = 2850 ( pg_xlogfile_name_offset PGNSP PGUID 12 1 0 0 0 f f f t
--- 2869,2876 ----
DESCR("xlog filename and byte offset, given an xlog location");
DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
DESCR("xlog filename, given an xlog location");
+ DATA(insert OID = 3831 ( pg_last_xact_insert_timestamp PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 1184 "" _null_ _null_ _null_ _null_ pg_last_xact_insert_timestamp _null_ _null_ _null_ ));
+ DESCR("timestamp of last insert xact");
DATA(insert OID = 3810 ( pg_is_in_recovery PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ ));
DESCR("true if server is in recovery");
On Thu, Sep 8, 2011 at 9:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 8, 2011 at 7:06 AM, Chris Redekop <chris@replicon.com> wrote:
Is there anything available to get the last time a transaction
occurred?....like say "pg_last_xact_timestamp"? In order to accurately
calculate how far behind my slave is I need to do something like
master::pg_last_xact_timestamp() -
slave::pg_last_xact_replay_timestamp()....currently I'm using now() instead
of the pg_last_xact_timestamp() call, but then when the master is not busy
the slave appears to lag behind. I'm considering writing a C module to get
the last modified file time of the xlog, but I'm hoping there is a better
alternative that I haven't found yet....The above has been posted in pgsql-general. The reason why Chris thinks
a counterpart of pg_last_xact_replay_timestamp() is required sounds
reasonable to me. So I'd like to propose new function
"pg_last_xact_insert_timestamp" as the counterpart, which returns the
timestamp of the last inserted commit or abort WAL record. I'll add the
attached patch into the next CF.
For reasons stated on the original thread, I don't think we need this.
The OP can calculate what he wants without this.
The code already exists in recovery and has some purpose there. Adding
similar code to the mainline just so somebody can run this function
occasionally is not good. Based on this I will be looking to see if we
can optimise the recovery path some more.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Sep 8, 2011 at 5:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Sep 8, 2011 at 9:36 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
The above has been posted in pgsql-general. The reason why Chris thinks
a counterpart of pg_last_xact_replay_timestamp() is required sounds
reasonable to me. So I'd like to propose new function
"pg_last_xact_insert_timestamp" as the counterpart, which returns the
timestamp of the last inserted commit or abort WAL record. I'll add the
attached patch into the next CF.For reasons stated on the original thread, I don't think we need this.
The OP can calculate what he wants without this.
The code already exists in recovery and has some purpose there. Adding
similar code to the mainline just so somebody can run this function
occasionally is not good. Based on this I will be looking to see if we
can optimise the recovery path some more.
Okay. Let me explain another use case of pg_last_xact_insert_timestamp().
The existing functions might be enough for checking whether the standby
has already caught up with the master. But I think that new function would be
very useful to calculate *how much* the standby is behind from the master.
Of course, we can do that by using the existing functions. But a problem is
that they return LSN and the calculated delay is represented as the size of
WAL. For users, it's not easy to handle LSN (e.g., there is no function to do
calculation of LSN), and the delay in WAL size is not intuitive. I've sometimes
received such a complaint.
OTOH, new function enables users to monitor the delay as a timestamp.
For users, a timestamp is obviously easier to handle than LSN, and the delay
as a timestamp is more intuitive. So, I think that it's worth adding
something like pg_last_xact_insert_timestamp into core for improvement
of user-friendness.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
OTOH, new function enables users to monitor the delay as a timestamp.
For users, a timestamp is obviously easier to handle than LSN, and the delay
as a timestamp is more intuitive. So, I think that it's worth adding
something like pg_last_xact_insert_timestamp into core for improvement
of user-friendness.
It seems very nice from a usability point of view, but I have to agree
with Simon's concern about performance. Actually, as of today,
WALInsertLock is such a gigantic bottleneck that I suspect the
overhead of this additional bookkeeping would be completely
unnoticeable. But I'm still reluctant to add more centralized
spinlocks that everyone has to fight over, having recently put a lot
of effort into getting rid of some of the ones we've traditionally
had.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for all the feedback guys. Just to throw another monkey wrench in
here - I've been playing with Simon's proposed solution of returning 0 when
the WAL positions match, and I've come to the realizatiion that even if
using pg_last_xact_insert_timestamp, although it would help, we still
wouldn't be able to get a 100% accurate "how far behind?" counter....not
that this is a big deal, but I know my ops team is going to bitch to me
about it :).....take this situation: there's a lull of 30 seconds where
there are no transactions committed on the server....the slave is totally
caught up, WAL positions match, I'm reporting 0, everything is happy. Then
a transaction is committed on the master....before the slave gets it my
query hits it and sees that we're 30 seconds behind (when in reality we're
<1sec behind).....Because of this affect my graph is a little spikey...I
mean it's not a huge deal or anything - I can put some sanity checking in my
number reporting ("if 1 second ago you were 0 seconds behind, you can't be
more than 1 second behind now" sorta thing). But if we wanted to go for
super-ideal solution there would be a way to get the timestamp of
pg_stat_replication.replay_location+1 (the first transaction that the slave
does not have).
On Thu, Sep 8, 2011 at 7:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Show quoted text
On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
OTOH, new function enables users to monitor the delay as a timestamp.
For users, a timestamp is obviously easier to handle than LSN, and thedelay
as a timestamp is more intuitive. So, I think that it's worth adding
something like pg_last_xact_insert_timestamp into core for improvement
of user-friendness.It seems very nice from a usability point of view, but I have to agree
with Simon's concern about performance. Actually, as of today,
WALInsertLock is such a gigantic bottleneck that I suspect the
overhead of this additional bookkeeping would be completely
unnoticeable. But I'm still reluctant to add more centralized
spinlocks that everyone has to fight over, having recently put a lot
of effort into getting rid of some of the ones we've traditionally
had.--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 8, 2011 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
OTOH, new function enables users to monitor the delay as a timestamp.
For users, a timestamp is obviously easier to handle than LSN, and the delay
as a timestamp is more intuitive. So, I think that it's worth adding
something like pg_last_xact_insert_timestamp into core for improvement
of user-friendness.It seems very nice from a usability point of view, but I have to agree
with Simon's concern about performance. Actually, as of today,
WALInsertLock is such a gigantic bottleneck that I suspect the
overhead of this additional bookkeeping would be completely
unnoticeable.
The patch I've posted adds one acquisition and release of spinlock per
a commit or abort. But it's done outside of WALInsertLock. So I don't
think that the patch degrades a performance.
But I'm still reluctant to add more centralized
spinlocks that everyone has to fight over, having recently put a lot
of effort into getting rid of some of the ones we've traditionally
had.
You are against adding new acquisition and release of spinlock itself
even if it has nothing to do with WALInsertLock? The patch uses
XLogCtl->info_lck spinlock to save the last insert timestamp in the
shmem. XLogCtl->info_lck already protects many shmem variables
related to XLOG. So using XLogCtl->info_lck additionally might make
its contention heavier and degrade a performance. If the patch
defines new spinlock and uses it to save the timestamp to avoid
such a contention, you feel satisfied with the patch?
Another idea to avoid spinlock contention is save the timestamp in
PgBackendStatus (which contains information for pg_stat_activity).
This enables us to write and read the timestamp without spinlock.
Comments?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Sep 8, 2011 at 8:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Another idea to avoid spinlock contention is save the timestamp in
PgBackendStatus (which contains information for pg_stat_activity).
This enables us to write and read the timestamp without spinlock.
Comments?
That seems like a possibly promising approach, in that each backend
could update the information separately, and it's the reader's job to
go find the maximum of all those values when needed. So the overhead
is (properly, in this case) placed on the reader instead of the
writer. But it's a bit tricky, because when the reader wants that
maximum, it has to take into account inactive backends that may have
committed transactions before exiting, not just the ones that are
still connected.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Sep 10, 2011 at 12:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 8, 2011 at 8:42 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Another idea to avoid spinlock contention is save the timestamp in
PgBackendStatus (which contains information for pg_stat_activity).
This enables us to write and read the timestamp without spinlock.
Comments?That seems like a possibly promising approach, in that each backend
could update the information separately, and it's the reader's job to
go find the maximum of all those values when needed. So the overhead
is (properly, in this case) placed on the reader instead of the
writer. But it's a bit tricky, because when the reader wants that
maximum, it has to take into account inactive backends that may have
committed transactions before exiting, not just the ones that are
still connected.
Yes, that's what I was thinking. The attached patch has adopted that
approach.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
pg_last_xact_insert_timestamp_v2.patchapplication/octet-stream; name=pg_last_xact_insert_timestamp_v2.patchDownload
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 13996,14001 **** SELECT set_config('log_statement_stats', 'off', false);
--- 13996,14004 ----
<primary>pg_current_xlog_location</primary>
</indexterm>
<indexterm>
+ <primary>pg_last_xact_insert_timestamp</primary>
+ </indexterm>
+ <indexterm>
<primary>pg_start_backup</primary>
</indexterm>
<indexterm>
***************
*** 14049,14054 **** SELECT set_config('log_statement_stats', 'off', false);
--- 14052,14064 ----
</row>
<row>
<entry>
+ <literal><function>pg_last_xact_insert_timestamp()</function></literal>
+ </entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>Get last transaction log insert time stamp</entry>
+ </row>
+ <row>
+ <entry>
<literal><function>pg_start_backup(<parameter>label</> <type>text</> <optional>, <parameter>fast</> <type>boolean</> </optional>)</function></literal>
</entry>
<entry><type>text</type></entry>
***************
*** 14175,14180 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 14185,14197 ----
</para>
<para>
+ <function>pg_last_xact_insert_timestamp</> displays the time stamp of last inserted
+ transaction. This is the time at which the commit or abort WAL record for that transaction.
+ If there has been no transaction committed or aborted yet since the server has started,
+ this function returns NULL.
+ </para>
+
+ <para>
For details about proper usage of these functions, see
<xref linkend="continuous-archiving">.
</para>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 867,872 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 867,881 ----
<command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
<para>
+ You can also calculate the lag in time stamp by comparing the last
+ WAL insert time stamp on the primary with the last WAL replay
+ time stamp on the standby. They can be retrieved using
+ <function>pg_last_xact_insert_timestamp</> on the primary and
+ the <function>pg_last_xact_replay_timestamp</> on the standby,
+ respectively (see <xref linkend="functions-admin-backup-table"> and
+ <xref linkend="functions-recovery-info-table"> for details).
+ </para>
+ <para>
You can retrieve a list of WAL sender processes via the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view. Large differences between
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1042,1047 **** RecordTransactionCommit(void)
--- 1042,1050 ----
rdata[lastrdata].next = NULL;
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
+
+ /* Save timestamp of latest transaction commit record */
+ pgstat_report_xact_end_timestamp(xlrec.xact_time);
}
else
{
***************
*** 1065,1070 **** RecordTransactionCommit(void)
--- 1068,1076 ----
rdata[lastrdata].next = NULL;
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata);
+
+ /* Save timestamp of latest transaction commit record */
+ pgstat_report_xact_end_timestamp(xlrec.xact_time);
}
}
***************
*** 1434,1439 **** RecordTransactionAbort(bool isSubXact)
--- 1440,1448 ----
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT, rdata);
+ /* Save timestamp of latest transaction abort record */
+ pgstat_report_xact_end_timestamp(xlrec.xact_time);
+
/*
* Report the latest async abort LSN, so that the WAL writer knows to
* flush this abort. There's nothing to be gained by delaying this, since
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5867,5872 **** pg_is_xlog_replay_paused(PG_FUNCTION_ARGS)
--- 5867,5907 ----
}
/*
+ * Returns timestamp of latest inserted commit/abort record.
+ *
+ * If there has been no transaction committed or aborted yet since
+ * the server has started, this function returns NULL.
+ */
+ Datum
+ pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS)
+ {
+ TimestampTz result = 0;
+ TimestampTz xtime;
+ PgBackendStatus *beentry;
+ int i;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
+ beentry = pgstat_fetch_all_beentry();
+
+ for (i = 0; i < MaxBackends; i++, beentry++)
+ {
+ xtime = beentry->st_xact_end_timestamp;
+ if (result < xtime)
+ result = xtime;
+ }
+
+ if (result == 0)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TIMESTAMPTZ(result);
+ }
+
+ /*
* Save timestamp of latest processed commit/abort record.
*
* We keep this in XLogCtl, not a simple static variable, so that it can be
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 254,260 **** static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
static void pgstat_write_statsfile(bool permanent);
static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
static void backend_read_statsfile(void);
! static void pgstat_read_current_status(void);
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
--- 254,260 ----
static void pgstat_write_statsfile(bool permanent);
static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
static void backend_read_statsfile(void);
! static void pgstat_read_current_status(bool all);
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
***************
*** 2173,2179 **** pgstat_fetch_stat_funcentry(Oid func_id)
PgBackendStatus *
pgstat_fetch_stat_beentry(int beid)
{
! pgstat_read_current_status();
if (beid < 1 || beid > localNumBackends)
return NULL;
--- 2173,2179 ----
PgBackendStatus *
pgstat_fetch_stat_beentry(int beid)
{
! pgstat_read_current_status(false);
if (beid < 1 || beid > localNumBackends)
return NULL;
***************
*** 2192,2202 **** pgstat_fetch_stat_beentry(int beid)
int
pgstat_fetch_stat_numbackends(void)
{
! pgstat_read_current_status();
return localNumBackends;
}
/*
* ---------
* pgstat_fetch_global() -
--- 2192,2220 ----
int
pgstat_fetch_stat_numbackends(void)
{
! pgstat_read_current_status(false);
return localNumBackends;
}
+ /* ----------
+ * pgstat_fetch_all_beentry() -
+ *
+ * Support function for the SQL-callable pgstat* functions. Returns
+ * our local copy of all backend entries.
+ *
+ * NB: caller is responsible for a check if the user is permitted to see
+ * this info (especially the querystring).
+ * ----------
+ */
+ PgBackendStatus *
+ pgstat_fetch_all_beentry(void)
+ {
+ pgstat_read_current_status(true);
+
+ return localBackendStatusTable;
+ }
+
/*
* ---------
* pgstat_fetch_global() -
***************
*** 2414,2419 **** pgstat_bestart(void)
--- 2432,2442 ----
beentry->st_appname[NAMEDATALEN - 1] = '\0';
beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
+ /*
+ * Don't reset st_xact_end_timestamp because the previous value can still
+ * be referenced to calculate the latest transaction insert timestamp.
+ */
+
beentry->st_changecount++;
Assert((beentry->st_changecount & 1) == 0);
***************
*** 2560,2565 **** pgstat_report_xact_timestamp(TimestampTz tstamp)
--- 2583,2611 ----
Assert((beentry->st_changecount & 1) == 0);
}
+ /*
+ * Report last transaction end timestamp as the specified value.
+ * Zero means there is no finished transaction.
+ */
+ void
+ pgstat_report_xact_end_timestamp(TimestampTz tstamp)
+ {
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!pgstat_track_activities || !beentry)
+ return;
+
+ /*
+ * Update my status entry, following the protocol of bumping
+ * st_changecount before and after. We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ */
+ beentry->st_changecount++;
+ beentry->st_xact_end_timestamp = tstamp;
+ beentry->st_changecount++;
+ Assert((beentry->st_changecount & 1) == 0);
+ }
+
/* ----------
* pgstat_report_waiting() -
*
***************
*** 2590,2600 **** pgstat_report_waiting(bool waiting)
* pgstat_read_current_status() -
*
* Copy the current contents of the PgBackendStatus array to local memory,
! * if not already done in this transaction.
* ----------
*/
static void
! pgstat_read_current_status(void)
{
volatile PgBackendStatus *beentry;
PgBackendStatus *localtable;
--- 2636,2647 ----
* pgstat_read_current_status() -
*
* Copy the current contents of the PgBackendStatus array to local memory,
! * if not already done in this transaction. If all is true, the local
! * array includes all entries. Otherwise, it includes only valid ones.
* ----------
*/
static void
! pgstat_read_current_status(bool all)
{
volatile PgBackendStatus *beentry;
PgBackendStatus *localtable;
***************
*** 2636,2642 **** pgstat_read_current_status(void)
int save_changecount = beentry->st_changecount;
localentry->st_procpid = beentry->st_procpid;
! if (localentry->st_procpid > 0)
{
memcpy(localentry, (char *) beentry, sizeof(PgBackendStatus));
--- 2683,2689 ----
int save_changecount = beentry->st_changecount;
localentry->st_procpid = beentry->st_procpid;
! if (localentry->st_procpid > 0 || all)
{
memcpy(localentry, (char *) beentry, sizeof(PgBackendStatus));
***************
*** 2659,2666 **** pgstat_read_current_status(void)
}
beentry++;
! /* Only valid entries get included into the local array */
! if (localentry->st_procpid > 0)
{
localentry++;
localappname += NAMEDATALEN;
--- 2706,2713 ----
}
beentry++;
! /* Only valid entries get included into the local array if all is false */
! if (localentry->st_procpid > 0 || all)
{
localentry++;
localappname += NAMEDATALEN;
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 272,277 **** extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
--- 272,278 ----
extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
+ extern Datum pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2869,2874 **** DATA(insert OID = 2850 ( pg_xlogfile_name_offset PGNSP PGUID 12 1 0 0 0 f f f t
--- 2869,2876 ----
DESCR("xlog filename and byte offset, given an xlog location");
DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
DESCR("xlog filename, given an xlog location");
+ DATA(insert OID = 3831 ( pg_last_xact_insert_timestamp PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 1184 "" _null_ _null_ _null_ _null_ pg_last_xact_insert_timestamp _null_ _null_ _null_ ));
+ DESCR("timestamp of last insert xact");
DATA(insert OID = 3810 ( pg_is_in_recovery PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ ));
DESCR("true if server is in recovery");
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 623,628 **** typedef struct PgBackendStatus
--- 623,631 ----
TimestampTz st_xact_start_timestamp;
TimestampTz st_activity_start_timestamp;
+ /* Time when last transaction ended */
+ TimestampTz st_xact_end_timestamp;
+
/* Database OID, owning user's OID, connection client address */
Oid st_databaseid;
Oid st_userid;
***************
*** 718,723 **** extern void pgstat_bestart(void);
--- 721,727 ----
extern void pgstat_report_activity(const char *cmd_str);
extern void pgstat_report_appname(const char *appname);
extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+ extern void pgstat_report_xact_end_timestamp(TimestampTz tstamp);
extern void pgstat_report_waiting(bool waiting);
extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
***************
*** 795,800 **** extern void pgstat_send_bgwriter(void);
--- 799,805 ----
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+ extern PgBackendStatus *pgstat_fetch_all_beentry(void);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
extern int pgstat_fetch_stat_numbackends(void);
extern PgStat_GlobalStats *pgstat_fetch_global(void);
Hi, This is a review for pg_last_xact_insert_timestamp patch.
(https://commitfest.postgresql.org/action/patch_view?id=634)
Summary
====================
There's one question and two comments.
Q1: The shmem entry for timestamp is not initialized on
allocating. Is this OK? (I don't know that for OSs other than
Linux) And zeroing double field is OK for all OSs?
And you can find the two more comment in "Code details" section.
I have no objection for commiter review with the Q1 above is
cleared.
Purpose and function of this patch
====================
This patch allows users to grip the amount of replay lag on the
standby by giving the timestamp of the WAL latestly inserted on
the primary.
Each backend writes the timestamp of the just inserted log record
of commit/abort freely on PgBackendStatus, and the function
pg_last_xact_insert_timestamp() gathers them for all backends
including inactive ones and returns the latest one, or NULL when
no timestamp is saved.
Implemented in lockless way.
Patch application, regression test
====================
This patch applies cleanly onto HEAD.
make world completed successfully. make check passed.
Style of the code and patch seems correct.
Documentation
====================
It looks properly documented for new functions.
Performance
====================
This patch writes one TimestampTz value on shmem per one WAL
insertion for commit/abort with no lock, and collect the values
for all backends on reading without any locks.
I think the former gives virtually no penalty for performance of
transactions, and the latter has no influence on other
transactions thanks to the lockless implement.
Code details
====================
== Shared memory initialization
beentry->st_xact_end_timestamp is not initialized on backend
startup. This is because the life time of the field is beyond
that of backends. On the other hand, Linux man page says that
newly created shared memory segment is zeroed, but I don't have
information about other OSs.
Nevertheless this is ok for all OSs, I don't know whether
initializing TimestampTz(double, int64 is ok) field with 8 bytes
zeros is OK or not, for all platforms. (It is ok for
IEEE754-binary64).
== Modification detection protocol in pgstat.c
In pgstat_report_xact_end_timestamp, `beentry->st_changecount
protocol' is used. It is for avoiding reading halfway-updated
beentry as described in pgstat.h. Meanwhile,
beentry->st_xact_end_timestamp is not read or (re-)initialized in
pgstat.c and xlog.c reads only this field of whole beentry and
st_changecount is not get cared here..
So I think at present it is needless to touch st_changecount
here.
Of cource the penalty is very close to nothing so it may be there
for future use...
== Code duplication in xact.c
in xact.c, same lines inserted into the end of both IF and ELSE
blocks.
xact.c:1047> pgstat_report_xact_end_timestamp(xlrec.xact_time);
xact.c:1073> pgstat_report_xact_end_timestamp(xlrec.xact_time);
These two lines refer to xlrec.xact_time, both of which comes
from xactStopTimestamp freezed at xact.c:986
xact.c:986> SetCurrentTransactionStopTimestamp();
xact.c:1008> xlrec.xact_time = xactStopTimestamp;
xact.c:1051> xlrec.xact_time = xactStopTimestamp;
I think it is better to move this line to just after this ELSE
block using xactStopTimestamp instead of xlrec.xact_time. Please
leave it as it is if you put more importance on the similarity
with the code inserted into RecordTransactionAbort().
Conclusion
====================
With the issue of shmem zeroing on other OSs is cleard, I have no
objection to commit this patch.
Sincerely,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Sep 14, 2011 at 6:21 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:
Hi, This is a review for pg_last_xact_insert_timestamp patch.
(https://commitfest.postgresql.org/action/patch_view?id=634)
Thanks for the review!
Q1: The shmem entry for timestamp is not initialized on
allocating. Is this OK? (I don't know that for OSs other than
Linux) And zeroing double field is OK for all OSs?
CreateSharedBackendStatus() initializes that shmem entries by doing
MemSet(BackendStatusArray, 0, size). You think this is not enough?
Nevertheless this is ok for all OSs, I don't know whether
initializing TimestampTz(double, int64 is ok) field with 8 bytes
zeros is OK or not, for all platforms. (It is ok for
IEEE754-binary64).
Which code are you concerned about?
== Modification detection protocol in pgstat.c
In pgstat_report_xact_end_timestamp, `beentry->st_changecount
protocol' is used. It is for avoiding reading halfway-updated
beentry as described in pgstat.h. Meanwhile,
beentry->st_xact_end_timestamp is not read or (re-)initialized in
pgstat.c and xlog.c reads only this field of whole beentry and
st_changecount is not get cared here..
No, st_changecount is used to read st_xact_end_timestamp.
st_xact_end_timestamp is read from the shmem to the local memory
in pgstat_read_current_status(), and this function always checks
st_changecount when reading the shmem value.
== Code duplication in xact.c
in xact.c, same lines inserted into the end of both IF and ELSE
blocks.xact.c:1047> pgstat_report_xact_end_timestamp(xlrec.xact_time);
xact.c:1073> pgstat_report_xact_end_timestamp(xlrec.xact_time);These two lines refer to xlrec.xact_time, both of which comes
from xactStopTimestamp freezed at xact.c:986xact.c:986> SetCurrentTransactionStopTimestamp();
xact.c:1008> xlrec.xact_time = xactStopTimestamp;
xact.c:1051> xlrec.xact_time = xactStopTimestamp;I think it is better to move this line to just after this ELSE
block using xactStopTimestamp instead of xlrec.xact_time.
Okay, I've changed the patch in that way.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
pg_last_xact_insert_timestamp_v3.patchtext/x-patch; charset=US-ASCII; name=pg_last_xact_insert_timestamp_v3.patchDownload
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 13996,14001 **** SELECT set_config('log_statement_stats', 'off', false);
--- 13996,14004 ----
<primary>pg_current_xlog_location</primary>
</indexterm>
<indexterm>
+ <primary>pg_last_xact_insert_timestamp</primary>
+ </indexterm>
+ <indexterm>
<primary>pg_start_backup</primary>
</indexterm>
<indexterm>
***************
*** 14049,14054 **** SELECT set_config('log_statement_stats', 'off', false);
--- 14052,14064 ----
</row>
<row>
<entry>
+ <literal><function>pg_last_xact_insert_timestamp()</function></literal>
+ </entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>Get last transaction log insert time stamp</entry>
+ </row>
+ <row>
+ <entry>
<literal><function>pg_start_backup(<parameter>label</> <type>text</> <optional>, <parameter>fast</> <type>boolean</> </optional>)</function></literal>
</entry>
<entry><type>text</type></entry>
***************
*** 14175,14180 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 14185,14197 ----
</para>
<para>
+ <function>pg_last_xact_insert_timestamp</> displays the time stamp of last inserted
+ transaction. This is the time at which the commit or abort WAL record for that transaction.
+ If there has been no transaction committed or aborted yet since the server has started,
+ this function returns NULL.
+ </para>
+
+ <para>
For details about proper usage of these functions, see
<xref linkend="continuous-archiving">.
</para>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 867,872 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 867,881 ----
<command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
<para>
+ You can also calculate the lag in time stamp by comparing the last
+ WAL insert time stamp on the primary with the last WAL replay
+ time stamp on the standby. They can be retrieved using
+ <function>pg_last_xact_insert_timestamp</> on the primary and
+ the <function>pg_last_xact_replay_timestamp</> on the standby,
+ respectively (see <xref linkend="functions-admin-backup-table"> and
+ <xref linkend="functions-recovery-info-table"> for details).
+ </para>
+ <para>
You can retrieve a list of WAL sender processes via the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view. Large differences between
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1066,1071 **** RecordTransactionCommit(void)
--- 1066,1074 ----
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata);
}
+
+ /* Save timestamp of latest transaction commit record */
+ pgstat_report_xact_end_timestamp(xactStopTimestamp);
}
/*
***************
*** 1434,1439 **** RecordTransactionAbort(bool isSubXact)
--- 1437,1445 ----
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT, rdata);
+ /* Save timestamp of latest transaction abort record */
+ pgstat_report_xact_end_timestamp(xlrec.xact_time);
+
/*
* Report the latest async abort LSN, so that the WAL writer knows to
* flush this abort. There's nothing to be gained by delaying this, since
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5867,5872 **** pg_is_xlog_replay_paused(PG_FUNCTION_ARGS)
--- 5867,5907 ----
}
/*
+ * Returns timestamp of latest inserted commit/abort record.
+ *
+ * If there has been no transaction committed or aborted yet since
+ * the server has started, this function returns NULL.
+ */
+ Datum
+ pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS)
+ {
+ TimestampTz result = 0;
+ TimestampTz xtime;
+ PgBackendStatus *beentry;
+ int i;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
+ beentry = pgstat_fetch_all_beentry();
+
+ for (i = 0; i < MaxBackends; i++, beentry++)
+ {
+ xtime = beentry->st_xact_end_timestamp;
+ if (result < xtime)
+ result = xtime;
+ }
+
+ if (result == 0)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TIMESTAMPTZ(result);
+ }
+
+ /*
* Save timestamp of latest processed commit/abort record.
*
* We keep this in XLogCtl, not a simple static variable, so that it can be
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 254,260 **** static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
static void pgstat_write_statsfile(bool permanent);
static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
static void backend_read_statsfile(void);
! static void pgstat_read_current_status(void);
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
--- 254,260 ----
static void pgstat_write_statsfile(bool permanent);
static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
static void backend_read_statsfile(void);
! static void pgstat_read_current_status(bool all);
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
***************
*** 2173,2179 **** pgstat_fetch_stat_funcentry(Oid func_id)
PgBackendStatus *
pgstat_fetch_stat_beentry(int beid)
{
! pgstat_read_current_status();
if (beid < 1 || beid > localNumBackends)
return NULL;
--- 2173,2179 ----
PgBackendStatus *
pgstat_fetch_stat_beentry(int beid)
{
! pgstat_read_current_status(false);
if (beid < 1 || beid > localNumBackends)
return NULL;
***************
*** 2192,2202 **** pgstat_fetch_stat_beentry(int beid)
int
pgstat_fetch_stat_numbackends(void)
{
! pgstat_read_current_status();
return localNumBackends;
}
/*
* ---------
* pgstat_fetch_global() -
--- 2192,2220 ----
int
pgstat_fetch_stat_numbackends(void)
{
! pgstat_read_current_status(false);
return localNumBackends;
}
+ /* ----------
+ * pgstat_fetch_all_beentry() -
+ *
+ * Support function for the SQL-callable pgstat* functions. Returns
+ * our local copy of all backend entries.
+ *
+ * NB: caller is responsible for a check if the user is permitted to see
+ * this info (especially the querystring).
+ * ----------
+ */
+ PgBackendStatus *
+ pgstat_fetch_all_beentry(void)
+ {
+ pgstat_read_current_status(true);
+
+ return localBackendStatusTable;
+ }
+
/*
* ---------
* pgstat_fetch_global() -
***************
*** 2414,2419 **** pgstat_bestart(void)
--- 2432,2442 ----
beentry->st_appname[NAMEDATALEN - 1] = '\0';
beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
+ /*
+ * Don't reset st_xact_end_timestamp because the previous value can still
+ * be referenced to calculate the latest transaction insert timestamp.
+ */
+
beentry->st_changecount++;
Assert((beentry->st_changecount & 1) == 0);
***************
*** 2560,2565 **** pgstat_report_xact_timestamp(TimestampTz tstamp)
--- 2583,2611 ----
Assert((beentry->st_changecount & 1) == 0);
}
+ /*
+ * Report last transaction end timestamp as the specified value.
+ * Zero means there is no finished transaction.
+ */
+ void
+ pgstat_report_xact_end_timestamp(TimestampTz tstamp)
+ {
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!pgstat_track_activities || !beentry)
+ return;
+
+ /*
+ * Update my status entry, following the protocol of bumping
+ * st_changecount before and after. We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ */
+ beentry->st_changecount++;
+ beentry->st_xact_end_timestamp = tstamp;
+ beentry->st_changecount++;
+ Assert((beentry->st_changecount & 1) == 0);
+ }
+
/* ----------
* pgstat_report_waiting() -
*
***************
*** 2590,2600 **** pgstat_report_waiting(bool waiting)
* pgstat_read_current_status() -
*
* Copy the current contents of the PgBackendStatus array to local memory,
! * if not already done in this transaction.
* ----------
*/
static void
! pgstat_read_current_status(void)
{
volatile PgBackendStatus *beentry;
PgBackendStatus *localtable;
--- 2636,2647 ----
* pgstat_read_current_status() -
*
* Copy the current contents of the PgBackendStatus array to local memory,
! * if not already done in this transaction. If all is true, the local
! * array includes all entries. Otherwise, it includes only valid ones.
* ----------
*/
static void
! pgstat_read_current_status(bool all)
{
volatile PgBackendStatus *beentry;
PgBackendStatus *localtable;
***************
*** 2636,2642 **** pgstat_read_current_status(void)
int save_changecount = beentry->st_changecount;
localentry->st_procpid = beentry->st_procpid;
! if (localentry->st_procpid > 0)
{
memcpy(localentry, (char *) beentry, sizeof(PgBackendStatus));
--- 2683,2689 ----
int save_changecount = beentry->st_changecount;
localentry->st_procpid = beentry->st_procpid;
! if (localentry->st_procpid > 0 || all)
{
memcpy(localentry, (char *) beentry, sizeof(PgBackendStatus));
***************
*** 2659,2666 **** pgstat_read_current_status(void)
}
beentry++;
! /* Only valid entries get included into the local array */
! if (localentry->st_procpid > 0)
{
localentry++;
localappname += NAMEDATALEN;
--- 2706,2713 ----
}
beentry++;
! /* Only valid entries get included into the local array if all is false */
! if (localentry->st_procpid > 0 || all)
{
localentry++;
localappname += NAMEDATALEN;
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 272,277 **** extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
--- 272,278 ----
extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
+ extern Datum pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2869,2874 **** DATA(insert OID = 2850 ( pg_xlogfile_name_offset PGNSP PGUID 12 1 0 0 0 f f f t
--- 2869,2876 ----
DESCR("xlog filename and byte offset, given an xlog location");
DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
DESCR("xlog filename, given an xlog location");
+ DATA(insert OID = 3831 ( pg_last_xact_insert_timestamp PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 1184 "" _null_ _null_ _null_ _null_ pg_last_xact_insert_timestamp _null_ _null_ _null_ ));
+ DESCR("timestamp of last insert xact");
DATA(insert OID = 3810 ( pg_is_in_recovery PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ ));
DESCR("true if server is in recovery");
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 623,628 **** typedef struct PgBackendStatus
--- 623,631 ----
TimestampTz st_xact_start_timestamp;
TimestampTz st_activity_start_timestamp;
+ /* Time when last transaction ended */
+ TimestampTz st_xact_end_timestamp;
+
/* Database OID, owning user's OID, connection client address */
Oid st_databaseid;
Oid st_userid;
***************
*** 718,723 **** extern void pgstat_bestart(void);
--- 721,727 ----
extern void pgstat_report_activity(const char *cmd_str);
extern void pgstat_report_appname(const char *appname);
extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+ extern void pgstat_report_xact_end_timestamp(TimestampTz tstamp);
extern void pgstat_report_waiting(bool waiting);
extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
***************
*** 795,800 **** extern void pgstat_send_bgwriter(void);
--- 799,805 ----
extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+ extern PgBackendStatus *pgstat_fetch_all_beentry(void);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
extern int pgstat_fetch_stat_numbackends(void);
extern PgStat_GlobalStats *pgstat_fetch_global(void);
On Thu, Sep 15, 2011 at 4:52 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thanks for the review!
Koyotaro Horiguchi -
Are you going to re-review the latest version of this patch?
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Sorry for late to re-review.
One question is remaind,
Q1: The shmem entry for timestamp is not initialized on
allocating. Is this OK? (I don't know that for OSs other than
Linux) And zeroing double field is OK for all OSs?CreateSharedBackendStatus() initializes that shmem entries by doing
MemSet(BackendStatusArray, 0, size). You think this is not enough?
Sorry for missing it. That's enough.
Nevertheless this is ok for all OSs, I don't know whether
initializing TimestampTz(double, int64 is ok) field with 8 bytes
zeros is OK or not, for all platforms. (It is ok for
IEEE754-binary64).Which code are you concerned about?
xlog.c: 5889
beentry = pgstat_fetch_all_beentry();
for (i = 0; i < MaxBackends; i++, beentry++)
{
xtime = beentry->st_xact_end_timestamp;
I think the last line in quoted code above reads possibly
zero-initialized double (or int64) value, then the doubted will
be compared and copied to another double.
if (result < xtime)
result = xtime;
No, st_changecount is used to read st_xact_end_timestamp.
st_xact_end_timestamp is read from the shmem to the local memory
in pgstat_read_current_status(), and this function always checks
st_changecount when reading the shmem value.
Yes, I confirmed that pg_lats_xact_insert_timestamp looks local
copy of BackendStatus. I've found it not unnecessary.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Sep 29, 2011 at 5:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:
Sorry for late to re-review.
Thanks!
Nevertheless this is ok for all OSs, I don't know whether
initializing TimestampTz(double, int64 is ok) field with 8 bytes
zeros is OK or not, for all platforms. (It is ok for
IEEE754-binary64).Which code are you concerned about?
xlog.c: 5889
beentry = pgstat_fetch_all_beentry();
for (i = 0; i < MaxBackends; i++, beentry++)
{
xtime = beentry->st_xact_end_timestamp;I think the last line in quoted code above reads possibly
zero-initialized double (or int64) value, then the doubted will
be compared and copied to another double.if (result < xtime)
result = xtime;
I believe it's safe. Such a code is placed elsewhere in the source, too.
If it's unsafe, we should have received lots of bug reports related
to that. But we've not.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Hello,
I understand that it has been at least practically no problem.
Ok, I send this patch to comitters.
Thanks for your dealing with nuisance questions.
At Thu, 29 Sep 2011 21:21:32 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG0C21F0CZY5ExX-49dxdx7hJuNeiBBJ0Tzvh+7vMXWgw@mail.gmail.com>
Nevertheless this is ok for all OSs, I don't know whether
initializing TimestampTz(double, int64 is ok) field with 8 bytes
...
I believe it's safe. Such a code is placed elsewhere in the source, too.
If it's unsafe, we should have received lots of bug reports related
to that. But we've not.
Regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Sep 30, 2011 at 3:24 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:
Ok, I send this patch to comitters.
I repeat my objection to this patch. I'm very sorry I haven't been
around much in last few weeks to keep up a dialogue about this and to
make it clear how wrong I think this is.
Adding something onto the main path of transaction commit is not good,
especially when the only purpose of it is to run an occasional
monitoring query for those people that use replication. Not everybody
uses replication and even people that do use it don't need to run it
so frequently that every single commit needs this. This is just bloat
that we do not need and can also easily avoid.
The calculation itself would be problematic since it differs from the
way standby_delay is calculated on the standby, which will cause much
confusion. Some thought or comment should be made about that also.
If we want to measure times, we can easily send regular messages into
WAL to provide this function. Using checkpoint records would seem
frequent enough to me. We don't always send checkpoint records but we
can send an info message instead if we are streaming. If
archive_timeout is not set and max_wal_senders > 0 then we can send an
info WAL message with timestamp on a regular cycle.
Alternatively, we can send regular special messages from WALSender to
WALReceiver that do not form part of the WAL stream, so we don't bulk
up WAL archives. (i.e. don't use "w" messages). That seems like the
most viable approach to this problem.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 30, 2011 at 3:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Sep 30, 2011 at 3:24 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:Ok, I send this patch to comitters.
I repeat my objection to this patch. I'm very sorry I haven't been
around much in last few weeks to keep up a dialogue about this and to
make it clear how wrong I think this is.Adding something onto the main path of transaction commit is not good,
especially when the only purpose of it is to run an occasional
monitoring query for those people that use replication. Not everybody
uses replication and even people that do use it don't need to run it
so frequently that every single commit needs this. This is just bloat
that we do not need and can also easily avoid.
I think the overhead of this is so completely trivial that we
shouldn't be concerned about it. It's writing 12 bytes to shared
memory for each commit, without taking a lock, in a cache line that
should be minimally contended. We write plenty of other data to
shared memory on each commit, and that's nowhere close to being the
expensive part of committing a transaction. What's expensive is
fighting over WALInsertLock and ProcArrayLock and CLogControlLock, and
possibly waiting for the commit to be durably flushed to disk, and
maybe (at the margin) wrenching the cache lines in our PGPROC away
from whatever processor last stole them to do GetSnapshotData(). I
don't think that a couple of stores to uncontended shared memory are
going to make any difference.
The calculation itself would be problematic since it differs from the
way standby_delay is calculated on the standby, which will cause much
confusion. Some thought or comment should be made about that also.
The string standby_delay doesn't appear in our source tree anywhere,
so I'm not sure what this is referring to. In any case, I'm in favor
of this feature. Currently, the only way to measure the lag on the
standby is to measure it in WAL bytes - and you get to write your own
script to parse the WAL positions. This will allow people to measure
it in minutes. That seems like a significant usability improvement.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 30, 2011 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 30, 2011 at 3:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Sep 30, 2011 at 3:24 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@oss.ntt.co.jp> wrote:Ok, I send this patch to comitters.
I repeat my objection to this patch. I'm very sorry I haven't been
around much in last few weeks to keep up a dialogue about this and to
make it clear how wrong I think this is.Adding something onto the main path of transaction commit is not good,
especially when the only purpose of it is to run an occasional
monitoring query for those people that use replication. Not everybody
uses replication and even people that do use it don't need to run it
so frequently that every single commit needs this. This is just bloat
that we do not need and can also easily avoid.I think the overhead of this is so completely trivial that we
shouldn't be concerned about it. It's writing 12 bytes to shared
memory for each commit, without taking a lock, in a cache line that
should be minimally contended. We write plenty of other data to
shared memory on each commit, and that's nowhere close to being the
expensive part of committing a transaction. What's expensive is
fighting over WALInsertLock and ProcArrayLock and CLogControlLock, and
possibly waiting for the commit to be durably flushed to disk, and
maybe (at the margin) wrenching the cache lines in our PGPROC away
from whatever processor last stole them to do GetSnapshotData(). I
don't think that a couple of stores to uncontended shared memory are
going to make any difference.
If the feature could not be done another way, easily, I might agree.
The point is it isn't necessary, useful or elegant to do it this way
and *any* cycles added to mainline transactions need to have careful
justification. And there really isn't any justification for doing
things this way other than its the first way somebody thought of.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 30, 2011 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
If the feature could not be done another way, easily, I might agree.
I don't see that you've offered a reasonable alternative. The
alternative proposals that you proposed don't appear to me to be
solving the same problem. AIUI, the requested feature is to be able
to get, on the master, the timestamp of the last commit or abort, just
as we can already get the timestamp of the last commit or abort
replayed on the standby. Nothing you WAL log and no messages you send
to the standby are going to accomplish that goal.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 30, 2011 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 30, 2011 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
If the feature could not be done another way, easily, I might agree.
I don't see that you've offered a reasonable alternative. The
alternative proposals that you proposed don't appear to me to be
solving the same problem. AIUI, the requested feature is to be able
to get, on the master, the timestamp of the last commit or abort, just
as we can already get the timestamp of the last commit or abort
replayed on the standby. Nothing you WAL log and no messages you send
to the standby are going to accomplish that goal.
The goal of the OP was to find out the replication delay. This
function was suggested, but its not the only way.
My alternative proposals solve the original problem in a better way.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 30, 2011 at 4:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Sep 30, 2011 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 30, 2011 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
If the feature could not be done another way, easily, I might agree.
I don't see that you've offered a reasonable alternative. The
alternative proposals that you proposed don't appear to me to be
solving the same problem. AIUI, the requested feature is to be able
to get, on the master, the timestamp of the last commit or abort, just
as we can already get the timestamp of the last commit or abort
replayed on the standby. Nothing you WAL log and no messages you send
to the standby are going to accomplish that goal.The goal of the OP was to find out the replication delay. This
function was suggested, but its not the only way.My alternative proposals solve the original problem in a better way.
As far as I can see, they don't solve the problem at all. Your
proposals involve sending additional information from the master to
the slave, but the slave already knows both its WAL position and the
timestamp of the transaction it has most recently replayed, because
the startup process on the slave tracks that information and publishes
it in shared memory. On the master, however, only the WAL position is
centrally tracked; the transaction timestamps are not.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 15, 2011 at 4:52 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Okay, I've changed the patch in that way.
It occurs to me that pgstat_report_xact_end_timestamp doesn't really
need to follow the protocol of bumping the change count before and
after bumping the timestamp. We elsewhere assume that four-byte reads
and writes are atomic, so there's no harm in assuming the same thing
here (and if they're not... then the change-count thing is pretty
dubious anyway). I think it's sufficient to just set the value, full
stop.
Also, in pg_last_xact_insert_timestamp, the errhint() seems a little
strange - this is not exactly a WAL *control* function, is it?
In the documentation, for the short description of
pg_last_xact_insert_timestamp(), how about something like "returns the
time at which a transaction commit or transaction about record was
last inserted into the transaction log"? Or maybe that's too long.
But the current description doesn't seem to do much other than
recapitulate the function name, so I'm wondering if we can do any
better than that.
I think that instead of hacking up the backend-status copying code to
have a mode where it copies everything, you should just have a
special-purpose function that computes the value you need directly off
the backend status entries themselves. This approach seems like it
both clutters the code and adds lots of extra data copying.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Oct 2, 2011 at 12:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 30, 2011 at 4:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Sep 30, 2011 at 8:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 30, 2011 at 3:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
If the feature could not be done another way, easily, I might agree.
I don't see that you've offered a reasonable alternative. The
alternative proposals that you proposed don't appear to me to be
solving the same problem. AIUI, the requested feature is to be able
to get, on the master, the timestamp of the last commit or abort, just
as we can already get the timestamp of the last commit or abort
replayed on the standby. Nothing you WAL log and no messages you send
to the standby are going to accomplish that goal.The goal of the OP was to find out the replication delay. This
function was suggested, but its not the only way.My alternative proposals solve the original problem in a better way.
As far as I can see, they don't solve the problem at all. Your
proposals involve sending additional information from the master to
the slave, but the slave already knows both its WAL position and the
timestamp of the transaction it has most recently replayed, because
the startup process on the slave tracks that information and publishes
it in shared memory. On the master, however, only the WAL position is
centrally tracked; the transaction timestamps are not.
The problem is to find the replication delay, even when the system is quiet.
What I have proposed finds the replication delay more accurately even
than looking at the last commit, since often there are writes but no
commits.
If we focus on the problem, rather than the first suggested solution
to that problem, we'll come out on top.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
It occurs to me that pgstat_report_xact_end_timestamp doesn't really
need to follow the protocol of bumping the change count before and
after bumping the timestamp. We elsewhere assume that four-byte reads
and writes are atomic, so there's no harm in assuming the same thing
here (and if they're not... then the change-count thing is pretty
dubious anyway). I think it's sufficient to just set the value, full
stop.
I agree you can read the value without that, but I think that setting
it should still bump the change count. Otherwise there's no way for
another process to collect the whole struct and be sure that it's
self-consistent. We may not have a critical need for that right now,
but it's silly to foreclose the possibility to save a cycle or so.
regards, tom lane
On Sun, Oct 2, 2011 at 8:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It occurs to me that pgstat_report_xact_end_timestamp doesn't really
need to follow the protocol of bumping the change count before and
after bumping the timestamp. We elsewhere assume that four-byte reads
and writes are atomic, so there's no harm in assuming the same thing
here (and if they're not... then the change-count thing is pretty
dubious anyway). I think it's sufficient to just set the value, full
stop.
I agree with Tom here. It seems to be safer to follow the protocol even if
that's not required for now.
Also, in pg_last_xact_insert_timestamp, the errhint() seems a little
strange - this is not exactly a WAL *control* function, is it?
Not only "control" but also "WAL" might be confusing. What about
"transaction information functions"?
BTW, pg_current_xlog_location() and pg_current_xlog_insert_location()
use the same HINT message as I used for pg_last_xact_insert_timestamp(),
but they are also not WAL *control* function. And, in the document,
they are categorized as "Backup Control Functions", but which sounds also
strange. We should call them "WAL information functions" in both
HINT message and the document?
In the documentation, for the short description of
pg_last_xact_insert_timestamp(), how about something like "returns the
time at which a transaction commit or transaction about record was
last inserted into the transaction log"? Or maybe that's too long.
But the current description doesn't seem to do much other than
recapitulate the function name, so I'm wondering if we can do any
better than that.
Agreed. I will change the description per your suggestion.
I think that instead of hacking up the backend-status copying code to
have a mode where it copies everything, you should just have a
special-purpose function that computes the value you need directly off
the backend status entries themselves. This approach seems like it
both clutters the code and adds lots of extra data copying.
Agreed. Will change.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Sep 30, 2011 at 4:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
If we want to measure times, we can easily send regular messages into
WAL to provide this function. Using checkpoint records would seem
frequent enough to me. We don't always send checkpoint records but we
can send an info message instead if we are streaming. If
archive_timeout is not set and max_wal_senders > 0 then we can send an
info WAL message with timestamp on a regular cycle.
What timestamp are you thinking the walsender should send? What we need
is the timestamp which is comparable with the result of
pg_last_xact_replay_timestamp() which returns the timestamp of the
transaction commit or abort record. So, even if we adopt your proposal,
ISTM that we still need to collect the timestamp for each commit. No?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Oct 2, 2011 at 8:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
The problem is to find the replication delay, even when the system is quiet.
What I have proposed finds the replication delay more accurately even
than looking at the last commit, since often there are writes but no
commits.If we focus on the problem, rather than the first suggested solution
to that problem, we'll come out on top.
Sorry, but I still don't really think it's fair to say that you've
proposed a solution to this problem. Or if you have, neither I nor
Fujii Masao understand that proposal well enough to decide whether we
like it. You said "maybe we could WAL log something once per
checkpoint cycle" or "maybe we could add a new protocol message".
We've both replied with various emails saying "we don't understand how
that would solve the problem". If you want to add some detail to your
proposal, then we can weigh the pros and cons as compared with what
the patch does - but right now all you've provided is a theory that
there might be a better solution to this problem out there, not any
details about how it would work. Or if you have, then please post a
link to the message where those details are written out, because I
cannot find them on the thread.
I do, however, agree that that the case where the system is quiet is
the problem case for computing replication delay. It seems to me
that, even without this patch, if the system has a continuous stream
of commits, you can easily find the replication delay by differencing
the current time on the master with the value returned by
pg_last_xact_replay_timestamp(). However, if the master goes quiet,
then the slave will appear to be progressively farther behind. With
the addition of this patch, that problem goes away: you can now
difference the return value of pg_last_xact_insert_timestamp() on the
master with the return value of pg_last_xact_replay_timestamp() on the
slave. If the master goes quiet, then pg_last_xact_insert_timestamp()
will stop advancing, and so the two values you are comparing will be
equal once the slave has caught up, and remain equal until activity
resumes on the master.
Now, there is a more subtle remaining problem, which is that when
activity *does* resume on the master, there will be a (probably quite
short) window of time during which the slave will have a much earlier
timestamp than the one on the master. When the master has a commit
after a long idle period but the slave has not yet replayed the commit
record, the replication delay will appear to be equal to the length of
the idle period. But that doesn't seem like a sufficient reason to
reject the whole approach, because there are several ways around it.
First, you could simply decide that the large computed lag value,
although counterintuitive, is accurate under some definition, because,
well, that really is the lag between the last transaction committed on
the master and the last transaction committed on the standby, and if
you don't like the fact that timestamps behave that way, you should
compare using WAL positions instead. If you don't like that approach,
then a second, also viable approach is to teach your monitoring
software that the replication delay can never increase faster than the
rate at which clock time is passing. So if you were caught up a
minute ago, then you can't be more than a minute behind now.
Another point I want to make here is that there's probably more than
one useful definition of replication delay. The previous question
presupposes that you're trying to answer the question "if I have a
transaction that committed N seconds ago on the master, will it be
visible on the standby?". It's also a reasonable time-based
substitute for measuring the difference in master and standby WAL
positions, although certainly it's going to work better if the rate of
WAL generation is relatively even. But for a lot of people, it may be
that what they really want to know is "what is the expected time for
the standby to replay all generated but not yet applied WAL?" - or
maybe some third thing that I'm not thinking of - and this function
won't provide that. I think we can ultimately afford to provide more
than one mechanism here, so I don't see doing this as foreclosing any
other also-useful calculation that someone may wish to add in the
future.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 3, 2011 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Sorry, but I still don't really think it's fair to say that you've
proposed a solution to this problem. Or if you have, neither I nor
Fujii Masao understand that proposal well enough to decide whether we
like it.
Arguing between trenches doesn't really get us anywhere.
As ever, when someone claims to have a better solution then it is up
to them to prove that is the case.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 3, 2011 at 4:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Oct 3, 2011 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Sorry, but I still don't really think it's fair to say that you've
proposed a solution to this problem. Or if you have, neither I nor
Fujii Masao understand that proposal well enough to decide whether we
like it.Arguing between trenches doesn't really get us anywhere.
As ever, when someone claims to have a better solution then it is up
to them to prove that is the case.
So... are you going to do that?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 3, 2011 at 6:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Also, in pg_last_xact_insert_timestamp, the errhint() seems a little
strange - this is not exactly a WAL *control* function, is it?Not only "control" but also "WAL" might be confusing. What about
"transaction information functions"?
Attached is the updated version of the patch. In the patch, I used the
function name itself in the HINT message, i.e., the HINT message is
the following.
pg_last_xact_insert_timestamp() cannot be executed during recovery.
In the documentation, for the short description of
pg_last_xact_insert_timestamp(), how about something like "returns the
time at which a transaction commit or transaction about record was
last inserted into the transaction log"? Or maybe that's too long.
But the current description doesn't seem to do much other than
recapitulate the function name, so I'm wondering if we can do any
better than that.Agreed. I will change the description per your suggestion.
Done.
I think that instead of hacking up the backend-status copying code to
have a mode where it copies everything, you should just have a
special-purpose function that computes the value you need directly off
the backend status entries themselves. This approach seems like it
both clutters the code and adds lots of extra data copying.Agreed. Will change.
Done.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
pg_last_xact_insert_timestamp_v4.patchtext/x-patch; charset=US-ASCII; name=pg_last_xact_insert_timestamp_v4.patchDownload
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 13996,14001 **** SELECT set_config('log_statement_stats', 'off', false);
--- 13996,14004 ----
<primary>pg_current_xlog_location</primary>
</indexterm>
<indexterm>
+ <primary>pg_last_xact_insert_timestamp</primary>
+ </indexterm>
+ <indexterm>
<primary>pg_start_backup</primary>
</indexterm>
<indexterm>
***************
*** 14049,14054 **** SELECT set_config('log_statement_stats', 'off', false);
--- 14052,14067 ----
</row>
<row>
<entry>
+ <literal><function>pg_last_xact_insert_timestamp()</function></literal>
+ </entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>
+ Get the time at which a transaction commit or transaction abort record
+ was last inserted into the transaction log
+ </entry>
+ </row>
+ <row>
+ <entry>
<literal><function>pg_start_backup(<parameter>label</> <type>text</> <optional>, <parameter>fast</> <type>boolean</> </optional>)</function></literal>
</entry>
<entry><type>text</type></entry>
***************
*** 14175,14180 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 14188,14200 ----
</para>
<para>
+ <function>pg_last_xact_insert_timestamp</> displays the time stamp of last inserted
+ transaction. This is the time at which the commit or abort WAL record for that transaction.
+ If there has been no transaction committed or aborted yet since the server has started,
+ this function returns NULL.
+ </para>
+
+ <para>
For details about proper usage of these functions, see
<xref linkend="continuous-archiving">.
</para>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 867,872 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 867,881 ----
<command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
<para>
+ You can also calculate the lag in time stamp by comparing the last
+ WAL insert time stamp on the primary with the last WAL replay
+ time stamp on the standby. They can be retrieved using
+ <function>pg_last_xact_insert_timestamp</> on the primary and
+ the <function>pg_last_xact_replay_timestamp</> on the standby,
+ respectively (see <xref linkend="functions-admin-backup-table"> and
+ <xref linkend="functions-recovery-info-table"> for details).
+ </para>
+ <para>
You can retrieve a list of WAL sender processes via the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view. Large differences between
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 1066,1071 **** RecordTransactionCommit(void)
--- 1066,1074 ----
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata);
}
+
+ /* Save timestamp of latest transaction commit record */
+ pgstat_report_xact_end_timestamp(xactStopTimestamp);
}
/*
***************
*** 1434,1439 **** RecordTransactionAbort(bool isSubXact)
--- 1437,1445 ----
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT, rdata);
+ /* Save timestamp of latest transaction abort record */
+ pgstat_report_xact_end_timestamp(xlrec.xact_time);
+
/*
* Report the latest async abort LSN, so that the WAL writer knows to
* flush this abort. There's nothing to be gained by delaying this, since
***************
*** 4968,4970 **** xact_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 4974,5000 ----
else
appendStringInfo(buf, "UNKNOWN");
}
+
+ /*
+ * Returns timestamp of latest inserted commit/abort record.
+ *
+ * If there has been no transaction committed or aborted yet since
+ * the server has started, this function returns NULL.
+ */
+ Datum
+ pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS)
+ {
+ TimestampTz xtime;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("pg_last_xact_insert_timestamp() cannot be executed during recovery.")));
+
+ xtime = pgstat_get_last_xact_end_timestamp();
+ if (xtime == 0)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TIMESTAMPTZ(xtime);
+ }
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 2201,2206 **** pgstat_fetch_stat_numbackends(void)
--- 2201,2207 ----
return localNumBackends;
}
+
/*
* ---------
* pgstat_fetch_global() -
***************
*** 2418,2423 **** pgstat_bestart(void)
--- 2419,2429 ----
beentry->st_appname[NAMEDATALEN - 1] = '\0';
beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
+ /*
+ * Don't reset st_xact_end_timestamp because the previous value can still
+ * be referenced to calculate the latest transaction insert timestamp.
+ */
+
beentry->st_changecount++;
Assert((beentry->st_changecount & 1) == 0);
***************
*** 2564,2569 **** pgstat_report_xact_timestamp(TimestampTz tstamp)
--- 2570,2598 ----
Assert((beentry->st_changecount & 1) == 0);
}
+ /*
+ * Report last transaction end timestamp as the specified value.
+ * Zero means there is no finished transaction.
+ */
+ void
+ pgstat_report_xact_end_timestamp(TimestampTz tstamp)
+ {
+ volatile PgBackendStatus *beentry = MyBEEntry;
+
+ if (!beentry)
+ return;
+
+ /*
+ * Update my status entry, following the protocol of bumping
+ * st_changecount before and after. We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ */
+ beentry->st_changecount++;
+ beentry->st_xact_end_timestamp = tstamp;
+ beentry->st_changecount++;
+ Assert((beentry->st_changecount & 1) == 0);
+ }
+
/* ----------
* pgstat_report_waiting() -
*
***************
*** 2752,2757 **** pgstat_get_backend_current_activity(int pid, bool checkUser)
--- 2781,2840 ----
}
+ /* ----------
+ * pgstat_get_last_xact_end_timestamp() -
+ *
+ * Returns timestamp of latest inserted commit/abort record.
+ *
+ * If there has been no transaction committed or aborted yet since
+ * the server has started, this function returns zero.
+ * ----------
+ */
+ TimestampTz
+ pgstat_get_last_xact_end_timestamp(void)
+ {
+ volatile PgBackendStatus *beentry;
+ TimestampTz result = 0;
+ int i;
+
+ Assert(!pgStatRunningInCollector);
+
+ beentry = BackendStatusArray;
+ for (i = 1; i <= MaxBackends; i++)
+ {
+ TimestampTz xtime = 0;
+
+ /*
+ * Follow the protocol of retrying if st_changecount changes while we
+ * copy the entry, or if it's odd. (The check for odd is needed to
+ * cover the case where we are able to completely copy the entry while
+ * the source backend is between increment steps.) We use a volatile
+ * pointer here to ensure the compiler doesn't try to get cute.
+ */
+ for (;;)
+ {
+ int save_changecount = beentry->st_changecount;
+
+ xtime = beentry->st_xact_end_timestamp;
+
+ if (save_changecount == beentry->st_changecount &&
+ (save_changecount & 1) == 0)
+ break;
+
+ /* Make sure we can break out of loop if stuck... */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ if (result < xtime)
+ result = xtime;
+
+ beentry++;
+ }
+
+ return result;
+ }
+
+
/* ------------------------------------------------------------
* Local support functions follow
* ------------------------------------------------------------
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 272,277 **** extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
--- 272,278 ----
extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
+ extern Datum pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 2869,2874 **** DATA(insert OID = 2850 ( pg_xlogfile_name_offset PGNSP PGUID 12 1 0 0 0 f f f t
--- 2869,2876 ----
DESCR("xlog filename and byte offset, given an xlog location");
DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
DESCR("xlog filename, given an xlog location");
+ DATA(insert OID = 3831 ( pg_last_xact_insert_timestamp PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 1184 "" _null_ _null_ _null_ _null_ pg_last_xact_insert_timestamp _null_ _null_ _null_ ));
+ DESCR("timestamp of last insert xact");
DATA(insert OID = 3810 ( pg_is_in_recovery PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ ));
DESCR("true if server is in recovery");
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 623,628 **** typedef struct PgBackendStatus
--- 623,631 ----
TimestampTz st_xact_start_timestamp;
TimestampTz st_activity_start_timestamp;
+ /* Time when last transaction ended */
+ TimestampTz st_xact_end_timestamp;
+
/* Database OID, owning user's OID, connection client address */
Oid st_databaseid;
Oid st_userid;
***************
*** 718,725 **** extern void pgstat_bestart(void);
--- 721,730 ----
extern void pgstat_report_activity(const char *cmd_str);
extern void pgstat_report_appname(const char *appname);
extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+ extern void pgstat_report_xact_end_timestamp(TimestampTz tstamp);
extern void pgstat_report_waiting(bool waiting);
extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+ extern TimestampTz pgstat_get_last_xact_end_timestamp(void);
extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
On Tue, Oct 4, 2011 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 3, 2011 at 4:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Oct 3, 2011 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Sorry, but I still don't really think it's fair to say that you've
proposed a solution to this problem. Or if you have, neither I nor
Fujii Masao understand that proposal well enough to decide whether we
like it.Arguing between trenches doesn't really get us anywhere.
As ever, when someone claims to have a better solution then it is up
to them to prove that is the case.So... are you going to do that?
Simon, could you? If your proposal turns out to be better than mine, I'd be
happy to agree to drop my patch and adopt yours.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Oct 4, 2011 at 1:05 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Oct 4, 2011 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 3, 2011 at 4:25 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Oct 3, 2011 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Sorry, but I still don't really think it's fair to say that you've
proposed a solution to this problem. Or if you have, neither I nor
Fujii Masao understand that proposal well enough to decide whether we
like it.Arguing between trenches doesn't really get us anywhere.
As ever, when someone claims to have a better solution then it is up
to them to prove that is the case.So... are you going to do that?
Simon, could you? If your proposal turns out to be better than mine, I'd be
happy to agree to drop my patch and adopt yours.
Yes, will do.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 4, 2011 at 9:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Simon, could you? If your proposal turns out to be better than mine, I'd be
happy to agree to drop my patch and adopt yours.Yes, will do.
Simon,
I believe that we are still waiting for this.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 13, 2011 at 14:25, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Oct 4, 2011 at 9:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Simon, could you? If your proposal turns out to be better than mine, I'd be
happy to agree to drop my patch and adopt yours.Yes, will do.
Simon,
I believe that we are still waiting for this.
Are we going to hear anything back on this one for the current CF? If
not, I suggest we go with Fujiis version for now - we can always
change it for a potentially better version later.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 10/02/2011 07:10 AM, Robert Haas wrote:
Your proposals involve sending additional information from the master to
the slave, but the slave already knows both its WAL position and the
timestamp of the transaction it has most recently replayed, because
the startup process on the slave tracks that information and publishes
it in shared memory. On the master, however, only the WAL position is
centrally tracked; the transaction timestamps are not.
This seems to be the question that was never really answered well enough
to satisfy anyone, so let's rewind to here for a bit. I wasn't
following this closely until now, so I just did my own review from
scratch against the latest patch. I found a few issues, and I don't
think all of them have been vented here yet:
-It adds overhead at every commit, even for people who aren't using it.
Probably not enough to matter, but it's yet another thing going through
the often maligned as too heavy pgstat system, often.
-In order to measure lag this way, you need access to both the master
and the standby. Yuck, dblink or application code doing timestamp math,
either idea makes me shudder. It would be nice to answer "how many
seconds of lag do have?" directly from the standby. Ideally I would
like both the master and standby to know those numbers.
-After the server is restarted, you get a hole in the monitoring data
until the first transaction is committed or aborted. The way the
existing pg_last_xact_replay_timestamp side of this computation goes
NULL for some unpredictable period after restart is going to drive
monitoring systems batty. Building this new facility similarly will
force everyone who writes a lag monitor to special case for that
condition on both sides. Sure, that's less user hostile than the status
quo, but it isn't going to help PostgreSQL's battered reputation in the
area of monitoring either.
-The transaction ID advancing is not a very good proxy for real-world
lag. You can have a very large amount of writing in between commits.
The existing lag monitoring possibilities focus on WAL position instead,
which is better correlated with the sort of activity that causes lag.
Making one measurement of lag WAL position based (the existing ones) and
another based on transaction advance (this proposal) is bound to raise
some "which of these is the correct lag?" questions, when the two
diverge. Large data loading operations come to mind as a not unusual at
all situation where this would happen.
I'm normally a fan of building the simplest thing that will do something
useful, and this patch succeeds there. But the best data to collect
needs to have a timestamp that keeps moving forward in a way that
correlates reasonably well to the system WAL load. Using the
transaction ID doesn't do that. Simon did some hand waving around
sending a timestamp every checkpoint. That would allow the standby to
compute its own lag, limit overhead to something much lighter than
per-transaction, and better track write volume. There could still be a
bigger than normal discontinuity after server restart, if the server was
down a while, but at least there wouldn't ever be a point where the
value was returned by the master but was NULL.
But as Simon mentioned in passing, it will bloat the WAL streams for
everyone. Here's the as yet uncoded mini-proposal that seems to have
slid by uncommented upon:
"We can send regular special messages from WALSender to WALReceiver that
do not form part of the WAL stream, so we don't bulk
up WAL archives. (i.e. don't use "w" messages)."
Here's my understanding of how this would work. Each time a
commit/abort record appears in the WAL, that updates XLogCtl with the
associated timestamp. If WALReceiver received regular updates
containing the master's clock timestamp and stored them
similarly--either via regular streaming or the heartbeat--it could
compute lag with the same resolution as this patch aims to do, for the
price of two spinlocks: just subtract the two timestamps. No overhead
on the master, and lag can be directly computed and queried from each
standby. If you want to feed that data back to the master so it can
appear in pg_stat_replication in both places, send it periodically via
the same channel sync rep and standby feedback use. I believe that will
be cheap in many cases, since it can piggyback on messages that will
still be quite small relative to minimum packet size on most networks.
(Exception for compressed/encrypted networks where packets aren't
discrete in this way doesn't seem that relevant, presuming that if
you're doing one of those I would think this overhead is the least of
your worries)
Now, there's still one problem here. This doesn't address the "lots of
write volume but no commit" problem any better than the proposed patch
does. Maybe there's some fancy way to inject it along with the received
WAL on the standby, I'm out of brain power to think through that right
now. One way to force this to go away is to add a "time update" WAL
record type here, one that only appears in some of these unusual
situations. My first idea for keeping that overhead small is to put the
injection logic for it at WAL file switch time. If you haven't
committed a transaction during that entire 16MB of writes, start the new
WAL segment with a little time update record. That would ensure you
never do too many writes before the WAL replay clock is updated, while
avoiding this record type altogether during commit-heavy periods.
The WAL sender and receiver pair should be able to work out how to
handle the other corner case, where neither WAL advance nor transactions
occured. You don't want lag to keep increasing forever there. If the
transaction ID hasn't moved forward, the heartbeat can still update the
time. I think if you do that, all you'd need to special case is that if
master XID=standby replay XID, lag time should be forced to 0 instead of
being its usual value (master timestamp - last standby commit/abort record).
As for "what do you return if asked for lag before the first data with a
timestamp appears?" problem, there are surely still cases where that
happens in this approach. I'm less concerned about that if there's only
a single function involved though. The part that worries me is the high
probability people are going to do NULL math wrong if they have to
combine two values from different servers, and not catch it during
development. If I had a penny for every NULL handling mistake ever made
in that category, I'd be writing this from my island lair instead of
Baltimore. I'm more comfortable saying "this lag interval might be
NULL"; if that's the only exception people have to worry about, that
doesn't stress me as much.
Last disclaimer: if this argument is persuasive enough to punt
pg_last_xact_insert_timestamp for now, in favor of a new specification
along the lines I've outlined here, I have no intention of letting
myself or Simon end up blocking progress on that. This one is important
enough for 9.2 that if there's not another patch offering the same
feature sitting in the queue by the end of the month, I'll ask someone
else here to sit on this problem a while. Probably Jaime, because he's
suffering with this problem as much as I am. We maintain code in repmgr
to do this job with brute force: it saves a history table to translate
XID->timestamps and works out lag from there. Window function query +
constantly churning monitoring table=high overhead. It would really be
great to EOL that whole messy section as deprecated starting in 9.2.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Sat, Dec 10, 2011 at 12:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
"We can send regular special messages from WALSender to WALReceiver that do
not form part of the WAL stream, so we don't bulk
up WAL archives. (i.e. don't use "w" messages)."Here's my understanding of how this would work.
Let me explain a little more and provide a very partial patch.
We define a new replication protocol message 'k' which sends a
keepalive from primary to standby when there is no WAL to send. The
message does not form part of the WAL stream so does not bloat WAL
files, nor cause them to fill when unattended.
Keepalives contain current end of WAL and a current timestamp.
Keepalive processing is all done on the standby and there is no
overhead on a primary which does not use replication. There is a
slight overhead on primary for keepalives but this happens only when
there are no writes. On the standby we already update shared state
when we receive some data, so not much else to do there.
When the standby has applied up to the end of WAL the replication
delay is receipt time - send time of keepalive.
When standby receives a data packet it records WAL ptr and time. As
standby applies each chunk it removes the record for each data packet
and sets the last applied timestamp.
If standby falls behind the number of data packet records will build
up, so we begin to keep record every 2 packets, then every 4 packets
etc. So the further the standby falls behind the less accurately we
record the replication delay - though the accuracy remains
proportional to the delay.
To complete the patch I need to
* send the keepalive messages when no WAL outstanding
* receive the messages
* store timestamp info for data and keepalives
* progressively filter the messages if we get too many
I will be working on this patch some more this week.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
primary_keepalive.v1.patchtext/x-patch; charset=US-ASCII; name=primary_keepalive.v1.patchDownload
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index d6332e5..71c40cc 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1467,6 +1467,54 @@ The commands accepted in walsender mode are:
<variablelist>
<varlistentry>
<term>
+ Primary keepalive message (B)
+ </term>
+ <listitem>
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term>
+ Byte1('k')
+ </term>
+ <listitem>
+ <para>
+ Identifies the message as a sender keepalive.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte8
+ </term>
+ <listitem>
+ <para>
+ The current end of WAL on the server, given in
+ XLogRecPtr format.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte8
+ </term>
+ <listitem>
+ <para>
+ The server's system clock at the time of transmission,
+ given in TimestampTz format.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term>
Standby status update (F)
</term>
<listitem>
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 656c8fc..1c73d35 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -40,6 +40,21 @@ typedef struct
} WalDataMessageHeader;
/*
+ * Keepalive message from primary (message type 'k'). (lowercase k)
+ * This is wrapped within a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef struct
+{
+ /* Current end of WAL on the sender */
+ XLogRecPtr walEnd;
+
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} PrimaryKeepaliveMessage;
+
+/*
* Reply message from standby (message type 'r'). This is wrapped within
* a CopyData message at the FE/BE protocol level.
*
On Sat, Dec 10, 2011 at 7:29 AM, Greg Smith <greg@2ndquadrant.com> wrote:
-It adds overhead at every commit, even for people who aren't using it.
Probably not enough to matter, but it's yet another thing going through the
often maligned as too heavy pgstat system, often.
The bit about the pgstat system being too heavy is a red herring; this
could equally well be stored in the PGPROC. So, the overhead is
basically one additional store to shared memory per commit, and we can
arrange for that store to involve a cache line that has to be touched
anyway (by ProcArrayEndTransaction). I suppose you can make the
argument that that still isn't free, but there aren't very many places
in the code where we worry about effects this small. Operations like
AcceptInvalidationMessages() or LockRelationOid() happen more often
than transaction commit, and yet we were happy for years with a system
where AcceptInvalidationMessages() did three spinlock
acquire-and-release cycles that were unnecessary in 99% of cases.
That cost was vastly greater than what we're talking about here, and
it affected an operation that is more frequent than this one.
[ usability complaints ]
Fair enough.
I'm normally a fan of building the simplest thing that will do something
useful, and this patch succeeds there. But the best data to collect needs
to have a timestamp that keeps moving forward in a way that correlates
reasonably well to the system WAL load. Using the transaction ID doesn't do
that. Simon did some hand waving around sending a timestamp every
checkpoint. That would allow the standby to compute its own lag, limit
overhead to something much lighter than per-transaction, and better track
write volume. There could still be a bigger than normal discontinuity after
server restart, if the server was down a while, but at least there wouldn't
ever be a point where the value was returned by the master but was NULL.But as Simon mentioned in passing, it will bloat the WAL streams for
everyone.
But, as with this proposal, the overhead seems largely irrelevant; the
question is whether it actually solves the problem. Unless I am
misunderstanding, we are talking about 4 bytes of WAL per checkpoint
cycle, which strikes me as even less worth worrying about than one
store to shared memory per transaction commit. So, personally, I see
no reason to fret about the overhead. But I'm skeptical that anything
that we only update once per checkpoint cycle will help much in
calculating an accurate lag value. It also strikes me that anything
that is based on augmenting the walsender/walreceiver protocol leaves
anyone who is using WAL shipping out in the cold. I'm not clear from
the comments you or Simon have made how important you think that use
case still is.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 12, 2011 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It also strikes me that anything
that is based on augmenting the walsender/walreceiver protocol leaves
anyone who is using WAL shipping out in the cold. I'm not clear from
the comments you or Simon have made how important you think that use
case still is.
archive_timeout > 0 works just fine at generating files even when
quiet, or if it does not, it is a bug.
So I don't understand your comments, please explain.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 12, 2011 at 9:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Dec 12, 2011 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It also strikes me that anything
that is based on augmenting the walsender/walreceiver protocol leaves
anyone who is using WAL shipping out in the cold. I'm not clear from
the comments you or Simon have made how important you think that use
case still is.archive_timeout > 0 works just fine at generating files even when
quiet, or if it does not, it is a bug.So I don't understand your comments, please explain.
If the standby has restore_command set but not primary_conninfo, then
it will never make a direct connection to the master. So anything
that's based on extending that protocol won't get used in that case.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 12, 2011 at 2:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 12, 2011 at 9:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Dec 12, 2011 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It also strikes me that anything
that is based on augmenting the walsender/walreceiver protocol leaves
anyone who is using WAL shipping out in the cold. I'm not clear from
the comments you or Simon have made how important you think that use
case still is.archive_timeout > 0 works just fine at generating files even when
quiet, or if it does not, it is a bug.So I don't understand your comments, please explain.
If the standby has restore_command set but not primary_conninfo, then
it will never make a direct connection to the master. So anything
that's based on extending that protocol won't get used in that case.
Got that, but now explain the reason for saying such people are "out
in the cold".
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Dec 12, 2011 at 9:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Dec 12, 2011 at 2:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 12, 2011 at 9:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Dec 12, 2011 at 1:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It also strikes me that anything
that is based on augmenting the walsender/walreceiver protocol leaves
anyone who is using WAL shipping out in the cold. I'm not clear from
the comments you or Simon have made how important you think that use
case still is.archive_timeout > 0 works just fine at generating files even when
quiet, or if it does not, it is a bug.So I don't understand your comments, please explain.
If the standby has restore_command set but not primary_conninfo, then
it will never make a direct connection to the master. So anything
that's based on extending that protocol won't get used in that case.Got that, but now explain the reason for saying such people are "out
in the cold".
By that I only meant that if we add a lag-monitoring feature that
relies on the streaming replication protocol, then people who are not
using the streaming replication replication protocol will be unable to
monitor lag (or at least, the will be unable to do it any better than
they can do it today).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/12/2011 08:45 AM, Robert Haas wrote:
But I'm skeptical that anything that we only update once per
checkpoint cycle will help much in
calculating an accurate lag value.
I'm sure there is no upper bound on how much WAL lag you can build up
between commit/abort records either; they can be far less frequent than
checkpoints. All it takes is a multi-hour COPY with no other commits to
completely hose lag measured by that advance, and that is not an unusual
situation at all. Overnight daily ETL or reporting MV-ish roll-ups,
scheduled specifically for when no one is normally at the office, are
the first thing that spring to mind.
Anyway, I wasn't suggesting checkpoints as anything other than a worst
case behavior. We can always thump out more frequent updates to reduce
lag, and in what I expect to the most common case the WAL send/receive
stuff will usually do much better. I see the XID vs. WAL position UI
issues as being fundamentally unsolvable, which really bothers me. I'd
have preferred to run screaming away from this thread if it hadn't.
It also strikes me that anything that is based on augmenting the walsender/walreceiver protocol leaves
anyone who is using WAL shipping out in the cold. I'm not clear from
the comments you or Simon have made how important you think that use
case still is.
There's a number of reasons why we might want more timestamps streamed
into the WAL; this might be one. We'd just need one to pop out one as
part of the archive_timeout switch to in theory make it possible for
these people to be happy. I think Simon was hoping to avoid WAL
timestamps, I wouldn't bet too much on that myself. The obvious
implementation problem here is that the logical place to put the
timestamps is right at the end of the WAL file, just before it's closed
for archiving. But that position isn't known until you've at least
started processing it, which you clearly are not doing fast enough if
lag exists.
As far as who's still important here, two observations. Note that the
pg_last_xact_insert_timestamp approach can fail to satisfy WAL shipping
people who are going to a separate network, where it's impractical to
connect to both servers with libpq. I have some customers who like
putting a one-way WAL wall (sorry) between production and the standby
server, with the log shipping being the only route between them; that's
one reason why they might still be doing this instead of using
streaming. There's really no good way to make these people happy and
provide time lag monitoring inside the database.
I was actually the last person I recall who suggested some extra
monitoring mainly aimed at WAL shipping environments:
http://archives.postgresql.org/pgsql-hackers/2010-01/msg01522.php Had
some pg_standby changes I was also working on back then, almost two
years ago. I never circled back to any of it due to having zero demand
since 9.0 shipped, the requests I had been regularly getting about this
all dried up. While I'm all for keeping new features working for
everyone when it doesn't hold progress back, it's not unreasonable to
recognize we can't support every monitoring option through all of the
weird ways WAL files can move around. pg_stat_replication isn't very
helpful for 9.0+ WAL shippers either, yet they still go on doing their
thing.
In the other direction, people who will immediately adopt the latest
hotness, cascading is a whole new layer of use case concerns on top of
the ones considered so far. Now you're talking two layers of
connections users have to navigate though to compute master->cascaded
standby lag. Cascade the WALSender timestamps instead, which seems
pretty simple to do, and then people can just ask their local standby.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Mon, Dec 12, 2011 at 12:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Dec 10, 2011 at 12:29 PM, Greg Smith <greg@2ndquadrant.com> wrote:
"We can send regular special messages from WALSender to WALReceiver that do
not form part of the WAL stream, so we don't bulk
up WAL archives. (i.e. don't use "w" messages)."Here's my understanding of how this would work.
Let me explain a little more and provide a very partial patch.
We define a new replication protocol message 'k' which sends a
keepalive from primary to standby when there is no WAL to send. The
message does not form part of the WAL stream so does not bloat WAL
files, nor cause them to fill when unattended.Keepalives contain current end of WAL and a current timestamp.
Keepalive processing is all done on the standby and there is no
overhead on a primary which does not use replication. There is a
slight overhead on primary for keepalives but this happens only when
there are no writes. On the standby we already update shared state
when we receive some data, so not much else to do there.When the standby has applied up to the end of WAL the replication
delay is receipt time - send time of keepalive.
Patch introduces regular keepalives from WALsender to WALreceiver,
using a new protocol message 'k'.
These are sent at intervals of a fraction of replication_delay or 10s
if not set,
whenever no WAL records have been sent recently.
Patch exposes info used for standby_delay calculation as used in 9.0+
In addition introduces direct calculations of replication apply delay
and replication transfer latency, both in ms.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
primary_keepalive.v2.patchtext/x-patch; charset=US-ASCII; name=primary_keepalive.v2.patchDownload
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index d6332e5..71c40cc 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1467,6 +1467,54 @@ The commands accepted in walsender mode are:
<variablelist>
<varlistentry>
<term>
+ Primary keepalive message (B)
+ </term>
+ <listitem>
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term>
+ Byte1('k')
+ </term>
+ <listitem>
+ <para>
+ Identifies the message as a sender keepalive.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte8
+ </term>
+ <listitem>
+ <para>
+ The current end of WAL on the server, given in
+ XLogRecPtr format.
+ </para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>
+ Byte8
+ </term>
+ <listitem>
+ <para>
+ The server's system clock at the time of transmission,
+ given in TimestampTz format.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term>
Standby status update (F)
</term>
<listitem>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9d96044..77e2760 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -452,6 +452,9 @@ typedef struct XLogCtlData
XLogRecPtr recoveryLastRecPtr;
/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
TimestampTz recoveryLastXTime;
+ /* timestamp of when we started replaying the current chunk of WAL data,
+ * only relevant for replication or archive recovery */
+ TimestampTz currentChunkStartTime;
/* end of the last record restored from the archive */
XLogRecPtr restoreLastRecPtr;
/* Are we requested to pause recovery? */
@@ -606,6 +609,7 @@ static void exitArchiveRecovery(TimeLineID endTLI,
static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
static void recoveryPausesHere(void);
static void SetLatestXTime(TimestampTz xtime);
+static void SetCurrentChunkStartTime(TimestampTz xtime);
static void CheckRequiredParameterValues(void);
static void XLogReportParameters(void);
static void LocalSetXLogInsertAllowed(void);
@@ -5846,6 +5850,41 @@ GetLatestXTime(void)
}
/*
+ * Save timestamp of the next chunk of WAL records to apply.
+ *
+ * We keep this in XLogCtl, not a simple static variable, so that it can be
+ * seen by all backends.
+ */
+static void
+SetCurrentChunkStartTime(TimestampTz xtime)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xlogctl->currentChunkStartTime = xtime;
+ SpinLockRelease(&xlogctl->info_lck);
+}
+
+/*
+ * Fetch timestamp of latest processed commit/abort record.
+ * Startup process maintains an accurate local copy in XLogReceiptTime
+ */
+TimestampTz
+GetCurrentChunkReplayStartTime(void)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+ TimestampTz xtime;
+
+ SpinLockAcquire(&xlogctl->info_lck);
+ xtime = xlogctl->currentChunkStartTime;
+ SpinLockRelease(&xlogctl->info_lck);
+
+ return xtime;
+}
+
+/*
* Returns time of receipt of current chunk of XLOG data, as well as
* whether it was received from streaming replication or from archives.
*/
@@ -6388,6 +6427,7 @@ StartupXLOG(void)
xlogctl->replayEndRecPtr = ReadRecPtr;
xlogctl->recoveryLastRecPtr = ReadRecPtr;
xlogctl->recoveryLastXTime = 0;
+ xlogctl->currentChunkStartTime = 0;
xlogctl->recoveryPause = false;
SpinLockRelease(&xlogctl->info_lck);
@@ -9688,7 +9728,10 @@ retry:
{
havedata = true;
if (!XLByteLT(*RecPtr, latestChunkStart))
+ {
XLogReceiptTime = GetCurrentTimestamp();
+ SetCurrentChunkStartTime(XLogReceiptTime);
+ }
}
else
havedata = false;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 1f12dcb..8106d6b 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -124,6 +124,7 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
static void XLogWalRcvSendReply(void);
static void XLogWalRcvSendHSFeedback(void);
+static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
/* Signal handlers */
static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -218,6 +219,10 @@ WalReceiverMain(void)
/* Fetch information required to start streaming */
strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
startpoint = walrcv->receiveStart;
+
+ /* Initialise to a sanish value */
+ walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime = GetCurrentTimestamp();
+
SpinLockRelease(&walrcv->mutex);
/* Arrange to clean up at walreceiver exit */
@@ -433,12 +438,28 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
errmsg_internal("invalid WAL message received from primary")));
/* memcpy is required here for alignment reasons */
memcpy(&msghdr, buf, sizeof(WalDataMessageHeader));
+
+ ProcessWalSndrMessage(msghdr.walEnd, msghdr.sendTime);
+
buf += sizeof(WalDataMessageHeader);
len -= sizeof(WalDataMessageHeader);
-
XLogWalRcvWrite(buf, len, msghdr.dataStart);
break;
}
+ case 'k': /* Keepalive */
+ {
+ PrimaryKeepaliveMessage keepalive;
+
+ if (len != sizeof(PrimaryKeepaliveMessage))
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg_internal("invalid keepalive message received from primary")));
+ /* memcpy is required here for alignment reasons */
+ memcpy(&keepalive, buf, sizeof(PrimaryKeepaliveMessage));
+
+ ProcessWalSndrMessage(keepalive.walEnd, keepalive.sendTime);
+ break;
+ }
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -711,3 +732,27 @@ XLogWalRcvSendHSFeedback(void)
memcpy(&buf[1], &feedback_message, sizeof(StandbyHSFeedbackMessage));
walrcv_send(buf, sizeof(StandbyHSFeedbackMessage) + 1);
}
+
+/*
+ * Keep track of important messages from primary.
+ */
+static void
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalRcvData *walrcv = WalRcv;
+
+ TimestampTz lastMsgReceiptTime = GetCurrentTimestamp();
+
+ /* Update shared-memory status */
+ SpinLockAcquire(&walrcv->mutex);
+ walrcv->lastMsgSendTime = sendTime;
+ walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
+ SpinLockRelease(&walrcv->mutex);
+
+ elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d transfer latency %d",
+ timestamptz_to_str(sendTime),
+ timestamptz_to_str(lastMsgReceiptTime),
+ GetReplicationApplyDelay(),
+ GetReplicationTransferLatency());
+}
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 5bce1c3..054355b 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -28,6 +28,7 @@
#include "replication/walreceiver.h"
#include "storage/pmsignal.h"
#include "storage/shmem.h"
+#include "utils/timestamp.h"
WalRcvData *WalRcv = NULL;
@@ -238,3 +239,65 @@ GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart)
return recptr;
}
+
+/*
+ * Returns the replication apply delay in ms
+ */
+int
+GetReplicationApplyDelay(void)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalRcvData *walrcv = WalRcv;
+
+ XLogRecPtr receivePtr;
+ XLogRecPtr replayPtr;
+
+ long secs;
+ int usecs;
+
+ SpinLockAcquire(&walrcv->mutex);
+ receivePtr = walrcv->receivedUpto;
+ SpinLockRelease(&walrcv->mutex);
+
+ replayPtr = GetXLogReplayRecPtr(NULL);
+
+ if (XLByteLE(receivePtr, replayPtr))
+ return 0;
+
+ TimestampDifference(GetCurrentChunkReplayStartTime(),
+ GetCurrentTimestamp(),
+ &secs, &usecs);
+
+ return (((int) secs * 1000) + (usecs / 1000));
+}
+
+/*
+ * Returns the network latency in ms, note that this includes any
+ * difference in clock settings between the servers, as well as timezone.
+ */
+int
+GetReplicationTransferLatency(void)
+{
+ /* use volatile pointer to prevent code rearrangement */
+ volatile WalRcvData *walrcv = WalRcv;
+
+ TimestampTz lastMsgSendTime;
+ TimestampTz lastMsgReceiptTime;
+
+ long secs = 0;
+ int usecs = 0;
+ int ms;
+
+ SpinLockAcquire(&walrcv->mutex);
+ lastMsgSendTime = walrcv->lastMsgSendTime;
+ lastMsgReceiptTime = walrcv->lastMsgReceiptTime;
+ SpinLockRelease(&walrcv->mutex);
+
+ TimestampDifference(lastMsgSendTime,
+ lastMsgReceiptTime,
+ &secs, &usecs);
+
+ ms = ((int) secs * 1000) + (usecs / 1000);
+
+ return ms;
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index ea86520..ed7298b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -131,6 +131,7 @@ static void ProcessStandbyMessage(void);
static void ProcessStandbyReplyMessage(void);
static void ProcessStandbyHSFeedbackMessage(void);
static void ProcessRepliesIfAny(void);
+static void WalSndKeepalive(char *msgbuf);
/* Main entry point for walsender process */
@@ -823,30 +824,24 @@ WalSndLoop(void)
*/
if (caughtup || pq_is_send_pending())
{
- TimestampTz finish_time = 0;
- long sleeptime = -1;
+ TimestampTz timeout = 0;
+ long sleeptime = 10000; /* 10 s */
int wakeEvents;
wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH |
- WL_SOCKET_READABLE;
+ WL_SOCKET_READABLE | WL_TIMEOUT;
+
if (pq_is_send_pending())
wakeEvents |= WL_SOCKET_WRITEABLE;
+ else
+ WalSndKeepalive(output_message);
/* Determine time until replication timeout */
if (replication_timeout > 0)
{
- long secs;
- int usecs;
-
- finish_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
+ timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
replication_timeout);
- TimestampDifference(GetCurrentTimestamp(),
- finish_time, &secs, &usecs);
- sleeptime = secs * 1000 + usecs / 1000;
- /* Avoid Assert in WaitLatchOrSocket if timeout is past */
- if (sleeptime < 0)
- sleeptime = 0;
- wakeEvents |= WL_TIMEOUT;
+ sleeptime = 1 + (replication_timeout / 10);
}
/* Sleep until something happens or replication timeout */
@@ -859,7 +854,7 @@ WalSndLoop(void)
* timeout ... he's supposed to reply *before* that.
*/
if (replication_timeout > 0 &&
- GetCurrentTimestamp() >= finish_time)
+ GetCurrentTimestamp() >= timeout)
{
/*
* Since typically expiration of replication timeout means
@@ -1627,6 +1622,23 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
return (Datum) 0;
}
+static void
+WalSndKeepalive(char *msgbuf)
+{
+ PrimaryKeepaliveMessage keepalive_message;
+
+ /* Construct a new message */
+ keepalive_message.walEnd = sentPtr;
+ keepalive_message.sendTime = GetCurrentTimestamp();
+
+ elog(DEBUG2, "sending replication keepalive");
+
+ /* Prepend with the message type and send it. */
+ msgbuf[0] = 'k';
+ memcpy(msgbuf + 1, &keepalive_message, sizeof(PrimaryKeepaliveMessage));
+ pq_putmessage_noblock('d', msgbuf, sizeof(PrimaryKeepaliveMessage) + 1);
+}
+
/*
* This isn't currently used for anything. Monitoring tools might be
* interested in the future, and we'll need something like this in the
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1b31414..5a21017 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -301,6 +301,7 @@ extern XLogRecPtr GetXLogWriteRecPtr(void);
extern bool RecoveryIsPaused(void);
extern void SetRecoveryPause(bool recoveryPause);
extern TimestampTz GetLatestXTime(void);
+extern TimestampTz GetCurrentChunkReplayStartTime(void);
extern void UpdateControlFile(void);
extern uint64 GetSystemIdentifier(void);
diff --git a/src/include/replication/walprotocol.h b/src/include/replication/walprotocol.h
index 656c8fc..053376d 100644
--- a/src/include/replication/walprotocol.h
+++ b/src/include/replication/walprotocol.h
@@ -17,6 +17,20 @@
/*
+ * All messages from WalSender must contain these fields to allow us to
+ * correctly calculate the replication delay.
+ */
+typedef struct
+{
+ /* Current end of WAL on the sender */
+ XLogRecPtr walEnd;
+
+ /* Sender's system clock at the time of transmission */
+ TimestampTz sendTime;
+} WalSndrMessage;
+
+
+/*
* Header for a WAL data message (message type 'w'). This is wrapped within
* a CopyData message at the FE/BE protocol level.
*
@@ -40,6 +54,14 @@ typedef struct
} WalDataMessageHeader;
/*
+ * Keepalive message from primary (message type 'k'). (lowercase k)
+ * This is wrapped within a CopyData message at the FE/BE protocol level.
+ *
+ * Note that the data length is not specified here.
+ */
+typedef WalSndrMessage PrimaryKeepaliveMessage;
+
+/*
* Reply message from standby (message type 'r'). This is wrapped within
* a CopyData message at the FE/BE protocol level.
*
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 77f5252..926730c 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -79,6 +79,12 @@ typedef struct
XLogRecPtr latestChunkStart;
/*
+ * Time of send and receive of any message received.
+ */
+ TimestampTz lastMsgSendTime;
+ TimestampTz lastMsgReceiptTime;
+
+ /*
* connection string; is used for walreceiver to connect with the primary.
*/
char conninfo[MAXCONNINFO];
@@ -112,5 +118,7 @@ extern void ShutdownWalRcv(void);
extern bool WalRcvInProgress(void);
extern void RequestXLogStreaming(XLogRecPtr recptr, const char *conninfo);
extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart);
+extern int GetReplicationApplyDelay(void);
+extern int GetReplicationTransferLatency(void);
#endif /* _WALRECEIVER_H */
Hi,
On 2011-10-04 20:52:59 +0900, Fujii Masao wrote:
*** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *************** *** 1066,1071 **** RecordTransactionCommit(void) --- 1066,1074 ----(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata); } + + /* Save timestamp of latest transaction commit record */ + pgstat_report_xact_end_timestamp(xactStopTimestamp); }
Perhaps that pgstat_report() should instead be combined with the
pgstat_report_xact_timestamp(0) in CommitTransaction()? Then the number
of changecount increases and cacheline references would stay the
same. The only thing that'd change would be a single additional
assignment.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 11, 2014 at 4:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2011-10-04 20:52:59 +0900, Fujii Masao wrote:
*** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *************** *** 1066,1071 **** RecordTransactionCommit(void) --- 1066,1074 ----(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata); } + + /* Save timestamp of latest transaction commit record */ + pgstat_report_xact_end_timestamp(xactStopTimestamp); }Perhaps that pgstat_report() should instead be combined with the
pgstat_report_xact_timestamp(0) in CommitTransaction()? Then the number
of changecount increases and cacheline references would stay the
same. The only thing that'd change would be a single additional
assignment.
Sounds good suggestion.
While reading the patch again, I found it didn't handle the COMMIT/ABORT
PREPARED case properly. According to the commit e74e090, now
pg_last_xact_replay_timestamp() returns the timestamp of COMMIT/ABORT PREPARED.
pg_last_xact_insert_timestamp() is mainly expected to be used to calculate
the replication delay, so it also needs to return that timestam. But the patch
didn't change 2PC code at all. We need to add pgstat_report_xact_end_timestamp()
into FinishPreparedTransaction(), RecordTransactionCommitPrepared() or
RecordTransactionAbortPrepared().
Regards,
--
Fujii Masao
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 11, 2014 at 8:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Aug 11, 2014 at 4:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2011-10-04 20:52:59 +0900, Fujii Masao wrote:
*** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *************** *** 1066,1071 **** RecordTransactionCommit(void) --- 1066,1074 ----(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata); } + + /* Save timestamp of latest transaction commit record */ + pgstat_report_xact_end_timestamp(xactStopTimestamp); }Perhaps that pgstat_report() should instead be combined with the
pgstat_report_xact_timestamp(0) in CommitTransaction()? Then the number
of changecount increases and cacheline references would stay the
same. The only thing that'd change would be a single additional
assignment.Sounds good suggestion.
I attached the updated version of the patch. I changed pgstat_report_xx
functions like Andres suggested.
While reading the patch again, I found it didn't handle the COMMIT/ABORT
PREPARED case properly. According to the commit e74e090, now
pg_last_xact_replay_timestamp() returns the timestamp of COMMIT/ABORT PREPARED.
pg_last_xact_insert_timestamp() is mainly expected to be used to calculate
the replication delay, so it also needs to return that timestam. But the patch
didn't change 2PC code at all. We need to add pgstat_report_xact_end_timestamp()
into FinishPreparedTransaction(), RecordTransactionCommitPrepared() or
RecordTransactionAbortPrepared().
Done.
Regards,
--
Fujii Masao
Attachments:
pg_last_xact_insert_timestamp_v4.patchtext/x-patch; charset=US-ASCII; name=pg_last_xact_insert_timestamp_v4.patchDownload
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 16116,16121 **** SELECT set_config('log_statement_stats', 'off', false);
--- 16116,16124 ----
<primary>pg_current_xlog_location</primary>
</indexterm>
<indexterm>
+ <primary>pg_last_xact_insert_timestamp</primary>
+ </indexterm>
+ <indexterm>
<primary>pg_start_backup</primary>
</indexterm>
<indexterm>
***************
*** 16180,16185 **** SELECT set_config('log_statement_stats', 'off', false);
--- 16183,16195 ----
</row>
<row>
<entry>
+ <literal><function>pg_last_xact_insert_timestamp()</function></literal>
+ </entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>Get last transaction log insert time stamp</entry>
+ </row>
+ <row>
+ <entry>
<literal><function>pg_start_backup(<parameter>label</> <type>text</> <optional>, <parameter>fast</> <type>boolean</> </optional>)</function></literal>
</entry>
<entry><type>pg_lsn</type></entry>
***************
*** 16334,16339 **** postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 16344,16356 ----
</para>
<para>
+ <function>pg_last_xact_insert_timestamp</> displays the time stamp of last inserted
+ transaction. This is the time at which the commit or abort WAL record for that transaction.
+ If there has been no transaction committed or aborted yet since the server has started,
+ this function returns NULL.
+ </para>
+
+ <para>
For details about proper usage of these functions, see
<xref linkend="continuous-archiving">.
</para>
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 862,867 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 862,876 ----
<command>ps</> command (see <xref linkend="monitoring-ps"> for details).
</para>
<para>
+ You can also calculate the lag in time stamp by comparing the last
+ WAL insert time stamp on the primary with the last WAL replay
+ time stamp on the standby. They can be retrieved using
+ <function>pg_last_xact_insert_timestamp</> on the primary and
+ the <function>pg_last_xact_replay_timestamp</> on the standby,
+ respectively (see <xref linkend="functions-admin-backup-table"> and
+ <xref linkend="functions-recovery-info-table"> for details).
+ </para>
+ <para>
You can retrieve a list of WAL sender processes via the
<link linkend="monitoring-stats-views-table">
<literal>pg_stat_replication</></link> view. Large differences between
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 156,167 **** static void RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
! bool initfileinval);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
! RelFileNode *rels);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
--- 156,169 ----
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
! bool initfileinval,
! TimestampTz *lastXactTime);
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
! RelFileNode *rels,
! TimestampTz *lastXactTime);
static void ProcessRecords(char *bufptr, TransactionId xid,
const TwoPhaseCallback callbacks[]);
static void RemoveGXact(GlobalTransaction gxact);
***************
*** 1353,1358 **** FinishPreparedTransaction(const char *gid, bool isCommit)
--- 1355,1361 ----
int ndelrels;
SharedInvalidationMessage *invalmsgs;
int i;
+ TimestampTz lastXactTime = -1;
/*
* Validate the GID, and lock the GXACT to ensure that two backends do not
***************
*** 1404,1414 **** FinishPreparedTransaction(const char *gid, bool isCommit)
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
! hdr->initfileinval);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
! hdr->nabortrels, abortrels);
ProcArrayRemove(proc, latestXid);
--- 1407,1419 ----
hdr->nsubxacts, children,
hdr->ncommitrels, commitrels,
hdr->ninvalmsgs, invalmsgs,
! hdr->initfileinval,
! &lastXactTime);
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
! hdr->nabortrels, abortrels,
! &lastXactTime);
ProcArrayRemove(proc, latestXid);
***************
*** 1470,1475 **** FinishPreparedTransaction(const char *gid, bool isCommit)
--- 1475,1482 ----
/* Count the prepared xact as committed or aborted */
AtEOXact_PgStat(isCommit);
+ pgstat_report_xact_timestamp(-1, lastXactTime);
+
/*
* And now we can clean up our mess.
*/
***************
*** 2068,2074 **** RecordTransactionCommitPrepared(TransactionId xid,
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
! bool initfileinval)
{
XLogRecData rdata[4];
int lastrdata = 0;
--- 2075,2082 ----
RelFileNode *rels,
int ninvalmsgs,
SharedInvalidationMessage *invalmsgs,
! bool initfileinval,
! TimestampTz *lastXactTime)
{
XLogRecData rdata[4];
int lastrdata = 0;
***************
*** 2126,2131 **** RecordTransactionCommitPrepared(TransactionId xid,
--- 2134,2140 ----
rdata[lastrdata].next = NULL;
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_PREPARED, rdata);
+ *lastXactTime = xlrec.crec.xact_time;
/*
* We don't currently try to sleep before flush here ... nor is there any
***************
*** 2166,2172 **** RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
! RelFileNode *rels)
{
XLogRecData rdata[3];
int lastrdata = 0;
--- 2175,2182 ----
int nchildren,
TransactionId *children,
int nrels,
! RelFileNode *rels,
! TimestampTz *lastXactTime)
{
XLogRecData rdata[3];
int lastrdata = 0;
***************
*** 2212,2217 **** RecordTransactionAbortPrepared(TransactionId xid,
--- 2222,2228 ----
rdata[lastrdata].next = NULL;
recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT_PREPARED, rdata);
+ *lastXactTime = xlrec.arec.xact_time;
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 272,278 **** static void CleanupTransaction(void);
static void CheckTransactionChain(bool isTopLevel, bool throwError,
const char *stmtType);
static void CommitTransaction(void);
! static TransactionId RecordTransactionAbort(bool isSubXact);
static void StartTransaction(void);
static void StartSubTransaction(void);
--- 272,279 ----
static void CheckTransactionChain(bool isTopLevel, bool throwError,
const char *stmtType);
static void CommitTransaction(void);
! static TransactionId RecordTransactionAbort(bool isSubXact,
! TimestampTz *lastXactTime);
static void StartTransaction(void);
static void StartSubTransaction(void);
***************
*** 992,998 **** AtSubStart_ResourceOwner(void)
* if the xact has no XID. (We compute that here just because it's easier.)
*/
static TransactionId
! RecordTransactionCommit(void)
{
TransactionId xid = GetTopTransactionIdIfAny();
bool markXidCommitted = TransactionIdIsValid(xid);
--- 993,999 ----
* if the xact has no XID. (We compute that here just because it's easier.)
*/
static TransactionId
! RecordTransactionCommit(TimestampTz *lastXactTime)
{
TransactionId xid = GetTopTransactionIdIfAny();
bool markXidCommitted = TransactionIdIsValid(xid);
***************
*** 1138,1143 **** RecordTransactionCommit(void)
--- 1139,1145 ----
rdata[lastrdata].next = NULL;
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
+ *lastXactTime = xlrec.xact_time;
}
else
{
***************
*** 1162,1167 **** RecordTransactionCommit(void)
--- 1164,1170 ----
rdata[lastrdata].next = NULL;
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata);
+ *lastXactTime = xlrec.xact_time;
}
}
***************
*** 1426,1432 **** AtSubCommit_childXids(void)
* if the xact has no XID. (We compute that here just because it's easier.)
*/
static TransactionId
! RecordTransactionAbort(bool isSubXact)
{
TransactionId xid = GetCurrentTransactionIdIfAny();
TransactionId latestXid;
--- 1429,1435 ----
* if the xact has no XID. (We compute that here just because it's easier.)
*/
static TransactionId
! RecordTransactionAbort(bool isSubXact, TimestampTz *lastXactTime)
{
TransactionId xid = GetCurrentTransactionIdIfAny();
TransactionId latestXid;
***************
*** 1508,1513 **** RecordTransactionAbort(bool isSubXact)
--- 1511,1517 ----
rdata[lastrdata].next = NULL;
(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT, rdata);
+ *lastXactTime = xlrec.xact_time;
/*
* Report the latest async abort LSN, so that the WAL writer knows to
***************
*** 1818,1824 **** StartTransaction(void)
*/
xactStartTimestamp = stmtStartTimestamp;
xactStopTimestamp = 0;
! pgstat_report_xact_timestamp(xactStartTimestamp);
/*
* initialize current transaction state fields
--- 1822,1828 ----
*/
xactStartTimestamp = stmtStartTimestamp;
xactStopTimestamp = 0;
! pgstat_report_xact_timestamp(xactStartTimestamp, -1);
/*
* initialize current transaction state fields
***************
*** 1862,1867 **** CommitTransaction(void)
--- 1866,1872 ----
{
TransactionState s = CurrentTransactionState;
TransactionId latestXid;
+ TimestampTz lastXactTime = -1;
ShowTransactionState("CommitTransaction");
***************
*** 1945,1951 **** CommitTransaction(void)
/*
* Here is where we really truly commit.
*/
! latestXid = RecordTransactionCommit();
TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);
--- 1950,1956 ----
/*
* Here is where we really truly commit.
*/
! latestXid = RecordTransactionCommit(&lastXactTime);
TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);
***************
*** 2027,2033 **** CommitTransaction(void)
AtEOXact_HashTables(true);
AtEOXact_PgStat(true);
AtEOXact_Snapshot(true);
! pgstat_report_xact_timestamp(0);
CurrentResourceOwner = NULL;
ResourceOwnerDelete(TopTransactionResourceOwner);
--- 2032,2038 ----
AtEOXact_HashTables(true);
AtEOXact_PgStat(true);
AtEOXact_Snapshot(true);
! pgstat_report_xact_timestamp(0, lastXactTime);
CurrentResourceOwner = NULL;
ResourceOwnerDelete(TopTransactionResourceOwner);
***************
*** 2294,2300 **** PrepareTransaction(void)
AtEOXact_HashTables(true);
/* don't call AtEOXact_PgStat here; we fixed pgstat state above */
AtEOXact_Snapshot(true);
! pgstat_report_xact_timestamp(0);
CurrentResourceOwner = NULL;
ResourceOwnerDelete(TopTransactionResourceOwner);
--- 2299,2305 ----
AtEOXact_HashTables(true);
/* don't call AtEOXact_PgStat here; we fixed pgstat state above */
AtEOXact_Snapshot(true);
! pgstat_report_xact_timestamp(0, -1);
CurrentResourceOwner = NULL;
ResourceOwnerDelete(TopTransactionResourceOwner);
***************
*** 2330,2335 **** AbortTransaction(void)
--- 2335,2341 ----
{
TransactionState s = CurrentTransactionState;
TransactionId latestXid;
+ TimestampTz lastXactTime = -1;
/* Prevent cancel/die interrupt while cleaning up */
HOLD_INTERRUPTS();
***************
*** 2412,2418 **** AbortTransaction(void)
* Advertise the fact that we aborted in pg_clog (assuming that we got as
* far as assigning an XID to advertise).
*/
! latestXid = RecordTransactionAbort(false);
TRACE_POSTGRESQL_TRANSACTION_ABORT(MyProc->lxid);
--- 2418,2424 ----
* Advertise the fact that we aborted in pg_clog (assuming that we got as
* far as assigning an XID to advertise).
*/
! latestXid = RecordTransactionAbort(false, &lastXactTime);
TRACE_POSTGRESQL_TRANSACTION_ABORT(MyProc->lxid);
***************
*** 2457,2464 **** AbortTransaction(void)
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
AtEOXact_PgStat(false);
! pgstat_report_xact_timestamp(0);
}
/*
* State remains TRANS_ABORT until CleanupTransaction().
--- 2463,2472 ----
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
AtEOXact_PgStat(false);
! pgstat_report_xact_timestamp(0, lastXactTime);
}
+ else
+ pgstat_report_xact_timestamp(-1, lastXactTime);
/*
* State remains TRANS_ABORT until CleanupTransaction().
***************
*** 4277,4282 **** static void
--- 4285,4291 ----
AbortSubTransaction(void)
{
TransactionState s = CurrentTransactionState;
+ TimestampTz lastXactTime = -1;
/* Prevent cancel/die interrupt while cleaning up */
HOLD_INTERRUPTS();
***************
*** 4353,4359 **** AbortSubTransaction(void)
AtSubAbort_Notify();
/* Advertise the fact that we aborted in pg_clog. */
! (void) RecordTransactionAbort(true);
/* Post-abort cleanup */
if (TransactionIdIsValid(s->transactionId))
--- 4362,4368 ----
AtSubAbort_Notify();
/* Advertise the fact that we aborted in pg_clog. */
! (void) RecordTransactionAbort(true, &lastXactTime);
/* Post-abort cleanup */
if (TransactionIdIsValid(s->transactionId))
***************
*** 4387,4392 **** AbortSubTransaction(void)
--- 4396,4402 ----
AtEOSubXact_HashTables(false, s->nestingLevel);
AtEOSubXact_PgStat(false, s->nestingLevel);
AtSubAbort_Snapshot(s->nestingLevel);
+ pgstat_report_xact_timestamp(-1, lastXactTime);
}
/*
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 25,30 ****
--- 25,31 ----
#include "catalog/pg_type.h"
#include "funcapi.h"
#include "miscadmin.h"
+ #include "pgstat.h"
#include "replication/walreceiver.h"
#include "storage/smgr.h"
#include "utils/builtins.h"
***************
*** 206,211 **** pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
--- 207,247 ----
}
/*
+ * Returns timestamp of latest inserted commit/abort record.
+ *
+ * If there has been no transaction committed or aborted yet since
+ * the server has started, this function returns NULL.
+ */
+ Datum
+ pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS)
+ {
+ TimestampTz result = 0;
+ TimestampTz xtime;
+ LocalPgBackendStatus *local_beentry;
+ int i;
+
+ if (RecoveryInProgress())
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("recovery is in progress"),
+ errhint("WAL control functions cannot be executed during recovery.")));
+
+ local_beentry = pgstat_fetch_stat_all_local_beentry();
+
+ for (i = 0; i < MaxBackends; i++, local_beentry++)
+ {
+ xtime = local_beentry->backendStatus.st_xact_end_timestamp;
+ if (result < xtime)
+ result = xtime;
+ }
+
+ if (result == 0)
+ PG_RETURN_NULL();
+
+ PG_RETURN_TIMESTAMPTZ(result);
+ }
+
+ /*
* Report the last WAL receive location (same format as pg_start_backup etc)
*
* This is useful for determining how much of WAL is guaranteed to be received
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 262,268 **** static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanen
static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
! static void pgstat_read_current_status(void);
static bool pgstat_write_statsfile_needed(void);
static bool pgstat_db_requested(Oid databaseid);
--- 262,268 ----
static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
! static void pgstat_read_current_status(bool all);
static bool pgstat_write_statsfile_needed(void);
static bool pgstat_db_requested(Oid databaseid);
***************
*** 2298,2304 **** pgstat_fetch_stat_funcentry(Oid func_id)
PgBackendStatus *
pgstat_fetch_stat_beentry(int beid)
{
! pgstat_read_current_status();
if (beid < 1 || beid > localNumBackends)
return NULL;
--- 2298,2304 ----
PgBackendStatus *
pgstat_fetch_stat_beentry(int beid)
{
! pgstat_read_current_status(false);
if (beid < 1 || beid > localNumBackends)
return NULL;
***************
*** 2320,2326 **** pgstat_fetch_stat_beentry(int beid)
LocalPgBackendStatus *
pgstat_fetch_stat_local_beentry(int beid)
{
! pgstat_read_current_status();
if (beid < 1 || beid > localNumBackends)
return NULL;
--- 2320,2326 ----
LocalPgBackendStatus *
pgstat_fetch_stat_local_beentry(int beid)
{
! pgstat_read_current_status(false);
if (beid < 1 || beid > localNumBackends)
return NULL;
***************
*** 2330,2335 **** pgstat_fetch_stat_local_beentry(int beid)
--- 2330,2354 ----
/* ----------
+ * pgstat_fetch_stat_all_local_beentry() -
+ *
+ * Support function for the SQL-callable pgstat* functions. Returns
+ * our local copy of all backend entries.
+ *
+ * NB: caller is responsible for a check if the user is permitted to see
+ * this info (especially the querystring).
+ * ----------
+ */
+ LocalPgBackendStatus *
+ pgstat_fetch_stat_all_local_beentry(void)
+ {
+ pgstat_read_current_status(false);
+
+ return localBackendStatusTable;
+ }
+
+
+ /* ----------
* pgstat_fetch_stat_numbackends() -
*
* Support function for the SQL-callable pgstat* functions. Returns
***************
*** 2339,2349 **** pgstat_fetch_stat_local_beentry(int beid)
int
pgstat_fetch_stat_numbackends(void)
{
! pgstat_read_current_status();
return localNumBackends;
}
/*
* ---------
* pgstat_fetch_stat_archiver() -
--- 2358,2369 ----
int
pgstat_fetch_stat_numbackends(void)
{
! pgstat_read_current_status(false);
return localNumBackends;
}
+
/*
* ---------
* pgstat_fetch_stat_archiver() -
***************
*** 2588,2593 **** pgstat_bestart(void)
--- 2608,2618 ----
beentry->st_appname[NAMEDATALEN - 1] = '\0';
beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
+ /*
+ * Don't reset st_xact_end_timestamp because the previous value can still
+ * be referenced to calculate the latest transaction insert timestamp.
+ */
+
beentry->st_changecount++;
Assert((beentry->st_changecount & 1) == 0);
***************
*** 2744,2754 **** pgstat_report_appname(const char *appname)
}
/*
! * Report current transaction start timestamp as the specified value.
! * Zero means there is no active transaction.
*/
void
! pgstat_report_xact_timestamp(TimestampTz tstamp)
{
volatile PgBackendStatus *beentry = MyBEEntry;
--- 2769,2780 ----
}
/*
! * Report current transaction start/end timestamp as the specified value.
! * Reporting transaction start/end timestamp as zero means there is no
! * active/finished transaction.
*/
void
! pgstat_report_xact_timestamp(TimestampTz startTime, TimestampTz endTime)
{
volatile PgBackendStatus *beentry = MyBEEntry;
***************
*** 2761,2767 **** pgstat_report_xact_timestamp(TimestampTz tstamp)
* ensure the compiler doesn't try to get cute.
*/
beentry->st_changecount++;
! beentry->st_xact_start_timestamp = tstamp;
beentry->st_changecount++;
Assert((beentry->st_changecount & 1) == 0);
}
--- 2787,2796 ----
* ensure the compiler doesn't try to get cute.
*/
beentry->st_changecount++;
! if (startTime >= 0)
! beentry->st_xact_start_timestamp = startTime;
! if (endTime >= 0)
! beentry->st_xact_end_timestamp = endTime;
beentry->st_changecount++;
Assert((beentry->st_changecount & 1) == 0);
}
***************
*** 2796,2806 **** pgstat_report_waiting(bool waiting)
* pgstat_read_current_status() -
*
* Copy the current contents of the PgBackendStatus array to local memory,
! * if not already done in this transaction.
* ----------
*/
static void
! pgstat_read_current_status(void)
{
volatile PgBackendStatus *beentry;
LocalPgBackendStatus *localtable;
--- 2825,2836 ----
* pgstat_read_current_status() -
*
* Copy the current contents of the PgBackendStatus array to local memory,
! * if not already done in this transaction. If all is true, the local
! * array includes all entries. Otherwise, it includes only valid ones.
* ----------
*/
static void
! pgstat_read_current_status(bool all)
{
volatile PgBackendStatus *beentry;
LocalPgBackendStatus *localtable;
***************
*** 2842,2848 **** pgstat_read_current_status(void)
int save_changecount = beentry->st_changecount;
localentry->backendStatus.st_procpid = beentry->st_procpid;
! if (localentry->backendStatus.st_procpid > 0)
{
memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
--- 2872,2878 ----
int save_changecount = beentry->st_changecount;
localentry->backendStatus.st_procpid = beentry->st_procpid;
! if (localentry->backendStatus.st_procpid > 0 || all)
{
memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
***************
*** 2865,2872 **** pgstat_read_current_status(void)
}
beentry++;
! /* Only valid entries get included into the local array */
! if (localentry->backendStatus.st_procpid > 0)
{
BackendIdGetTransactionIds(i,
&localentry->backend_xid,
--- 2895,2902 ----
}
beentry++;
! /* Only valid entries get included into the local array if all is false */
! if (localentry->backendStatus.st_procpid > 0 || all)
{
BackendIdGetTransactionIds(i,
&localentry->backend_xid,
*** a/src/include/access/xlog_fn.h
--- b/src/include/access/xlog_fn.h
***************
*** 21,26 **** extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS);
--- 21,27 ----
extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_receive_location(PG_FUNCTION_ARGS);
extern Datum pg_last_xlog_replay_location(PG_FUNCTION_ARGS);
+ extern Datum pg_last_xact_insert_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_last_xact_replay_timestamp(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS);
extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS);
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 3038,3043 **** DATA(insert OID = 2850 ( pg_xlogfile_name_offset PGNSP PGUID 12 1 0 0 0 f f f f
--- 3038,3045 ----
DESCR("xlog filename and byte offset, given an xlog location");
DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 25 "3220" _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ ));
DESCR("xlog filename, given an xlog location");
+ DATA(insert OID = 3218 ( pg_last_xact_insert_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 1184 "" _null_ _null_ _null_ _null_ pg_last_xact_insert_timestamp _null_ _null_ _null_ ));
+ DESCR("timestamp of last insert xact");
DATA(insert OID = 3165 ( pg_xlog_location_diff PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 1700 "3220 3220" _null_ _null_ _null_ _null_ pg_xlog_location_diff _null_ _null_ _null_ ));
DESCR("difference in bytes, given two xlog locations");
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 726,731 **** typedef struct PgBackendStatus
--- 726,734 ----
TimestampTz st_activity_start_timestamp;
TimestampTz st_state_start_timestamp;
+ /* Time when last transaction ended */
+ TimestampTz st_xact_end_timestamp;
+
/* Database OID, owning user's OID, connection client address */
Oid st_databaseid;
Oid st_userid;
***************
*** 860,866 **** extern void pgstat_bestart(void);
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
! extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
extern void pgstat_report_waiting(bool waiting);
extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
--- 863,871 ----
extern void pgstat_report_activity(BackendState state, const char *cmd_str);
extern void pgstat_report_tempfile(size_t filesize);
extern void pgstat_report_appname(const char *appname);
! extern void pgstat_report_xact_timestamp(TimestampTz startTime,
! TimestampTz endTime);
! extern void pgstat_report_xact_end_timestamp(TimestampTz tstamp);
extern void pgstat_report_waiting(bool waiting);
extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
***************
*** 946,951 **** extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
--- 951,957 ----
extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+ extern LocalPgBackendStatus *pgstat_fetch_stat_all_local_beentry(void);
extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
extern int pgstat_fetch_stat_numbackends(void);
extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
On Wed, Aug 13, 2014 at 12:41:24PM +0900, Fujii Masao wrote:
On Mon, Aug 11, 2014 at 8:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Aug 11, 2014 at 4:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
On 2011-10-04 20:52:59 +0900, Fujii Masao wrote:
*** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *************** *** 1066,1071 **** RecordTransactionCommit(void) --- 1066,1074 ----(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata); } + + /* Save timestamp of latest transaction commit record */ + pgstat_report_xact_end_timestamp(xactStopTimestamp); }Perhaps that pgstat_report() should instead be combined with the
pgstat_report_xact_timestamp(0) in CommitTransaction()? Then the number
of changecount increases and cacheline references would stay the
same. The only thing that'd change would be a single additional
assignment.Sounds good suggestion.
I attached the updated version of the patch. I changed pgstat_report_xx
functions like Andres suggested.While reading the patch again, I found it didn't handle the COMMIT/ABORT
PREPARED case properly. According to the commit e74e090, now
pg_last_xact_replay_timestamp() returns the timestamp of COMMIT/ABORT PREPARED.
pg_last_xact_insert_timestamp() is mainly expected to be used to calculate
the replication delay, so it also needs to return that timestam. But the patch
didn't change 2PC code at all. We need to add pgstat_report_xact_end_timestamp()
into FinishPreparedTransaction(), RecordTransactionCommitPrepared() or
RecordTransactionAbortPrepared().Done.
Is this going to be applied?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers