Monitoring time of fsyncing WALs
I wonder why we are monitoring the time spent writing WAL, but not the
time spent fsyncing WAL segments?
Is there a principled reason for this, or is it just that nobody has added
it yet?
If the latter, please find attached a very small patch adding a
WAIT_EVENT_WAL_FSYNC event type.
Our engineers at PgPro complain to me that there is no information about
the time spent syncing WAL...
Unfortunately Postgres is still not able to aggregate this statistic itself.
But at least we have the pg_wait_sampling extension for that:
https://github.com/postgrespro/pg_wait_sampling
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment: wait_event_wal_fsync.patch (text/x-patch)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c2adb22..ef0addd 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1682,6 +1682,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry><literal>WALWrite</literal></entry>
<entry>Waiting for a write to a WAL file.</entry>
</row>
+ <row>
+ <entry><literal>WALFsync</literal></entry>
+ <entry>Waiting for a fsync for WAL segment.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1a419aa..0956655 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10156,6 +10156,7 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
+ pgstat_report_wait_start(WAIT_EVENT_WAL_FSYNC);
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
@@ -10191,6 +10192,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
break;
}
+ pgstat_report_wait_end();
}
/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e..adbf6af 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3931,6 +3931,9 @@ pgstat_get_wait_io(WaitEventIO w)
case WAIT_EVENT_WAL_WRITE:
event_name = "WALWrite";
break;
+ case WAIT_EVENT_WAL_FSYNC:
+ event_name = "WALFsync";
+ break;
/* no default case, so that compiler will warn */
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..b76074b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -922,7 +922,8 @@ typedef enum
WAIT_EVENT_WAL_INIT_WRITE,
WAIT_EVENT_WAL_READ,
WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
- WAIT_EVENT_WAL_WRITE
+ WAIT_EVENT_WAL_WRITE,
+ WAIT_EVENT_WAL_FSYNC
} WaitEventIO;
/* ----------
On 6/27/18 12:32 PM, Konstantin Knizhnik wrote:
I wonder why we are monitoring the time spent writing WAL, but not the
time spent fsyncing WAL segments?
Is there a principled reason for this, or is it just that nobody has added
it yet?
If the latter, please find attached a very small patch adding a
WAIT_EVENT_WAL_FSYNC event type.
+1.
--
-David
david@pgmasters.net
On Wed, Jun 27, 2018 at 07:32:18PM +0300, Konstantin Knizhnik wrote:
I wonder why we are monitoring the time spent writing WAL, but not the
time spent fsyncing WAL segments?
Is there a principled reason for this, or is it just that nobody has added
it yet?
If the latter, please find attached a very small patch adding a
WAIT_EVENT_WAL_FSYNC event type.
Let's name it WAIT_EVENT_WAL_SYNC as it is more consistent with the
other wait events of the same type, and also list the wait event
alphabetically everywhere it is added. I have also reworded the
documentation to be more consistent.
Our engineers at PgPro complain to me that there is no information about
the time spent syncing WAL...
Unfortunately Postgres is still not able to aggregate this statistic itself.
But at least we have the pg_wait_sampling extension for that:
https://github.com/postgrespro/pg_wait_sampling
Complaint justified. It is a bit too late for v11 I think though, so
let's wait for v12 to open for business, and then I'll apply the patch
if there are no objections by then.
Attached is an updated patch.
--
Michael
Attachment: wait_event_wal_fsync-v2.patch (text/x-diff)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c2adb22dff..36d393d329 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1674,6 +1674,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry><literal>WALSenderTimelineHistoryRead</literal></entry>
<entry>Waiting for a read from a timeline history file during walsender timeline command.</entry>
</row>
+ <row>
+ <entry><literal>WALSync</literal></entry>
+ <entry>Waiting for a WAL file to reach stable storage.</entry>
+ </row>
<row>
<entry><literal>WALSyncMethodAssign</literal></entry>
<entry>Waiting for data to reach stable storage while assigning WAL sync method.</entry>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1a419aa49b..6aae426744 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10156,6 +10156,7 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
+ pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
@@ -10191,6 +10192,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
break;
}
+ pgstat_report_wait_end();
}
/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c..bbe73618c7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3925,6 +3925,9 @@ pgstat_get_wait_io(WaitEventIO w)
case WAIT_EVENT_WAL_READ:
event_name = "WALRead";
break;
+ case WAIT_EVENT_WAL_SYNC:
+ event_name = "WALSync";
+ break;
case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
event_name = "WALSyncMethodAssign";
break;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239b..d59c24ae23 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -921,6 +921,7 @@ typedef enum
WAIT_EVENT_WAL_INIT_SYNC,
WAIT_EVENT_WAL_INIT_WRITE,
WAIT_EVENT_WAL_READ,
+ WAIT_EVENT_WAL_SYNC,
WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
WAIT_EVENT_WAL_WRITE
} WaitEventIO;
On Wed, Jun 27, 2018 at 10:27 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 27, 2018 at 07:32:18PM +0300, Konstantin Knizhnik wrote:
I wonder why we are monitoring the time spent writing WAL, but not the
time spent fsyncing WAL segments?
Is there a principled reason for this, or is it just that nobody has added
it yet?
If the latter, please find attached a very small patch adding a
WAIT_EVENT_WAL_FSYNC event type.
Let's name it WAIT_EVENT_WAL_SYNC as it is more consistent with the
other wait events of the same type, and also list the wait event
alphabetically everywhere it is added. I have also reworded the
documentation to be more consistent.
Our engineers at PgPro complain to me that there is no information about
the time spent syncing WAL...
Unfortunately Postgres is still not able to aggregate this statistic itself.
But at least we have the pg_wait_sampling extension for that:
https://github.com/postgrespro/pg_wait_sampling
Complaint justified. It is a bit too late for v11 I think though, so
let's wait for v12 to open for business, and then I'll apply the patch
if there are no objections by then.
Are there other instances of fsync() that also need to be covered?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 29.06.2018 15:48, Robert Haas wrote:
On Wed, Jun 27, 2018 at 10:27 PM, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 27, 2018 at 07:32:18PM +0300, Konstantin Knizhnik wrote:
I wonder why we are monitoring the time spent writing WAL, but not the
time spent fsyncing WAL segments?
Is there a principled reason for this, or is it just that nobody has added
it yet?
If the latter, please find attached a very small patch adding a
WAIT_EVENT_WAL_FSYNC event type.
Let's name it WAIT_EVENT_WAL_SYNC as it is more consistent with the
other wait events of the same type, and also list the wait event
alphabetically everywhere it is added. I have also reworded the
documentation to be more consistent.
Our engineers at PgPro complain to me that there is no information about
the time spent syncing WAL...
Unfortunately Postgres is still not able to aggregate this statistic itself.
But at least we have the pg_wait_sampling extension for that:
https://github.com/postgrespro/pg_wait_sampling
Complaint justified. It is a bit too late for v11 I think though, so
let's wait for v12 to open for business, and then I'll apply the patch
if there are no objections by then.
Are there other instances of fsync() that also need to be covered?
Syncing database files is already monitored (in mdsync and FileFlush).
If we grep for pg_fsync occurrences in the Postgres code, the only
place where pgstat_report_wait_start is not called beforehand is
issue_xlog_fsync in xlog.c.
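For reference, the convention at the existing call sites looks roughly
like this (just a sketch of the pattern, not the exact fd.c code):

    /* e.g. what happens for a data file sync via FileSync() */
    pgstat_report_wait_start(WAIT_EVENT_DATA_FILE_SYNC);
    returnCode = pg_fsync(fd);
    pgstat_report_wait_end();

issue_xlog_fsync is the only pg_fsync caller lacking that bracketing.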
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Jun 29, 2018 at 08:48:33AM -0400, Robert Haas wrote:
Are there other instances of fsync() that also need to be covered?
Yeah, I double-checked the calls to pg_fsync back when I looked at this
patch, but now that I look again I see at least two more of them:
- fsync_fname_ext.
- write_auto_conf_file, where a wait event for a write() is missing as
well.
Hm. I am wondering if it would not be worth extending pg_fsync() with
an argument for a wait event, and introducing a sort of pg_write() which
wraps write() with an extra wait event argument, and similarly for
read() (warning, conflict with Windows ahead!). That's somewhat related
to the feeling I had when working with transient file writes and reads
a couple of days back. Those are most likely going to be forgotten
again and again in the future. In both cases the caller would still be
responsible for looking at errno and deciding the error handling, but I
got no feedback about the idea on transient files.
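Roughly, what I have in mind would look like this (only a sketch; the
*_we names are invented for illustration, and errno handling stays with
the caller as today):

    /* Hypothetical wrappers, sketched only to illustrate the idea. */
    static int
    pg_fsync_we(int fd, uint32 wait_event_info)
    {
        int         rc;

        pgstat_report_wait_start(wait_event_info);
        rc = pg_fsync(fd);
        pgstat_report_wait_end();
        return rc;              /* caller inspects errno on failure */
    }

    static ssize_t
    pg_write_we(int fd, const void *buf, size_t count, uint32 wait_event_info)
    {
        ssize_t     rc;

        pgstat_report_wait_start(wait_event_info);
        rc = write(fd, buf, count);
        pgstat_report_wait_end();
        return rc;              /* same: caller checks errno */
    }

Callers would then be unable to forget the wait event, since it becomes
part of the function signature.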
There are also a couple of places where the wait events last longer than
they should. For example, in writeTimeLineHistory there is a wait event
for write() which is still switched on even within an error code path.
And on top of that, I think that a call to pgstat_report_wait_end() is
missing in the error code path, as on ERROR the session still exists.
That's a bug. Those need an extra lookup and apparently some cleanup back
to v10.
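To be explicit, the write() there should be bracketed like this
(simplified sketch, not the actual writeTimeLineHistory code; the real
thing also needs to fix up errno for %m), so that the wait event is
ended before any ereport() can run:

    pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_WRITE);
    nbytes = write(fd, buffer, nread);
    pgstat_report_wait_end();   /* end the wait event before erroring out */

    if (nbytes != nread)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write to file \"%s\": %m", tmppath)));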
--
Michael
On Fri, Jun 29, 2018 at 10:39:00PM +0900, Michael Paquier wrote:
On Fri, Jun 29, 2018 at 08:48:33AM -0400, Robert Haas wrote:
Are there other instances of fsync() that also need to be covered?
Yeah, I double-checked the calls to pg_fsync back when I looked at this
patch, but now that I look again I see at least two more of them:
- fsync_fname_ext.
This one is used internally by things like durable_rename and such, and
I am not sure that we can correctly pass down a wait event there, as
multiple syncs may happen within a single call, so that would require a
weird API layer.
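For instance, a single durable_rename() hides something like the
following sequence (rough sketch from memory, error handling omitted),
so one wait event passed in from the caller would have to cover several
distinct syncs at once:

    fsync_fname_ext(oldfile, false, false, elevel); /* flush the old file first */
    rename(oldfile, newfile);
    fsync_fname_ext(newfile, false, false, elevel); /* flush it under its new name */
    fsync_parent_path(newfile, elevel);             /* make the directory entry durable */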
- write_auto_conf_file, where a wait event for a write() is missing as
well.
I have been looking at the archives and this has been left out on
purpose:
/messages/by-id/CAGPqQf0bzVfTTZdxcr4qHb3WDbn=+iH6sWchbN4HGjhwtcbPYQ@mail.gmail.com
Hm. I am wondering if it would not be worth extending pg_fsync() with
an argument for a wait event, and introducing a sort of pg_write() which
wraps write() with an extra wait event argument, and similarly for
read() (warning, conflict with Windows ahead!). That's somewhat related
to the feeling I had when working with transient file writes and reads
a couple of days back. Those are most likely going to be forgotten
again and again in the future. In both cases the caller would still be
responsible for looking at errno and deciding the error handling, but I
got no feedback about the idea on transient files.
That would end up being rather ugly, as durable_rename and friends would
also need some treatment. That would be quite invasive.
There are also a couple of places where the wait events last longer than
they should. For example, in writeTimeLineHistory there is a wait event
for write() which is still switched on even within an error code path.
And on top of that, I think that a call to pgstat_report_wait_end() is
missing in the error code path, as on ERROR the session still exists.
That's a bug. Those need an extra lookup and apparently some cleanup back
to v10.
On ERROR a backend would quickly switch back to ClientRead. Perhaps
we'd want to call pgstat_report_wait_end() when an error of level ERROR
or higher is thrown? That's a different discussion.
So at the end, I would like to use the proposed patch and call it a
day. Thoughts?
--
Michael
On 1 July 2018 at 11:29, Michael Paquier <michael@paquier.xyz> wrote:
So at the end, I would like to use the proposed patch and call it a
day. Thoughts?
Patch looks good.
I'll hijack the thread to add a few more perf/dtrace tracepoints in the WAL
code, as they were also missing. Proposed rider patch attached.
I've updated
https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events
to document how to add tracepoints.
It's not that hard to trace the duration of a given function call with
dtrace without such annotations, but they make it much easier to discover
where in a large codebase to look, providing a form of documentation. With
perf they make life much easier. I should add some more to make it easier
to analyse relation extension contention on indexes and the heap,
instrument btree index ops like page splits, instrument heavyweight
locking (beyond LOCK_WAIT_START), etc. They're also handy for gdb
(https://sourceware.org/gdb/onlinedocs/gdb/Static-Probe-Points.html),
e.g. via gdb's "info probes" command.
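For anyone who hasn't used these before: a probe declared in probes.d
becomes a TRACE_POSTGRESQL_* macro on the C side, with a matching
*_ENABLED() test that can guard argument computation. Roughly (sketch
only, using one of the probes from the attached patch):

    /* probes.d:  probe wal__flush__start(XLogRecPtr); */

    /* C side: fire the probe, skipping setup when nobody is listening */
    if (TRACE_POSTGRESQL_WAL_FLUSH_START_ENABLED())
        TRACE_POSTGRESQL_WAL_FLUSH_START(record);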
BTW, we might want to instrument the pgstat_ counter calls
and pgstat_report_wait_start / pgstat_report_wait_end, but it's easy enough
to use dynamic tracepoints for those so I haven't yet. Maybe even just
document them as points of interest.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: v1-0001-Add-more-statically-defined-tracepoints-around-xl.patch (text/x-patch)
From 74cfc819e2c4323613191bcbf6c91f128617c353 Mon Sep 17 00:00:00 2001
From: Craig Ringer <craig@2ndquadrant.com>
Date: Mon, 2 Jul 2018 11:33:46 +0800
Subject: [PATCH v1] Add more statically defined tracepoints around xlog
handling
New tracepoints are added to allow for tracing of WAL flushes and
segment initialization/recycling.
The existing wal_insert tracepoint is renamed to wal_insert_start
and paired with a wal_insert_done, so that insert durations
may be recorded including fsync time.
---
src/backend/access/transam/xlog.c | 14 ++++++++++++++
src/backend/access/transam/xloginsert.c | 4 +++-
src/backend/utils/probes.d | 18 +++++++++++++++++-
3 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dcfef36591..b3dead5ac0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -955,6 +955,8 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
* (LSN is the XLOG point up to which the XLOG must be flushed to disk
* before the data page can be written out. This implements the basic
* WAL rule "write the log before the data".)
+ *
+ * You want the WAL_INSERT tracepoints if you're looking for activity here.
*/
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata,
@@ -2810,6 +2812,8 @@ XLogFlush(XLogRecPtr record)
(uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
#endif
+ TRACE_POSTGRESQL_WAL_FLUSH_START(record);
+
START_CRIT_SECTION();
/*
@@ -2942,6 +2946,8 @@ XLogFlush(XLogRecPtr record)
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
(uint32) (record >> 32), (uint32) record,
(uint32) (LogwrtResult.Flush >> 32), (uint32) LogwrtResult.Flush);
+
+ TRACE_POSTGRESQL_WAL_FLUSH_DONE();
}
/*
@@ -3190,6 +3196,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
+ TRACE_POSTGRESQL_WAL_SEG_INIT_START();
+
/*
* Try to use existent file (checkpoint maker may have created it already)
*/
@@ -3204,7 +3212,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
errmsg("could not open file \"%s\": %m", path)));
}
else
+ {
+ TRACE_POSTGRESQL_WAL_SEG_INIT_DONE(*use_existent);
return fd;
+ }
}
/*
@@ -3327,6 +3338,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
elog(DEBUG2, "done creating and filling new WAL file");
+ TRACE_POSTGRESQL_WAL_SEG_INIT_DONE(*use_existent);
return fd;
}
@@ -10156,6 +10168,7 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
+ TRACE_POSTGRESQL_WAL_SYNC_START(fd, segno);
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
@@ -10191,6 +10204,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
break;
}
+ TRACE_POSTGRESQL_WAL_SYNC_DONE();
}
/*
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 5bea073a2b..8c9b57b0da 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -429,7 +429,7 @@ XLogInsert(RmgrId rmid, uint8 info)
XLR_CHECK_CONSISTENCY)) != 0)
elog(PANIC, "invalid xlog info mask %02X", info);
- TRACE_POSTGRESQL_WAL_INSERT(rmid, info);
+ TRACE_POSTGRESQL_WAL_INSERT_START(rmid, info);
/*
* In bootstrap mode, we don't actually log anything but XLOG resources;
@@ -464,6 +464,8 @@ XLogInsert(RmgrId rmid, uint8 info)
XLogResetInsertion();
+ TRACE_POSTGRESQL_WAL_INSERT_DONE(EndPos);
+
return EndPos;
}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index ad06e8e2ea..2be68d99d6 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -21,6 +21,14 @@
#define Oid unsigned int
#define ForkNumber int
#define bool char
+/*
+ * SystemTap doesn't like 'unsigned long long int', but except on Windows 'long
+ * int' is 64-bit. Windows is LLP64 and will treat it as 32-bit but our stub
+ * macros don't care about typing, and dtrace/stap isn't supported on Windows.
+ * Windows has Event Tracing (ETW) and TraceLogging, but we don't support it right
+ * now, so we'll deal with it if it comes up.
+ */
+#define XLogRecPtr unsigned long int
provider postgresql {
@@ -75,6 +83,7 @@ provider postgresql {
probe checkpoint__done(int, int, int, int, int);
probe clog__checkpoint__start(bool);
probe clog__checkpoint__done(bool);
+
probe subtrans__checkpoint__start(bool);
probe subtrans__checkpoint__done(bool);
probe multixact__checkpoint__start(bool);
@@ -87,7 +96,14 @@ provider postgresql {
probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
- probe wal__insert(unsigned char, unsigned char);
+ probe wal__insert__start(unsigned char, unsigned char);
+ probe wal__insert__done(XLogRecPtr);
+ probe wal__seg__init__start();
+ probe wal__seg__init__done(bool);
+ probe wal__flush__start(XLogRecPtr);
+ probe wal__flush__done();
+ probe wal__sync__start(int, int);
+ probe wal__sync__done();
probe wal__switch();
probe wal__buffer__write__dirty__start();
probe wal__buffer__write__dirty__done();
--
2.14.3
On Mon, Jul 02, 2018 at 11:36:06AM +0800, Craig Ringer wrote:
On 1 July 2018 at 11:29, Michael Paquier <michael@paquier.xyz> wrote:
So at the end, I would like to use the proposed patch and call it a
day. Thoughts?
Patch looks good.
Thanks, Craig, for the review! I have just pushed the previous patch with
the adjustments mentioned. Regarding enforcing a wait event in pg_fsync(),
I think I'll step back on that one. There are also arguments for allowing
code paths not to have wait events, as that's quite low-level, and for
extensions the choice is limited as long as there is no way to register
custom wait event names (if that ever happens, which I am not sure of).
I'll hijack the thread to add a few more perf/dtrace tracepoints in the WAL
code, as they were also missing. Proposed rider patch attached.
:)
As that is a separate discussion, and as the commit fest has already
begun with many items in the queue, could you start a new thread and add
this stuff as a new CF entry? I would say that having more tracepoints
in this area could be helpful, but let's discuss that separately so that
your proposal catches proper attention.
BTW, we might want to instrument the pgstat_ counter calls
and pgstat_report_wait_start / pgstat_report_wait_end, but it's easy enough
to use dynamic tracepoints for those so I haven't yet. Maybe even just
document them as points of interest.
Perhaps. Feel free to propose anything you have in mind of course.
--
Michael