Monitoring time of fsyncing WALs

Started by Konstantin Knizhnikalmost 8 years ago9 messageshackers

k.knizhnik@postgrespro.ru

almost 8 years ago

I wonder why we are monitoring time of writing to WAL, but not time of
fsyncing WAL segments?
Is there are principle reason for it or just because nobody added it yet?
If so, please find very small patch which adding WAIT_EVENT_WAL_FSYNC
event type.

Our engineers in PgPro complain me that there is no information about
time spent in syncing WALs...
Unfortunately Postgres still is not able to aggregate this statistic.
But at least we have pg_wait_sampling extension for it:
https://github.com/postgrespro/pg_wait_sampling

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

David Steele

david@pgmasters.net

almost 8 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Monitoring time of fsyncing WALs

On 6/27/18 12:32 PM, Konstantin Knizhnik wrote:

I wonder why we are monitoring time of writing to WAL, but not time of
fsyncing WAL segments?
Is there are principle reason for it or just because nobody added it yet?
If so, please find very small patch which adding WAIT_EVENT_WAL_FSYNC
event type.

+1.

--
-David
david@pgmasters.net

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Monitoring time of fsyncing WALs

On Wed, Jun 27, 2018 at 07:32:18PM +0300, Konstantin Knizhnik wrote:

I wonder why we are monitoring time of writing to WAL, but not time of
fsyncing WAL segments?
Is there are principle reason for it or just because nobody added it yet?
If so, please find very small patch which adding WAIT_EVENT_WAL_FSYNC event
type.

Let's name it WAIT_EVENT_WAL_SYNC as it is more consistent with the
other wait events of the same type, and also list the wait event
alphabetically everywhere this is added. I have also reworded the
documentation to be more consistent.

Our engineers in PgPro complain me that there is no information about time
spent in syncing WALs...
Unfortunately Postgres still is not able to aggregate this statistic. But at
least we have pg_wait_sampling extension for it:
https://github.com/postgrespro/pg_wait_sampling

Complain justified. It is a bit too late for v11 I think though, so
let's wait for v12 to open for business, and then I'll apply the patch
at if there are no objections until then.

Attached is an updated patch.
--
Michael

Robert Haas

robertmhaas@gmail.com

almost 8 years ago

In reply to: Michael Paquier (#3)

Re: Monitoring time of fsyncing WALs

On Wed, Jun 27, 2018 at 10:27 PM, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jun 27, 2018 at 07:32:18PM +0300, Konstantin Knizhnik wrote:

I wonder why we are monitoring time of writing to WAL, but not time of
fsyncing WAL segments?
Is there are principle reason for it or just because nobody added it yet?
If so, please find very small patch which adding WAIT_EVENT_WAL_FSYNC event
type.

Let's name it WAIT_EVENT_WAL_SYNC as it is more consistent with the
other wait events of the same type, and also list the wait event
alphabetically everywhere this is added. I have also reworded the
documentation to be more consistent.

Our engineers in PgPro complain me that there is no information about time
spent in syncing WALs...
Unfortunately Postgres still is not able to aggregate this statistic. But at
least we have pg_wait_sampling extension for it:
https://github.com/postgrespro/pg_wait_sampling

Complain justified. It is a bit too late for v11 I think though, so
let's wait for v12 to open for business, and then I'll apply the patch
at if there are no objections until then.

Are there other instances of fsync() that also need to be covered?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

almost 8 years ago

In reply to: Robert Haas (#4)

Re: Monitoring time of fsyncing WALs

On 29.06.2018 15:48, Robert Haas wrote:

On Wed, Jun 27, 2018 at 10:27 PM, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jun 27, 2018 at 07:32:18PM +0300, Konstantin Knizhnik wrote:

I wonder why we are monitoring time of writing to WAL, but not time of
fsyncing WAL segments?
Is there are principle reason for it or just because nobody added it yet?
If so, please find very small patch which adding WAIT_EVENT_WAL_FSYNC event
type.

Let's name it WAIT_EVENT_WAL_SYNC as it is more consistent with the
other wait events of the same type, and also list the wait event
alphabetically everywhere this is added. I have also reworded the
documentation to be more consistent.

Our engineers in PgPro complain me that there is no information about time
spent in syncing WALs...
Unfortunately Postgres still is not able to aggregate this statistic. But at
least we have pg_wait_sampling extension for it:
https://github.com/postgrespro/pg_wait_sampling

Complain justified. It is a bit too late for v11 I think though, so
let's wait for v12 to open for business, and then I'll apply the patch
at if there are no objections until then.

Are there other instances of fsync() that also need to be covered?

Syncing database files is already monitored (in mdsync and FileFlush).
If we grep for pg_fsync occurrences in Postgres code, then the only
place where pgstat_report_wait_start is not called before is
issue_xlog_fsync in xlog.c

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Robert Haas (#4)

Re: Monitoring time of fsyncing WALs

On Fri, Jun 29, 2018 at 08:48:33AM -0400, Robert Haas wrote:

Are there other instances of fsync() that also need to be covered?

Yeah, I double-checked the calls to pg_fsync back when I looked at this
patch, but now that I look again I see at least two more of them:
- fsync_fname_ext.
- write_auto_conf_file, where a wait event for a write() is missing as
well.
Hm. I am wondering if it would not be worth extending pg_fsync() with
an argument for a wait event and introduce a sort of pg_write() which
wraps write() with an extra wait event argument, and similarly for
read() (warning, conflict with Windows ahead!). That's somehow related
to the feeling I had when working with transient file's writes and reads
a couple of days back. Those are most likely going to be forgotten
again and again in the future. In both cases the caller would still be
responsible for looking at errno and decide the error handling, but I
got no feedback about the idea on transient files.

There are also a couple of places where the wait events are longer than
they should. For example in writeTimeLineHistory, there is a wait event
for write() which is still switched on even within an error code path.
And on top of it I think that a call to pgstat_report_wait_end() is
missing in the error code path as on ERROR the session still exists.
That's a bug. Those need an extra lookup and visibly some cleanup back
to v10.
--
Michael

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Michael Paquier (#6)

Re: Monitoring time of fsyncing WALs

On Fri, Jun 29, 2018 at 10:39:00PM +0900, Michael Paquier wrote:

On Fri, Jun 29, 2018 at 08:48:33AM -0400, Robert Haas wrote:

Are there other instances of fsync() that also need to be covered?

Yeah, I double-checked the calls to pg_fsync back when I looked at this
patch, but now that I look again I see at least two more of them:
- fsync_fname_ext.

This one is used internally by things like durable_rename or such, on
which I am not sure that we actually can pass down correctly a wait
event as multiple syncs may happen within each call, so that would
require a weird API layer.

- write_auto_conf_file, where a wait event for a write() is missing as
well.

I have been looking at the archives and this has been left out on
purpose:
/messages/by-id/CAGPqQf0bzVfTTZdxcr4qHb3WDbn=+iH6sWchbN4HGjhwtcbPYQ@mail.gmail.com

Hm. I am wondering if it would not be worth extending pg_fsync() with
an argument for a wait event and introduce a sort of pg_write() which
wraps write() with an extra wait event argument, and similarly for
read() (warning, conflict with Windows ahead!). That's somehow related
to the feeling I had when working with transient file's writes and reads
a couple of days back. Those are most likely going to be forgotten
again and again in the future. In both cases the caller would still be
responsible for looking at errno and decide the error handling, but I
got no feedback about the idea on transient files.

That would finish by being sort of ugly as durable_rename or such would
also need some treatment. That would be quite invasive.

There are also a couple of places where the wait events are longer than
they should. For example in writeTimeLineHistory, there is a wait event
for write() which is still switched on even within an error code path.
And on top of it I think that a call to pgstat_report_wait_end() is
missing in the error code path as on ERROR the session still exists.
That's a bug. Those need an extra lookup and visibly some cleanup back
to v10.

On ERROR a backend would switch back quickly on ClientRead. Perhaps
we'd want to call pgstat_report_wait_end() when an error or higher is
thrown? That's a different discussion.

So at the end, I would like to use the proposed patch and call it a
day. Thoughts?
--
Michael

Craig Ringer

craig@2ndquadrant.com

almost 8 years ago

In reply to: Michael Paquier (#7)

Re: Monitoring time of fsyncing WALs

On 1 July 2018 at 11:29, Michael Paquier <michael@paquier.xyz> wrote:

So at the end, I would like to use the proposed patch and call it a
day. Thoughts?

Patch looks good.

I'll hijack the thread to add a few more perf/dtrace tracepoints in the WAL
code, as they were also missing. Proposed rider patch attached.

I've updated
https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events
to document how to add tracepoints.

It's not that hard to trace duration of a given function call with dtrace
without such annotations, but they make it much easier to discover where in
a large codebase to look, providing a form of documentation. With perf they
make life much easier. I should add some more to make it easier to analyse
relation extension contention on indexes and the heap, instrument btree
index ops like page splits, instrument heavyweight locking (beyond
LOCK_WAIT_START), etc. They're also handy for gdb -
https://sourceware.org/gdb/onlinedocs/gdb/Static-Probe-Points.html - also
handy for gdb's "info probes".

BTW, we might want to instrument the pgstat_ counter calls
and pgstat_report_wait_start / pgstat_report_wait_end, but it's easy enough
to use dynamic tracepoints for those so I haven't yet. Maybe even just
document them as points of interest.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Michael Paquier

michael@paquier.xyz

almost 8 years ago

In reply to: Craig Ringer (#8)

Re: Monitoring time of fsyncing WALs

On Mon, Jul 02, 2018 at 11:36:06AM +0800, Craig Ringer wrote:

On 1 July 2018 at 11:29, Michael Paquier <michael@paquier.xyz> wrote:

So at the end, I would like to use the proposed patch and call it a
day. Thoughts?

Patch looks good.

Thanks Craig for the review! I have just pushed the previous patch with
the adjustments mentioned. Regarding enforcing pg_fsync with a wait
event, I think I'll step back on this one. There are also arguments for
allowing code paths to not have wait events as that's quite low-level,
and for extensions the choice is limited as long as there is no way to
register custom names, if that ever happens, which I am not sure of.

I'll hijack the thread to add a few more perf/dtrace tracepoints in the WAL
code, as they were also missing. Proposed rider patch attached.

As that is a separate discussion and as the commit fest has already
begun with many items in the queue, could you begin a new thread and add
this stuff within a new CF entry? I would say that having more trace
points in this area could be helpful, but let's discuss that
appropriately so as your proposal catches proper attention.

BTW, we might want to instrument the pgstat_ counter calls
and pgstat_report_wait_start / pgstat_report_wait_end, but it's easy enough
to use dynamic tracepoints for those so I haven't yet. Maybe even just
document them as points of interest.

Perhaps. Feel free to propose anything you have in mind of course.
--
Michael

Monitoring time of fsyncing WALs

Attachments:

Attachments:

Attachments: