About to add WAL write/fsync statistics to pg_stat_wal view

Started by Masahiro Ikedaover 5 years ago67 messageshackers
Jump to latest
#1Masahiro Ikeda
ikedamsh@oss.nttdata.com

Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following
information on pg_stat_wal view.
I separate the parameter from "track_io_timing" to "track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query
performance than database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time
FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write | 650 # Total number of times WAL data was written to
the disk

wal_write_time | 43 # Total amount of time that has been spent in the
portion of WAL data was written to disk
# if track-wal-io-timing is enabled, otherwise
zero

wal_sync | 78 # Total number of times WAL data was synced to the
disk

wal_sync_time | 104 # Total amount of time that has been spent in the
portion of WAL data was synced to disk
# if track-wal-io-timing is enabled, otherwise
zero
```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

0001_add_wal_io_activity_to_the_pg_stat_wal.patchtext/x-diff; name=0001_add_wal_io_activity_to_the_pg_stat_wal.patchDownload+192-12
#2Japin Li
japinli@hotmail.com
In reply to: Masahiro Ikeda (#1)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi,

On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following information on pg_stat_wal view.
I separate the parameter from "track_io_timing" to "track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query performance than database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write | 650 # Total number of times WAL data was written to the disk

wal_write_time | 43 # Total amount of time that has been spent in the portion of WAL data was written to disk
# if track-wal-io-timing is enabled, otherwise zero

wal_sync | 78 # Total number of times WAL data was synced to the disk

wal_sync_time | 104 # Total amount of time that has been spent in the portion of WAL data was synced to disk
# if track-wal-io-timing is enabled, otherwise zero
```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>

There is a no previous prototype warning for ‘fsyncMethodCalled’, and it now only used in xlog.c,
should we declare with static? And this function wants a boolean as a return, should we use
true/false other than 0/1?

+/*
+ * Check if fsync mothod is called.
+ */
+bool
+fsyncMethodCalled()
+{
+       if (!enableFsync)
+               return 0;
+
+       switch (sync_method)
+       {
+               case SYNC_METHOD_FSYNC:
+               case SYNC_METHOD_FSYNC_WRITETHROUGH:
+               case SYNC_METHOD_FDATASYNC:
+                       return 1;
+               default:
+                       /* others don't have a specific fsync method */
+                       return 0;
+       }
+}
+

--
Best regards
ChengDu WenWu Information Technology Co.,Ltd.
Japin Li

#3Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Japin Li (#2)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2020-12-08 16:45, Li Japin wrote:

Hi,

On Dec 8, 2020, at 1:06 PM, Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:

Hi,

I propose to add wal write/fsync statistics to pg_stat_wal view.
It's useful not only for developing/improving source code related to
WAL
but also for users to detect workload changes, HW failure, and so on.

I introduce "track_wal_io_timing" parameter and provide the following
information on pg_stat_wal view.
I separate the parameter from "track_io_timing" to
"track_wal_io_timing"
because IIUC, WAL I/O activity may have a greater impact on query
performance than database I/O activity.

```
postgres=# SELECT wal_write, wal_write_time, wal_sync, wal_sync_time
FROM pg_stat_wal;
-[ RECORD 1 ]--+----
wal_write | 650 # Total number of times WAL data was written to
the disk

wal_write_time | 43 # Total amount of time that has been spent in
the portion of WAL data was written to disk
# if track-wal-io-timing is enabled, otherwise
zero

wal_sync | 78 # Total number of times WAL data was synced to
the disk

wal_sync_time | 104 # Total amount of time that has been spent in
the portion of WAL data was synced to disk
# if track-wal-io-timing is enabled, otherwise
zero
```

What do you think?
Please let me know your comments.

Regards
--
Masahiro Ikeda
NTT DATA
CORPORATION<0001_add_wal_io_activity_to_the_pg_stat_wal.patch>

There is a no previous prototype warning for ‘fsyncMethodCalled’, and
it now only used in xlog.c,
should we declare with static? And this function wants a boolean as a
return, should we use
true/false other than 0/1?

+/*
+ * Check if fsync mothod is called.
+ */
+bool
+fsyncMethodCalled()
+{
+       if (!enableFsync)
+               return 0;
+
+       switch (sync_method)
+       {
+               case SYNC_METHOD_FSYNC:
+               case SYNC_METHOD_FSYNC_WRITETHROUGH:
+               case SYNC_METHOD_FDATASYNC:
+                       return 1;
+               default:
+                       /* others don't have a specific fsync method */
+                       return 0;
+       }
+}
+

Thanks for your review.
I agree with your comments. I fixed them.

Regards
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

0002_add_wal_io_activity_to_the_pg_stat_wal.patchtext/x-diff; name=0002_add_wal_io_activity_to_the_pg_stat_wal.patchDownload+194-12
#4Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiro Ikeda (#3)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi,

I rebased the patch to the master branch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

0003_add_wal_io_activity_to_the_pg_stat_wal.patchtext/x-diff; name=0003_add_wal_io_activity_to_the_pg_stat_wal.patchDownload+194-12
#5Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiro Ikeda (#4)
RE: About to add WAL write/fsync statistics to pg_stat_wal view

Dear Ikeda-san,

This patch cannot be applied to the HEAD, but anyway I put a comment.

```
+	/*
+	 * Measure i/o timing to fsync WAL data.
+	 *
+	 * The wal receiver skip to collect it to avoid performance degradation of standy servers.
+	 * If sync_method doesn't have its fsync method, to skip too.
+	 */
+	if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+		INSTR_TIME_SET_CURRENT(start);
```

I think m_wal_sync_time should be collected even if the process is WalRecevier.
Because all wal_fsync should be recorded, and
some performance issues have been aleady occurred if track_wal_io_timing is turned on.
I think it's strange only to take care of the walrecevier case.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#6Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiro Ikeda (#4)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.

After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.

* How about checking track_wal_io_timing first?

* s/standy/standby/

---
+   /* increment the i/o timing and the number of times to fsync WAL data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+       return false;
+
+   switch (sync_method)
+   {
+       case SYNC_METHOD_FSYNC:
+       case SYNC_METHOD_FSYNC_WRITETHROUGH:
+       case SYNC_METHOD_FDATASYNC:
+           return true;
+       default:
+           /* others don't have a specific fsync method */
+           return false;
+   }
+}

* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, calling to fsyncMethodCalled() can be eliminated.
That way, this function is called at only once. So do we really need
this function?

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

* void is missing as argument?

* s/mothod/method/

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#7Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Hayato Kuroda (Fujitsu) (#5)
RE: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-22 11:54, kuroda.hayato@fujitsu.com wrote:

Dear Ikeda-san,

This patch cannot be applied to the HEAD, but anyway I put a comment.

```
+	/*
+	 * Measure i/o timing to fsync WAL data.
+	 *
+	 * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+	 * If sync_method doesn't have its fsync method, to skip too.
+	 */
+	if (!AmWalReceiverProcess() && track_wal_io_timing && 
fsyncMethodCalled())
+		INSTR_TIME_SET_CURRENT(start);
```

I think m_wal_sync_time should be collected even if the process is
WalRecevier.
Because all wal_fsync should be recorded, and
some performance issues have been aleady occurred if
track_wal_io_timing is turned on.
I think it's strange only to take care of the walrecevier case.

Kuroda-san, Thanks for your comments.

Although I thought that the performance impact may be bigger in standby
servers
because WALReceiver didn't use wal buffers, it's no need to be
considered.
I agreed that if track_wal_io_timing is turned on, the primary server's
performance degradation occurs too.

I will make rebased and modified.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#8Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiko Sawada (#6)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing && 
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

---
+   /* increment the i/o timing and the number of times to fsync WAL 
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time += 
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+       return false;
+
+   switch (sync_method)
+   {
+       case SYNC_METHOD_FSYNC:
+       case SYNC_METHOD_FSYNC_WRITETHROUGH:
+       case SYNC_METHOD_FDATASYNC:
+           return true;
+       default:
+           /* others don't have a specific fsync method */
+           return false;
+   }
+}

* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, calling to fsyncMethodCalled() can be eliminated.
That way, this function is called at only once. So do we really need
this function?

Thanks to your comments, I removed them.

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

IIUC, the reason is that I thought that each fsync functions like
pg_fsync_no_writethrough() check enableFsync.

If this code doesn't check, m_wal_sync_time may be incremented
even though some sync methods like SYNC_METHOD_OPEN don't call to sync
some data to the disk at the time.

* void is missing as argument?

* s/mothod/method/

I removed them.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v4-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v4-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload+183-16
#9Japin Li
japinli@hotmail.com
In reply to: Masahiro Ikeda (#8)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi, Masahiro

Thanks for you update the v4 patch. Here are some comments:

(1)
+       char            *msg = NULL;
+       bool            sync_called;            /* whether to sync data to the disk. */
+       instr_time      start;
+       instr_time      duration;
+
+       /* check whether to sync data to the disk is really occurred. */
+       sync_called = false;

Maybe we can initialize the "sync_called" variable when declare it.

(2)
+       if (sync_called)
+       {
+               /* increment the i/o timing and the number of times to fsync WAL data */
+               if (track_wal_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(duration);
+                       INSTR_TIME_SUBTRACT(duration, start);
+                       WalStats.m_wal_sync_time =  INSTR_TIME_GET_MICROSEC(duration);
+               }
+
+               WalStats.m_wal_sync++;
+       }

There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

In the issue_xlog_fsync(), the comment says that if sync_method is
SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced.
Does that mean it synced when write the WAL data? And for those cases, we
cannot get accurate write/sync timing and number of write/sync times, right?

case SYNC_METHOD_OPEN:
case SYNC_METHOD_OPEN_DSYNC:
/* write synced it already */
break;

On Fri, 22 Jan 2021 at 21:05, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing && 
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

---
+   /* increment the i/o timing and the number of times to fsync WAL 
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time += 
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+       return false;
+
+   switch (sync_method)
+   {
+       case SYNC_METHOD_FSYNC:
+       case SYNC_METHOD_FSYNC_WRITETHROUGH:
+       case SYNC_METHOD_FDATASYNC:
+           return true;
+       default:
+           /* others don't have a specific fsync method */
+           return false;
+   }
+}

* I'm concerned that the function name could confuse the reader
because it's called even before the fsync method is called. As I
commented above, calling to fsyncMethodCalled() can be eliminated.
That way, this function is called at only once. So do we really need
this function?

Thanks to your comments, I removed them.

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

IIUC, the reason is that I thought that each fsync functions like
pg_fsync_no_writethrough() check enableFsync.

If this code doesn't check, m_wal_sync_time may be incremented
even though some sync methods like SYNC_METHOD_OPEN don't call to sync
some data to the disk at the time.

* void is missing as argument?

* s/mothod/method/

I removed them.

Regards,

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

#10Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Japin Li (#9)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi, Japin

Thanks for your comments.

On 2021-01-23 01:46, japin wrote:

Hi, Masahiro

Thanks for you update the v4 patch. Here are some comments:

(1)
+       char            *msg = NULL;
+       bool            sync_called;            /* whether to sync
data to the disk. */
+       instr_time      start;
+       instr_time      duration;
+
+       /* check whether to sync data to the disk is really occurred. 
*/
+       sync_called = false;

Maybe we can initialize the "sync_called" variable when declare it.

Yes, I fixed it.

(2)
+       if (sync_called)
+       {
+               /* increment the i/o timing and the number of times to
fsync WAL data */
+               if (track_wal_io_timing)
+               {
+                       INSTR_TIME_SET_CURRENT(duration);
+                       INSTR_TIME_SUBTRACT(duration, start);
+                       WalStats.m_wal_sync_time =
INSTR_TIME_GET_MICROSEC(duration);
+               }
+
+               WalStats.m_wal_sync++;
+       }

There is an extra space before INSTR_TIME_GET_MICROSEC(duration).

Yes, I removed it.

In the issue_xlog_fsync(), the comment says that if sync_method is
SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC, it already write synced.
Does that mean it synced when write the WAL data? And for those cases,
we
cannot get accurate write/sync timing and number of write/sync times,
right?

case SYNC_METHOD_OPEN:
case SYNC_METHOD_OPEN_DSYNC:
/* write synced it already */
break;

Yes, I add the following comments in the document.

@@ -3515,6 +3515,9 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
        </para>
        <para>
         Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is 
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero because WAL 
data is synced
+       when to write it).
        </para></entry>
       </row>
@@ -3525,7 +3528,10 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
        <para>
         Total amount of time that has been spent in the portion of
         WAL data was synced to disk, in milliseconds
-       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, 
otherwise zero)
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, 
otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is 
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero too because WAL 
data is synced
+       when to write it).
        </para></entry>
       </row>

I attached a modified patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v5-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; charset=us-ascii; name=v5-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload+189-16
#11Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiro Ikeda (#10)
RE: About to add WAL write/fsync statistics to pg_stat_wal view

Dear Ikeda-san,

Thank you for updating the patch. This can be applied to master, and
can be used on my RHEL7.
wal_write_time and wal_sync_time increase normally :-).

```
postgres=# select * from pg_stat_wal;
-[ RECORD 1 ]----+------------------------------
wal_records | 121781
wal_fpi | 2287
wal_bytes | 36055146
wal_buffers_full | 799
wal_write | 12770
wal_write_time | 4.469
wal_sync | 11962
wal_sync_time | 132.352
stats_reset | 2021-01-25 00:51:40.674412+00
```

I put a further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        <structfield>wal_buffers_full</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of times WAL data was written to disk because WAL buffers became full
+       Total number of times WAL data was written to disk because WAL buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero because WAL data is synced 
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled, otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is <literal>open_datasync</literal> or 
+       <literal>open_sync</literal>, this value is zero too because WAL data is synced 
+       when to write it).
       </para></entry>
      </row>
 ```

Maybe "Total amount of time" should be used, not "Total number of time."
Other views use "amount."

I have no comments anymore.

Hayato Kuroda
FUJITSU LIMITED

#12Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiro Ikeda (#8)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.

And maybe XLogWalRcvWrite() also needs to track I/O?

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

I think we can change the first switch-case to an if statement.

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

IIUC, the reason is that I thought that each fsync functions like
pg_fsync_no_writethrough() check enableFsync.

If this code doesn't check, m_wal_sync_time may be incremented
even though some sync methods like SYNC_METHOD_OPEN don't call to sync
some data to the disk at the time.

Right.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#13Japin Li
japinli@hotmail.com
In reply to: Masahiko Sawada (#12)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.

And maybe XLogWalRcvWrite() also needs to track I/O?

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

I think we can change the first switch-case to an if statement.

+1. We can also narrow the scope of "duration" into "if (track_wal_io_timing)" branch.

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

#14Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Hayato Kuroda (Fujitsu) (#11)
RE: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 10:34, kuroda.hayato@fujitsu.com wrote:

Dear Ikeda-san,

Thank you for updating the patch. This can be applied to master, and
can be used on my RHEL7.
wal_write_time and wal_sync_time increase normally :-).

```
postgres=# select * from pg_stat_wal;
-[ RECORD 1 ]----+------------------------------
wal_records | 121781
wal_fpi | 2287
wal_bytes | 36055146
wal_buffers_full | 799
wal_write | 12770
wal_write_time | 4.469
wal_sync | 11962
wal_sync_time | 132.352
stats_reset | 2021-01-25 00:51:40.674412+00
```

Thanks for checking.

I put a further comment:

```
@@ -3485,7 +3485,53 @@ SELECT pid, wait_event_type, wait_event FROM
pg_stat_activity WHERE wait_event i
<structfield>wal_buffers_full</structfield> <type>bigint</type>
</para>
<para>
-       Number of times WAL data was written to disk because WAL
buffers became full
+       Total number of times WAL data was written to disk because WAL
buffers became full
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_write</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was written to disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_write_time</structfield> <type>double 
precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was written to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_sync</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of times WAL data was synced to disk
+       (if <xref linkend="guc-wal-sync-method"/> is
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero because WAL
data is synced
+       when to write it).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para 
role="column_definition">
+       <structfield>wal_sync_time</structfield> <type>double 
precision</type>
+      </para>
+      <para>
+       Total amount of time that has been spent in the portion of
+       WAL data was synced to disk, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
otherwise zero.
+       if <xref linkend="guc-wal-sync-method"/> is
<literal>open_datasync</literal> or
+       <literal>open_sync</literal>, this value is zero too because
WAL data is synced
+       when to write it).
</para></entry>
</row>
```

Maybe "Total amount of time" should be used, not "Total number of
time."
Other views use "amount."

Thanks.

I checked columns' descriptions of other views.
There are "Number of xxx", "Total number of xxx", "Total amount of time
that xxx" and "Total time spent xxx".

Since the "time" is used for showing spending time, not count,
I'll change it to "Total number of WAL data written/synced to disk".
Thought?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#15Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiro Ikeda (#14)
RE: About to add WAL write/fsync statistics to pg_stat_wal view

Dear Ikeda-san,

I checked columns' descriptions of other views.
There are "Number of xxx", "Total number of xxx", "Total amount of time
that xxx" and "Total time spent xxx".

Right.

Since the "time" is used for showing spending time, not count,
I'll change it to "Total number of WAL data written/synced to disk".
Thought?

I misread your patch, sorry. I prefer your suggestion.
Please fix like that way with others.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#16Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiko Sawada (#12)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 10:36, Masahiko Sawada wrote:

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.
And maybe XLogWalRcvWrite() also needs to track I/O?

Thanks, I forgot to add them.
I'll fix it.

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

IIUC, I think we can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" doesn't be initialized.

My understanding is something wrong, isn't it?

I think we can change the first switch-case to an if statement.

Yes, I'll change it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#17Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Japin Li (#13)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 11:47, japin wrote:

On Mon, 25 Jan 2021 at 09:36, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.

And maybe XLogWalRcvWrite() also needs to track I/O?

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

I think we can change the first switch-case to an if statement.

+1. We can also narrow the scope of "duration" into "if
(track_wal_io_timing)" branch.

Thanks, I'll change it.

--
Masahiro Ikeda
NTT DATA CORPORATION

#18Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiro Ikeda (#16)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On 2021-01-25 13:15, Masahiro Ikeda wrote:

On 2021-01-25 10:36, Masahiko Sawada wrote:

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.
And maybe XLogWalRcvWrite() also needs to track I/O?

Thanks, I forgot to add them.
I'll fix it.

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

IIUC, I think we can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" doesn't be initialized.

My understanding is something wrong, isn't it?

I thought the following is better.

```
/* Measure i/o timing to sync WAL data.*/
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);

(do fsync and error handling here)

/* check whether to sync WAL data to the disk right now. */
if (enableFsync)
{
if ((sync_method == SYNC_METHOD_FSYNC) ||
(sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
(sync_method == SYNC_METHOD_FDATASYNC))
{
/* increment the i/o timing and the number of times to fsync WAL data
*/
if (track_wal_io_timing)
{
instr_time duration;

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
WalStats.m_wal_sync++;
}
}
```

Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
of the "sync_method" and "enableFsync", we don't make additional
variables.
But it's ok because "track_wal_io_timing" leads already performance
degradation.

What do you think?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#19Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiro Ikeda (#18)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

On Mon, Jan 25, 2021 at 1:28 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

On 2021-01-25 13:15, Masahiro Ikeda wrote:

On 2021-01-25 10:36, Masahiko Sawada wrote:

On Fri, Jan 22, 2021 at 10:05 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

On 2021-01-22 14:50, Masahiko Sawada wrote:

On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:

Hi,

I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are
comments:

---
+               if (track_wal_io_timing)
+               {
+                   INSTR_TIME_SET_CURRENT(duration);
+                   INSTR_TIME_SUBTRACT(duration, start);
+                   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+               }

* I think it should add the time in micro sec.
After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]----+-----------------------------
wal_records | 285947
wal_fpi | 53285
wal_bytes | 442008213
wal_buffers_full | 0
wal_write | 25516
wal_write_time | 0
wal_sync | 25437
wal_sync_time | 14490
stats_reset | 2021-01-22 10:56:13.29464+09

Since writes can complete less than a millisecond, wal_write_time
didn't increase. I think sync_time could also have the same problem.

Thanks for your comments. I didn't notice that.
I changed the unit from milliseconds to microseconds.

---
+   /*
+    * Measure i/o timing to fsync WAL data.
+    *
+    * The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+    * If sync_method doesn't have its fsync method, to skip too.
+    */
+   if (!AmWalReceiverProcess() && track_wal_io_timing &&
fsyncMethodCalled())
+       INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timinig
is true? I think the performance degradation is also true for backend
processes. If there is another reason for that, I think it's better to
mention in both the doc and comment.
* How about checking track_wal_io_timing first?
* s/standy/standby/

I fixed it.
As kuroda-san mentioned too, the skip is no need to be considered.

I think you also removed the code to have the wal receiver report the
stats. So with the latest patch, the wal receiver tracks those
statistics but doesn't report.
And maybe XLogWalRcvWrite() also needs to track I/O?

Thanks, I forgot to add them.
I'll fix it.

---
+   /* increment the i/o timing and the number of times to fsync WAL
data */
+   if (fsyncMethodCalled())
+   {
+       if (!AmWalReceiverProcess() && track_wal_io_timing)
+       {
+           INSTR_TIME_SET_CURRENT(duration);
+           INSTR_TIME_SUBTRACT(duration, start);
+           WalStats.m_wal_sync_time +=
INSTR_TIME_GET_MILLISEC(duration);
+       }
+
+       WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

I think if syncing the disk does not occur, m_wal_sync should not be
incremented.
It depends enableFsync and sync_method.

enableFsync is checked in each fsync method like
pg_fsync_no_writethrough(),
so if incrementing m_wal_sync after each sync operation, it should be
implemented
in each fsync method. It leads to many duplicated codes.

Right. I missed that each fsync function checks enableFsync.

So, why don't you change the function to a flag whether to
sync data to the disk will be occurred or not in issue_xlog_fsync()?

Looks better. Since we don't necessarily need to increment m_wal_sync
after doing fsync we can write the code without an additional variable
as follows:

if (enableFsync)
{
switch (sync_method)
{
case SYNC_METHOD_FSYNC:
#ifdef HAVE_FSYNC_WRITETHROUGH
case SYNC_METHOD_FSYNC_WRITETHROUGH:
#endif
#ifdef HAVE_FDATASYNC
case SYNC_METHOD_FDATASYNC:
#endif
WalStats.m_wal_sync++;
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
break;
default:
break;
}
}

(do fsync and error handling here)

/* increment the i/o timing and the number of times to fsync WAL
data */
if (track_wal_io_timing)
{
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}

IIUC, I think we can't handle the following case.

When "sync_method" is SYNC_METHOD_OPEN or SYNC_METHOD_OPEN_DSYNC and
"track_wal_io_timing" is enabled, "start" doesn't be initialized.

My understanding is something wrong, isn't it?

You're right. We might want to initialize 'start' with 0 in those two
cases and check if INSTR_TIME_IS_ZERO() later when accumulating the
I/O time.

I thought the following is better.

```
/* Measure i/o timing to sync WAL data.*/
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);

(do fsync and error handling here)

/* check whether to sync WAL data to the disk right now. */
if (enableFsync)
{
if ((sync_method == SYNC_METHOD_FSYNC) ||
(sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH) ||
(sync_method == SYNC_METHOD_FDATASYNC))
{
/* increment the i/o timing and the number of times to fsync WAL data
*/
if (track_wal_io_timing)
{
instr_time duration;

INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
}
WalStats.m_wal_sync++;
}
}
```

Although INSTR_TIME_SET_CURRENT(start) is called everytime regardless
of the "sync_method" and "enableFsync", we don't make additional
variables.
But it's ok because "track_wal_io_timing" leads already performance
degradation.

What do you think?

That also fine with me.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#20Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiko Sawada (#19)
Re: About to add WAL write/fsync statistics to pg_stat_wal view

Hi, thanks for the reviews.

I updated the attached patch.
The summary of the changes is following.

1. fix document

I followed another view's comments.

2. refactor issue_xlog_fsync()

I removed "sync_called" variables, narrowed the "duration" scope and
change the switch statement to if statement.

3. make wal-receiver report WAL statistics

I add the code to collect the statistics for a written operation
in XLogWalRcvWrite() and to report stats in WalReceiverMain().

Since WalReceiverMain() can loop fast, to avoid loading stats collector,
I add "force" argument to the pgstat_send_wal function. If "force" is
false, it can skip reporting until at least 500 msec since it last
reported. WalReceiverMain() almost calls pgstat_send_wal() with "force"
as false.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v5_v6.difftext/x-diff; name=v5_v6.diffDownload+173-57
v6-0001-Add-statistics-related-to-write-sync-wal-records.patchtext/x-diff; name=v6-0001-Add-statistics-related-to-write-sync-wal-records.patchDownload+232-21
#21Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiro Ikeda (#20)
#22Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiko Sawada (#21)
#23David G. Johnston
david.g.johnston@gmail.com
In reply to: Masahiko Sawada (#21)
#24David G. Johnston
david.g.johnston@gmail.com
In reply to: Masahiro Ikeda (#22)
#25Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: David G. Johnston (#23)
#26David G. Johnston
david.g.johnston@gmail.com
In reply to: Masahiro Ikeda (#25)
#27Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: David G. Johnston (#26)
#28Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiro Ikeda (#27)
#29Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#28)
#30Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#29)
#31David G. Johnston
david.g.johnston@gmail.com
In reply to: Masahiro Ikeda (#28)
#32Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#29)
#33Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#30)
#34Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: David G. Johnston (#31)
#35Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#34)
#36Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#35)
#37Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#36)
#38Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#37)
#39Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiro Ikeda (#38)
#40Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Masahiro Ikeda (#39)
#41Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#39)
#42Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#41)
#43Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#42)
#44Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#43)
#45Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#44)
#46Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#45)
#47Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#46)
#48David G. Johnston
david.g.johnston@gmail.com
In reply to: Fujii Masao (#47)
#49Fujii Masao
masao.fujii@gmail.com
In reply to: David G. Johnston (#48)
#50Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#42)
#51Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#50)
#52Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#51)
#53Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#52)
#54Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#53)
#55Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#54)
#56Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#55)
#57Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#55)
#58Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#56)
#59Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#57)
#60Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#58)
#61Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#59)
#62Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#61)
#63Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#62)
#64Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#63)
#65Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#64)
#66Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#65)
#67Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#66)