wait events for disk I/O

Started by Rushabh Lathiaabout 9 years ago24 messageshackers
Jump to latest
#1Rushabh Lathia
rushabh.lathia@gmail.com

Hi All,

Attached is the patch, which extend the existing wait event infrastructure
to
implement the wait events for the disk I/O. Basically pg_stat_activity's
wait
event information to show data about disk I/O as well as IPC primitives.

Implementation details:

- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
FileWriteback, FileSync, and FileTruncate. Set this wait event just before
performing the file system operation and clear it just after.
- Pass down an appropriate wait event from caller of any of those
functions.
- Also set and clear a wait event around standalone calls to read(),
write(), fsync() in other parts of the system.
- Added documentation for all newly added wait event.

Open issue:
- Might missed few standalone calls to read(), write(), etc which need
to pass the wait_event_info.

Thanks to my colleague Robert Haas for his help in design.

Please let me know your thought, and thanks for reading.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

Attachments:

wait_event_disk_IO.patchbinary/octet-stream; name=wait_event_disk_IO.patchDownload+823-30
#2Michael Paquier
michael@paquier.xyz
In reply to: Rushabh Lathia (#1)
Re: wait events for disk I/O

On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Attached is the patch, which extend the existing wait event infrastructure
to implement the wait events for the disk I/O. Basically pg_stat_activity's
wait event information to show data about disk I/O as well as IPC primitives.

Implementation details:

- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
FileWriteback, FileSync, and FileTruncate. Set this wait event just before
performing the file system operation and clear it just after.
- Pass down an appropriate wait event from caller of any of those
functions.
- Also set and clear a wait event around standalone calls to read(),
write(), fsync() in other parts of the system.
- Added documentation for all newly added wait event.

Looks neat, those are unlikely to overlap with other wait events.

Open issue:
- Might missed few standalone calls to read(), write(), etc which need
to pass the wait_event_info.

It may be an idea to use LD_PRELOAD with custom versions of read(),
write() and fsync(), and look at the paths where no flags are set in
MyProc->wait_event_info, and log information when that happens.

Thanks to my colleague Robert Haas for his help in design.
Please let me know your thought, and thanks for reading.

Did you consider a wrapper of the type pg_read_event() or
pg_write_event(), etc.?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Michael Paquier (#2)
Re: wait events for disk I/O

On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Attached is the patch, which extend the existing wait event

infrastructure

to implement the wait events for the disk I/O. Basically

pg_stat_activity's

wait event information to show data about disk I/O as well as IPC

primitives.

Implementation details:

- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
FileWriteback, FileSync, and FileTruncate. Set this wait event just

before

performing the file system operation and clear it just after.
- Pass down an appropriate wait event from caller of any of those
functions.
- Also set and clear a wait event around standalone calls to read(),
write(), fsync() in other parts of the system.
- Added documentation for all newly added wait event.

Looks neat, those are unlikely to overlap with other wait events.

Thanks.

Open issue:
- Might missed few standalone calls to read(), write(), etc which need
to pass the wait_event_info.

It may be an idea to use LD_PRELOAD with custom versions of read(),
write() and fsync(), and look at the paths where no flags are set in
MyProc->wait_event_info, and log information when that happens.

Yes, may be I will try this.

Thanks to my colleague Robert Haas for his help in design.
Please let me know your thought, and thanks for reading.

Did you consider a wrapper of the type pg_read_event() or
pg_write_event(), etc.?

I thought on that, but eventually stick to this approach as it looks
more neat and uniform with other wait event implementation.

--
Michael

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

#4Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rushabh Lathia (#3)
Re: wait events for disk I/O

My colleague Rahila reported compilation issue with
the patch. Issue was only coming with we do the clean
build on the branch.

Fixed the same into latest version of patch.

Thanks,

On Tue, Jan 31, 2017 at 11:09 AM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <
michael.paquier@gmail.com> wrote:

On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Attached is the patch, which extend the existing wait event

infrastructure

to implement the wait events for the disk I/O. Basically

pg_stat_activity's

wait event information to show data about disk I/O as well as IPC

primitives.

Implementation details:

- Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
- Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
FileWriteback, FileSync, and FileTruncate. Set this wait event just

before

performing the file system operation and clear it just after.
- Pass down an appropriate wait event from caller of any of those
functions.
- Also set and clear a wait event around standalone calls to read(),
write(), fsync() in other parts of the system.
- Added documentation for all newly added wait event.

Looks neat, those are unlikely to overlap with other wait events.

Thanks.

Open issue:
- Might missed few standalone calls to read(), write(), etc which need
to pass the wait_event_info.

It may be an idea to use LD_PRELOAD with custom versions of read(),
write() and fsync(), and look at the paths where no flags are set in
MyProc->wait_event_info, and log information when that happens.

Yes, may be I will try this.

Thanks to my colleague Robert Haas for his help in design.
Please let me know your thought, and thanks for reading.

Did you consider a wrapper of the type pg_read_event() or
pg_write_event(), etc.?

I thought on that, but eventually stick to this approach as it looks
more neat and uniform with other wait event implementation.

--
Michael

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

--
Rushabh Lathia

Attachments:

wait_event_for_disk_IO_v1.patchapplication/x-download; name=wait_event_for_disk_IO_v1.patchDownload+816-29
#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Rushabh Lathia (#4)
Re: wait events for disk I/O

On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

My colleague Rahila reported compilation issue with
the patch. Issue was only coming with we do the clean
build on the branch.

Fixed the same into latest version of patch.

Few assorted comments:

1.
+        <row>
+         <entry><literal>WriteBuffile</></entry>
+         <entry>Waiting during
buffile read operation.</entry>
+        </row>

Operation name and definition are not matching.

2.
+FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
 {
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
  int returnCode;
@@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount)
  if (returnCode < 0)
  return returnCode;
+ pgstat_report_wait_start(wait_event_info);
  returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
    POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();

AFAIK, this call is non-blocking and will just initiate a read, so do
you think we should record wait event for such a call.

3.
- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len,
+ WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK);
  if (written != len)
  ereport(ERROR,
  (errcode_for_file_access(),
@@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state)
  hash_seq_init(&seq_status, state->rs_logical_mappings);
  while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
  {
- if (FileSync(src->vfd) != 0)
+ if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0)

Do we want to consider recording wait event for both write and sync?
It seems to me OS level writes are relatively cheap and sync calls are
expensive, so shouldn't we just log for sync calls? I could see the
similar usage at multiple places in the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Amit Kapila (#5)
Re: wait events for disk I/O

On Sat, Mar 4, 2017 at 7:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 20, 2017 at 4:04 PM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

My colleague Rahila reported compilation issue with
the patch. Issue was only coming with we do the clean
build on the branch.

Fixed the same into latest version of patch.

Few assorted comments:

1.
+        <row>
+         <entry><literal>WriteBuffile</></entry>
+         <entry>Waiting during
buffile read operation.</entry>
+        </row>

Operation name and definition are not matching.

Will fix this.

2.
+FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
{
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
int returnCode;
@@ -1527,8 +1527,10 @@ FilePrefetch(File file, off_t offset, int amount)
if (returnCode < 0)
return returnCode;
+ pgstat_report_wait_start(wait_event_info);
returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
POSIX_FADV_WILLNEED);
+ pgstat_report_wait_end();

AFAIK, this call is non-blocking and will just initiate a read, so do
you think we should record wait event for such a call.

Yes, prefecth call is a non-blocking and will just initiate a read. But
this info
about the prefetch into wait events will give more info about the system.

3.

- written = FileWrite(src->vfd, waldata_start, len);
+ written = FileWrite(src->vfd, waldata_start, len,
+ WAIT_EVENT_WRITE_REWRITE_DATA_BLOCK);
if (written != len)
ereport(ERROR,
(errcode_for_file_access(),
@@ -957,7 +960,7 @@ logical_end_heap_rewrite(RewriteState state)
hash_seq_init(&seq_status, state->rs_logical_mappings);
while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) !=
NULL)
{
- if (FileSync(src->vfd) != 0)
+ if (FileSync(src->vfd, WAIT_EVENT_SYNC_REWRITE_DATA_BLOCK) != 0)

Do we want to consider recording wait event for both write and sync?
It seems to me OS level writes are relatively cheap and sync calls are
expensive, so shouldn't we just log for sync calls? I could see the
similar usage at multiple places in the patch.

Yes, I thought of adding wait event only for the sync but then recording the
wait event for both write and sync. I understand that OS level writes are
cheap but it still have some cost attached to that. Also I thought for the
monitoring tool being develop using this wait events, will have more useful
capture data if we try to collect as much info as we can. Or may be not.

I am open for other opinion/suggestions.

--

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Rushabh Lathia

#7Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#6)
Re: wait events for disk I/O

On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:

Yes, I thought of adding wait event only for the sync but then recording the
wait event for both write and sync. I understand that OS level writes are
cheap but it still have some cost attached to that. Also I thought for the
monitoring tool being develop using this wait events, will have more useful
capture data if we try to collect as much info as we can. Or may be not.

I am open for other opinion/suggestions.

Writes are NOT always fast. I've seen cases of write() blocking for
LONG periods of time on systems that are in the process of failing, or
just busy. So I think we certainly want to advertise a wait event for
those.

Likewise, I agree that the prefetch call probably SHOULDN'T block, but
just because it shouldn't doesn't mean it never will.

I think somebody should try a pgbench run with this patch applied,
using a scale factor greater than shared_buffers, and generate a wait
event profile, just to see if these are showing up and how often.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#7)
Re: wait events for disk I/O

On Tue, Mar 7, 2017 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:

Yes, I thought of adding wait event only for the sync but then recording the
wait event for both write and sync. I understand that OS level writes are
cheap but it still have some cost attached to that. Also I thought for the
monitoring tool being develop using this wait events, will have more useful
capture data if we try to collect as much info as we can. Or may be not.

I am open for other opinion/suggestions.

Writes are NOT always fast. I've seen cases of write() blocking for
LONG periods of time on systems that are in the process of failing, or
just busy. So I think we certainly want to advertise a wait event for
those.

+1. I see that quite often really happens, and a lot particularly on
overloaded VMs. Having to debug such systems until you notice that
what is slow is just the underlying system is no fun. So being able to
show up an event state in memory where we can say that things are
stuck on I/O just with PG interface can really help in diagnostics
without having to look at storage-level logs for a first impression.
-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#7)
Re: wait events for disk I/O

On Tue, Mar 7, 2017 at 5:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 6, 2017 at 3:27 AM, Rushabh Lathia <rushabh.lathia@gmail.com> wrote:

Yes, I thought of adding wait event only for the sync but then recording the
wait event for both write and sync. I understand that OS level writes are
cheap but it still have some cost attached to that. Also I thought for the
monitoring tool being develop using this wait events, will have more useful
capture data if we try to collect as much info as we can. Or may be not.

I am open for other opinion/suggestions.

Writes are NOT always fast. I've seen cases of write() blocking for
LONG periods of time on systems that are in the process of failing, or
just busy. So I think we certainly want to advertise a wait event for
those.

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

Likewise, I agree that the prefetch call probably SHOULDN'T block, but
just because it shouldn't doesn't mean it never will.

I think somebody should try a pgbench run with this patch applied,
using a scale factor greater than shared_buffers, and generate a wait
event profile, just to see if these are showing up and how often.

Yeah, that makes sense to me and we should try for both read-write and
read-only tests.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#9)
Re: wait events for disk I/O

On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

I think writes have a chance of blocking in cases even in cases that
are not very obscure at all.

I think somebody should try a pgbench run with this patch applied,
using a scale factor greater than shared_buffers, and generate a wait
event profile, just to see if these are showing up and how often.

Yeah, that makes sense to me and we should try for both read-write and
read-only tests.

wfm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#10)
Re: wait events for disk I/O

On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

I think writes have a chance of blocking in cases even in cases that
are not very obscure at all.

Point taken for writes, but I think in general we should have some
criteria based on which we can decide whether to have a wait event for
a particular call. It should not happen that we have tons of wait
events and out of which, only a few are helpful in most of the cases
in real-world scenarios.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#11)
Re: wait events for disk I/O

On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

I think writes have a chance of blocking in cases even in cases that
are not very obscure at all.

Point taken for writes, but I think in general we should have some
criteria based on which we can decide whether to have a wait event for
a particular call. It should not happen that we have tons of wait
events and out of which, only a few are helpful in most of the cases
in real-world scenarios.

Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#12)
Re: wait events for disk I/O

On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

I think writes have a chance of blocking in cases even in cases that
are not very obscure at all.

Point taken for writes, but I think in general we should have some
criteria based on which we can decide whether to have a wait event for
a particular call. It should not happen that we have tons of wait
events and out of which, only a few are helpful in most of the cases
in real-world scenarios.

Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.

Yes, I agree with Robert. Knowing what we want and what we don't
want is difficult to judge. Something which we might think its not useful
information, and later of end up into situation where we re-think about
adding those missing stuff is not good. Having more information about
the system, specially for monitoring purpose is always good.

I am attaching another version of the patch, as I found stupid mistake
in the earlier version of patch, where I missed to initialize initial value
to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.

--
Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

wait_event_for_disk_IO_v2.patchapplication/x-download; name=wait_event_for_disk_IO_v2.patchDownload+816-29
#14Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Rushabh Lathia (#13)
Re: wait events for disk I/O

On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

I am attaching another version of the patch, as I found stupid mistake
in the earlier version of patch, where I missed to initialize initial
value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.

I have applied attached patch, set shared_buffers to 128kB and ran pgbench,
I am able to see below distinct IO events.

--with ./pgbench -i -s 500 postgres
application_name wait_event_type wait_event query
pgbench IO ExtendDataBlock
copy pgbench_account
pgbench IO WriteXLog
copy pgbench_account
pgbench IO WriteDataBlock
copy pgbench_account
pgbench IO ReadDataBlock
vacuum analyze pgben
pgbench IO ReadBuffile
alter table pgbench_

--with ./pgbench -T 600 postgres (readwrite)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock
UPDATE pgbench_accou
pgbench IO WriteDataBlock
UPDATE pgbench_telle
IO SyncDataBlock

pgbench IO SyncDataBlock
UPDATE pgbench_telle
IO SyncDataBlock
autovacuum: VACUUM A
pgbench IO WriteXLog
END;
pgbench IO ExtendDataBlock
copy pgbench_account

--with ./pgbench -T 600 -S postgres (select only)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock
SELECT abalance FROM

Attached excel with all IO event values.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

wait_event_for_disk_io.odsapplication/vnd.oasis.opendocument.spreadsheet; name=wait_event_for_disk_io.odsDownload+0-1
#15Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rajkumar Raghuwanshi (#14)
Re: wait events for disk I/O

Thanks Rajkumar for performing tests on this patch.

Yes, I also noticed similar results in my testing. Additionally sometime I
also
noticed ReadSLRUPage event on my system.

I also run the reindex database command and I notices below IO events.

SyncImmedRelation,
WriteDataBlock
WriteBuffile,
WriteXLog,
ReadDataBlock

On Wed, Mar 8, 2017 at 6:41 PM, Rajkumar Raghuwanshi <
rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

I am attaching another version of the patch, as I found stupid mistake
in the earlier version of patch, where I missed to initialize initial
value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.

I have applied attached patch, set shared_buffers to 128kB and ran
pgbench, I am able to see below distinct IO events.

--with ./pgbench -i -s 500 postgres
application_name wait_event_type wait_event query
pgbench IO ExtendDataBlock
copy pgbench_account
pgbench IO WriteXLog
copy pgbench_account
pgbench IO WriteDataBlock
copy pgbench_account
pgbench IO ReadDataBlock
vacuum analyze pgben
pgbench IO ReadBuffile
alter table pgbench_

--with ./pgbench -T 600 postgres (readwrite)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock
UPDATE pgbench_accou
pgbench IO WriteDataBlock
UPDATE pgbench_telle
IO SyncDataBlock

pgbench IO SyncDataBlock
UPDATE pgbench_telle
IO SyncDataBlock
autovacuum: VACUUM A
pgbench IO WriteXLog
END;
pgbench IO ExtendDataBlock
copy pgbench_account

--with ./pgbench -T 600 -S postgres (select only)
application_name wait_event_type wait_event query
pgbench IO ReadDataBlock
SELECT abalance FROM

Attached excel with all IO event values.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

--
Rushabh Lathia
www.EnterpriseDB.com

#16Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Rushabh Lathia (#15)
Re: wait events for disk I/O

On Thu, Mar 9, 2017 at 10:54 AM, Rushabh Lathia
<rushabh.lathia@gmail.com> wrote:

Thanks Rajkumar for performing tests on this patch.

Yes, I also noticed similar results in my testing. Additionally sometime I
also
noticed ReadSLRUPage event on my system.

I also run the reindex database command and I notices below IO events.

SyncImmedRelation,
WriteDataBlock
WriteBuffile,
WriteXLog,
ReadDataBlock

I have tried for generating some more events, by running pgbench on
master/slave configuration, able to see three more events.
WriteInitXLogFile
SyncInitXLogFile
ReadXLog

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Rahila Syed
rahilasyed90@gmail.com
In reply to: Rushabh Lathia (#13)
Re: wait events for disk I/O

Hello,

I applied and tested this patch on latest sources and it works fine.

Following are some comments,

+ /* Wait event for SNRU */
+ WAIT_EVENT_READ_SLRU_PAGE,

Typo in the comment.

FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush,

WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback() which can flush more than one block so
should WAIT_EVENT _FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?

Should calls to write() in following functions be tracked too?
qtext_store() - This is related to pg_stat_statements

dsm_impl_mmap() - This is in relation to creating dsm segments.

write_auto_conf_file()- This is called when updated configuration
parameters are
written to a temp file.

Thank you,
Rahila Syed

On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

Show quoted text

On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

I think writes have a chance of blocking in cases even in cases that
are not very obscure at all.

Point taken for writes, but I think in general we should have some
criteria based on which we can decide whether to have a wait event for
a particular call. It should not happen that we have tons of wait
events and out of which, only a few are helpful in most of the cases
in real-world scenarios.

Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.

Yes, I agree with Robert. Knowing what we want and what we don't
want is difficult to judge. Something which we might think its not useful
information, and later of end up into situation where we re-think about
adding those missing stuff is not good. Having more information about
the system, specially for monitoring purpose is always good.

I am attaching another version of the patch, as I found stupid mistake
in the earlier version of patch, where I missed to initialize initial
value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.

--
Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Rahila Syed (#17)
Re: wait events for disk I/O

Thanks Rahila for reviewing this patch.

On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

Hello,

I applied and tested this patch on latest sources and it works fine.

Following are some comments,

+ /* Wait event for SNRU */
+ WAIT_EVENT_READ_SLRU_PAGE,

Typo in the comment.

Fixed.

FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush,

WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback() which can flush more than one block so
should WAIT_EVENT _FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?

Changed with WAIT_EVENT_FLUSH_DATA_BLOCKS.

Should calls to write() in following functions be tracked too?
qtext_store() - This is related to pg_stat_statements

I am not quite sure about this, as this is for stat statements. Also part
from the
place you found there are many other fwrite() call into pg_stat_statements,
and
I intentionally haven't added event here as it is very small write about
stat, and
doesn't look like we should add for those call.

dsm_impl_mmap() - This is in relation to creating dsm segments.

Added new event here. Actually particular write call is zero-filling the
DSM file.

write_auto_conf_file()- This is called when updated configuration
parameters are
written to a temp file.

write_auto_conf_file() is getting called during the ALTER SYSTEM call. Here
write
happen only when someone explicitly run the ALTER SYSTEM call. This is
administrator call and so doesn't seem like necessary to add separate wait
event
for this.

PFA latest patch with other fixes.

On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

I think writes have a chance of blocking in cases even in cases that
are not very obscure at all.

Point taken for writes, but I think in general we should have some
criteria based on which we can decide whether to have a wait event for
a particular call. It should not happen that we have tons of wait
events and out of which, only a few are helpful in most of the cases
in real-world scenarios.

Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.

Yes, I agree with Robert. Knowing what we want and what we don't
want is difficult to judge. Something which we might think its not useful
information, and later of end up into situation where we re-think about
adding those missing stuff is not good. Having more information about
the system, specially for monitoring purpose is always good.

I am attaching another version of the patch, as I found stupid mistake
in the earlier version of patch, where I missed to initialize initial
value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.

--
Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Regards,
Rushabh Lathia
www.EnterpriseDB.com
The Enterprise PostgreSQL Company

Attachments:

wait_event_for_disk_IO_v3.patchtext/x-patch; charset=US-ASCII; name=wait_event_for_disk_IO_v3.patchDownload+829-29
#19Rahila Syed
rahilasyed90@gmail.com
In reply to: Rushabh Lathia (#18)
Re: wait events for disk I/O

Thank you for the updated patch.

I have applied and tested it on latest sources and the patch looks good to
me.

I am not quite sure about this, as this is for stat statements. Also part

from the

place you found there are many other fwrite() call into

pg_stat_statements, and

I intentionally haven't added event here as it is very small write about

stat, and

doesn't look like we should add for those call.

I agree that this writes less amount of data. Although tracking this can
be useful too in scenarios where pg_stat_statements is lagging due to I/O
bottleneck.
I will leave this decision to the committer.

Thank you,
Rahila Syed

On Wed, Mar 15, 2017 at 1:03 PM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

Show quoted text

Thanks Rahila for reviewing this patch.

On Tue, Mar 14, 2017 at 8:13 PM, Rahila Syed <rahilasyed90@gmail.com>
wrote:

Hello,

I applied and tested this patch on latest sources and it works fine.

Following are some comments,

+ /* Wait event for SNRU */
+ WAIT_EVENT_READ_SLRU_PAGE,

Typo in the comment.

Fixed.

FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush,

WAIT_EVENT_FLUSH_DATA_BLOCK);
This call is inside mdwriteback() which can flush more than one block so
should WAIT_EVENT _FLUSH_DATA_BLOCK
be renamed to WAIT_EVENT_FLUSH_DATA_BLOCKS?

Changed with WAIT_EVENT_FLUSH_DATA_BLOCKS.

Should calls to write() in following functions be tracked too?
qtext_store() - This is related to pg_stat_statements

I am not quite sure about this, as this is for stat statements. Also part
from the
place you found there are many other fwrite() call into
pg_stat_statements, and
I intentionally haven't added event here as it is very small write about
stat, and
doesn't look like we should add for those call.

dsm_impl_mmap() - This is in relation to creating dsm segments.

Added new event here. Actually particular write call is zero-filling the
DSM file.

write_auto_conf_file()- This is called when updated configuration
parameters are
written to a temp file.

write_auto_conf_file() is getting called during the ALTER SYSTEM call.
Here write
happen only when someone explicitly run the ALTER SYSTEM call. This is
administrator call and so doesn't seem like necessary to add separate wait
event
for this.

PFA latest patch with other fixes.

On Wed, Mar 8, 2017 at 4:50 PM, Rushabh Lathia <rushabh.lathia@gmail.com>
wrote:

On Wed, Mar 8, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Tue, Mar 7, 2017 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Mar 7, 2017 at 9:16 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

On Mon, Mar 6, 2017 at 9:09 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Sure, if you think both Writes and Reads at OS level can have some
chance of blocking in obscure cases, then we should add a wait event
for them.

I think writes have a chance of blocking in cases even in cases that
are not very obscure at all.

Point taken for writes, but I think in general we should have some
criteria based on which we can decide whether to have a wait event for
a particular call. It should not happen that we have tons of wait
events and out of which, only a few are helpful in most of the cases
in real-world scenarios.

Well, the problem is that if you pick and choose which wait events to
add based on what you think will be common, you're actually kind of
hosing yourself. Because now when something uncommon happens, suddenly
you don't get any wait event data and you can't tell what's happening.
I think the number of new wait events added by Rushabh's patch is
wholly reasonable. Yeah, some of those are going to be a lot more
common than others, but so what? We add wait events so that we can
find out what's going on. I don't want to sometimes know when a
backend is blocked on an I/O. I want to ALWAYS know.

Yes, I agree with Robert. Knowing what we want and what we don't
want is difficult to judge. Something which we might think its not useful
information, and later of end up into situation where we re-think about
adding those missing stuff is not good. Having more information about
the system, specially for monitoring purpose is always good.

I am attaching another version of the patch, as I found stupid mistake
in the earlier version of patch, where I missed to initialize initial
value to
WaitEventIO enum. Also earlier version was not getting cleanly apply on
the current version of sources.

--
Rushabh Lathia
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Regards,
Rushabh Lathia
www.EnterpriseDB.com
The Enterprise PostgreSQL Company

#20Robert Haas
robertmhaas@gmail.com
In reply to: Rahila Syed (#19)
Re: wait events for disk I/O

On Thu, Mar 16, 2017 at 8:28 AM, Rahila Syed <rahilasyed90@gmail.com> wrote:

Thank you for the updated patch.

I have applied and tested it on latest sources and the patch looks good to
me.

The documentation puts the new wait events in a pretty random order.
I think they should be alphabetized, like we do with the IPC events.
I also suggest we change the naming scheme so that the kind of thing
being operated on is first and this is followed by the operation name.
This will let us keep related entries next to each other after
alphabetizing. So with that principle in mind:

- instead of ReadDataBlock etc. I propose DataFileRead, DataFileWrite,
DataFileSync, DataFileExtend, DataFileFlush, DataFilePrefetch,
DataFileTruncate. using file instead of block avoids singular/plural
confusion.
- instead of RelationSync and RelationImmedSync I proposed
DataFileSync and DataFileImmediateSync; these are md.c operations like
the previous set, so why name it differently?
- instead of WriteRewriteDataBlock and SyncRewriteDataBlock and
TruncateLogicalMappingRewrite, which aren't consistent with each other
even though they are related, I propose LogicalRewriteWrite,
LogicalRewriteSync, and LogicalRewriteTruncate, which are also closer
to the names of the functions that contain those wait points
- for ReadBuffile and WriteBuffile seem OK, I propose BufFileRead and
BufFileWrite, again reversing the order and also tweaking the
capitalization
- in keeping with our new policy of referring to xlog as wal in user
visible interfaces, I propose WALRead, WALCopyRead, WALWrite,
WALInitWrite, WALCopyWrite, WALBootstrapWrite, WALInitSync,
WALBootstrapSync, WALSyncMethodAssign
- for control file ops, ControlFileRead, ControlFileWrite,
ControlFileWriteUpdate, ControlFileSync, ControlFileSyncUpdate
- ReadApplyLogicalMapping and friends seem to have to do with the
reorderbuffer code, so maybe ReorderBufferRead etc.
- there seems to be some discrepancy between the documentation and
pgstat_get_wait_io for the snapbuild stuff. maybe SnapBuildWrite,
SnapBuildSync, SnapBuildRead.
- SLRURead, SLRUWrite, etc.
- TimelineHistoryRead, etc.
- the syslogger changes should be dropped, since the syslogger is not
and should not be connected to shared memory
- the replslot terminology seems like a case of odd capitalization and
overeager abbreviation. why not ReplicationSlotRead,
ReplicationSlotWrite, etc? similarly RelationMapRead,
RelationMapWrite, etc?
- CopyFileRead, CopyFileWrite
- LockFileCreateRead, etc.
- AddToDataDirLockFileRead is a little long and incomprehensible;
maybe LockFileUpdateRead etc.
- DSMWriteZeroBytes, maybe?

Of course the constants should be renamed to match.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Rushabh Lathia (#21)
#23Rushabh Lathia
rushabh.lathia@gmail.com
In reply to: Robert Haas (#22)
#24Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Rushabh Lathia (#23)