Patch: add timing of buffer I/O requests

Started by Ants Aasma · over 14 years ago · 89 messages · pgsql-hackers
#1 Ants Aasma
ants.aasma@cybertec.at

Hi,

I know I'm late for the ongoing commitfest, but I thought I might as well
be early for the next one.

Attached is a patch against master that implements timing of shared buffer
fills and per relation stats collection of said timings. Buffer flushes are
timed as well but aren't exposed per table because of the difficulty of
correctly attributing them.

Some notes on the implementation:
* The timing is done in bufmgr.c. Under high load some CPU contention will
get attributed to I/O because the process doing the I/O won't get a
timeslice immediately.
* I decided to also count waiting for others doing the I/O as I/O waits.
They aren't double counted in the per relation stats though.
* I added a GUC track_iotiming defaulting to off because timing isn't cheap
on all platforms.
* I used instr_time to keep the counts to be consistent with function
timings, but maybe both should be converted to plain uint64 counts to make
the arithmetic code cleaner.
* Timings are exposed via EXPLAIN (BUFFERS), pg_stat_statements and
pg_statio_* views.
* I noticed there aren't any pg_statio_xact_* views. I don't have any need
for them myself, but thought I'd mention the inconsistency.
* The feature is really useful for me with auto_explain. Even better with
Peter's pg_stat_statements query cleaning applied.

I did some testing on an older AMD Athlon X2 BE-2350 and an Intel i5 M 540
to see the overhead. The AMD CPU doesn't have the necessary features for
fast user mode timing and has an overhead of about 900ns per gettimeofday
call. The Intel processor has an overhead of 22ns per call.

I tried a read only pgbench with scalefactor 50 and shared_buffers=32MB to
induce a lot of IO traffic that hits the OS cache. Seems like it should be
the worst case for this patch.

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change. The
median was actually higher with timing enabled, but stddevs were large
enough to hide a couple of percent of performance loss. This needs some
further testing.

Preliminary results for the Intel machine with stddev (10 5min runs):
-c | master          | io-stats
 4 | 16521.53 ±4.49% | +1.16% ±3.21%
16 | 13923.49 ±5.98% | +0.56% ±7.11%

This is my first patch, so I hope I haven't missed anything too trivial.

--
Ants Aasma
ants.aasma@eesti.ee

Attachments:

io-stats.v1.patch (text/x-patch; charset=US-ASCII), +317 -72
#2 Greg Smith
gsmith@gregsmith.com
In reply to: Ants Aasma (#1)
Re: Patch: add timing of buffer I/O requests

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do
it this round. The real problem with this whole area is that we know
there are systems floating around where the amount of time taken to grab
timestamps like this is just terrible. I've been annoyed enough by that
problem to spend some time digging into why that is--seems to be a bunch
of trivia around the multiple ways to collect time info on x86
systems--and after this CommitFest is over I was already hoping to dig
through my notes and start quantifying that more. So you can't really
prove the overhead of this approach is acceptable just by showing two
examples; we need to find one of the really terrible clocks and test
there to get a real feel for the worst-case.

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here; there were also
concerns about what the data would be used for.

I've been thinking about a few ways to try and cope with this whole
class of timing problem:

-Document the underlying problem and known workarounds, provide a way to
test how bad the overhead is, and just throw our hands up and say
"sorry, you just can't instrument like this" if someone has a slow system.

-Have one of the PostgreSQL background processes keep track of a time
estimate on its own, only periodically pausing to sync against the real
time. Then most calls to gettimeofday() can use that value instead. I
was thinking of that idea for slightly longer running things though; I
doubt that can be made accurate enough to instrument buffer I/O.

And while I hate to kick off massive bike-shedding in your direction,
I'm also afraid this area--collecting stats about how long individual
operations take--will need a much wider ranging approach than just
looking at the buffer cache ones. If you step back and ask "what do
people expect here?", there's a pretty large number who really want
something like Oracle's v$session_wait and v$system_event interface for
finding the underlying source of slow things. There's enough demand for
that that EnterpriseDB has even done some work in this area too; what
I've been told about it suggests the code isn't a great fit for
contribution to community PostgreSQL though. Like I said, this area is
really messy and hard to get right.

Something more ambitious like the v$ stuff would also take care of what
you're doing here; I'm not sure that what you've done helps build it
though. Please don't take that personally. Part of one of my own
instrumentation patches recently was rejected out of hand for the same
reason, just not being general enough.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On 28 November 2011, 8:54, Greg Smith wrote:

-Have one of the PostgreSQL background processes keep track of a time
estimate on its own, only periodically pausing to sync against the real
time. Then most calls to gettimeofday() can use that value instead. I
was thinking of that idea for slightly longer running things though; I
doubt that can be made accurate enough to instrument buffer I/O.

What about random sampling, i.e. "measure just 5% of the events" or
something like that? Sure, it's not exact but it significantly reduces the
overhead. And it might be a config parameter, so the user might decide how
precise results are needed, and even consider how fast the clocks are.

Something more ambitious like the v$ stuff would also take care of what
you're doing here; I'm not sure that what you've done helps build it
though. Please don't take that personally. Part of one of my own
instrumentation patches recently was rejected out of hand for the same
reason, just not being general enough.

Yes, that'd be significant improvement. The wait-event stuff is very
useful and changes the tuning significantly.

Tomas

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:

The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that. A more interesting question is "how
much load does this feature create even when it's turned off?".

The other big problem for a patch of this sort is that it would bloat
the stats file. I think we really need to come up with a more
scalable alternative to the current system, but I haven't looked at the
current system in enough detail to have a clear feeling about what
such an alternative would look like.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Greg Stark
stark@mit.edu
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do it
this round. The real problem with this whole area is that we know there
are systems floating around where the amount of time taken to grab
timestamps like this is just terrible.

I believe on most systems with modern Linux kernels gettimeofday and its
ilk will be a vsyscall and nearly as fast as a regular function call.

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here, there were also concerns
about what the data would be used for.

I speculated about doing that but never did. I had an experimental patch
using mincore to do what you describe but it wasn't intended for production
code I think. The only real patch was to use getrusage which I still intend
to commit but it doesn't tell you the time spent in I/O -- though it does
tell you the sys time which should be similar.

#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Greg Stark (#5)
Re: Patch: add timing of buffer I/O requests

On 28 November 2011, 15:40, Greg Stark wrote:

On Nov 28, 2011 8:55 AM, "Greg Smith" <greg@2ndquadrant.com> wrote:

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to do it
this round. The real problem with this whole area is that we know there
are systems floating around where the amount of time taken to grab
timestamps like this is just terrible.

I believe on most systems with modern Linux kernels gettimeofday and its
ilk will be a vsyscall and nearly as fast as a regular function call.

AFAIK a vsyscall should be faster than a regular syscall. It does not need
to switch to kernel space at all; it "just" reads the data from a shared
page. The problem is that this is Linux-specific - for example FreeBSD
does not have vsyscall at all (it's actually one of the Linux-isms
mentioned here: http://wiki.freebsd.org/AvoidingLinuxisms).

There's also the vDSO, which (among other things) uses vsyscall if
available, or the best implementation available. So there are platforms
that do not provide vsyscall, and on those it'd be just as slow as a
regular syscall :(

I wouldn't expect a patch that works fine on Linux but not on other
platforms to be accepted, unless there's a compile-time configure switch
(--with-timings) that allows disabling it.

Another option would be to reimplement the vsyscall even on platforms
that don't provide it. The principle is actually quite simple - allocate
a shared memory region, store the current time there, and update it
whenever a clock interrupt happens. This is basically what Greg suggested
in one of the previous posts, where "regularly" means "on every
interrupt". Greg was worried about the precision, but this should be just
fine I guess. It's the precision you get on Linux, anyway ...

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here, there were also concerns
about what the data would be used for.

The difficulty of distinguishing whether the reads were cached or not is
the price we pay for using the filesystem cache instead of managing our own.
Not sure if this can be solved just by measuring the latency - with
spinners it's quite easy, the differences are rather huge (and it's not
difficult to derive that even from pgbench log). But with SSDs, multiple
tablespaces on different storage, etc. it gets much harder.

Tomas

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: Patch: add timing of buffer I/O requests

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:

The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that. A more interesting question is "how
much load does this feature create even when it's turned off?".

Right. I see that the code already has a switch to skip the
gettimeofday calls, so the objection is only problematic if the added
overhead is significant even with the switch off. I would worry mainly
about the added time/space to deal with the extra stats counters.

The other big problem for a patch of this sort is that it would bloat
the stats file.

Yes. Which begs the question of why we need to measure this per-table.
I would think per-tablespace would be sufficient.

regards, tom lane

#8 Martijn van Oosterhout
kleptog@svana.org
In reply to: Greg Smith (#2)
Re: Patch: add timing of buffer I/O requests

On Sun, Nov 27, 2011 at 11:54:38PM -0800, Greg Smith wrote:

On 11/27/2011 04:39 PM, Ants Aasma wrote:

On the AMD I saw about 3% performance drop with timing enabled. On the
Intel machine I couldn't measure any statistically significant change.

Oh no, it's party pooper time again. Sorry I have to be the one to
do it this round. The real problem with this whole area is that we
know there are systems floating around where the amount of time
taken to grab timestamps like this is just terrible. I've been
annoyed enough by that problem to spend some time digging into why
that is--seems to be a bunch of trivia around the multiple ways to
collect time info on x86 systems--and after this CommitFest is over

Something good to know: in Linux the file
/sys/devices/system/clocksource/clocksource0/current_clocksource
lists the current clock source, and
/sys/devices/system/clocksource/clocksource0/available_clocksource
lists the available clock sources. By writing a name into the first file
you can switch them. That way you may be able to quantify the effects on
a single machine.

Learned the hard way while tracking clock-skew on a multicore system.
The hpet may not be the fastest (that would be the cpu timer), but it's
the fastest (IME) that gives guaranteed monotonic time.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#9 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tomas Vondra (#6)
Re: Patch: add timing of buffer I/O requests

"Tomas Vondra" <tv@fuzzy.cz> writes:

Another option would be to reimplement the vsyscall, even on platforms
that don't provide it. The principle is actually quite simple - allocate a
shared memory, store there a current time and update it whenever a clock
interrupt happens. This is basically what Greg suggested in one of the
previous posts, where "regularly" means "on every interrupt". Greg was
worried about the precision, but this should be just fine I guess. It's
the precision you get on Linux, anyway ...

That sounds good for other interesting things, which entails being able
to have timing information attached to the XID sequence. If we go this
way, how far are we from having a ticker in PostgreSQL?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#10 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tomas Vondra (#6)
Re: Patch: add timing of buffer I/O requests

On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and figure out
whether reads were cached or not--but I believe it withered rather than
being implemented mainly because it ran into the same fundamental
roadblocks here. My memory could be wrong here, there were also concerns
about what the data would be used for.

The difficulty when distinguishing whether the reads were cached or not is
the price we pay for using filesystem cache instead of managing our own.
Not sure if this can be solved just by measuring the latency - with
spinners it's quite easy, the differences are rather huge (and it's not
difficult to derive that even from pgbench log). But with SSDs, multiple
tablespaces on different storage, etc. it gets much harder.

True, but every use case for this information I can think of ultimately only cares about how long it took to perform some kind of IO; it doesn't *really* care about whether it was cached. So in that context, we don't really care if SSDs are fast enough that they look like cache, because that means they're performing (essentially) the same as cache.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dimitri Fontaine (#9)
Re: Patch: add timing of buffer I/O requests

On 28.11.2011 22:32, Dimitri Fontaine wrote:

"Tomas Vondra" <tv@fuzzy.cz> writes:

Another option would be to reimplement the vsyscall, even on platforms
that don't provide it. The principle is actually quite simple - allocate a
shared memory, store there a current time and update it whenever a clock
interrupt happens. This is basically what Greg suggested in one of the
previous posts, where "regularly" means "on every interrupt". Greg was
worried about the precision, but this should be just fine I guess. It's
the precision you get on Linux, anyway ...

That sounds good for other interesting things, which entails being able
to have timing information attached to the XID sequence. If we go this
way, how far are we from having a ticker in PostgreSQL?

I'm not sure. On Linux/x86 this is already done, but my knowledge of
kernel development is rather limited, especially when it comes to other
OSes and platforms. E.g. I'm not sure why it's not available in FreeBSD
on x86, I guess it's rather "we don't want it" than "it's not possible."

In Linux sources, the most interesting pieces are probably these:

1) arch/x86/include/asm/vgtod.h - that's the shared memory structure

2) arch/x86/kernel/vsyscall_64.c - this is how the memory is read
(do_vgettimeofday)

3) arch/x86/kernel/vsyscall_64.c - this is how the memory is updated
(update_vsyscall)

4) kernel/time/timekeeping.c - do_settimeofday (calls update_vsyscall)

5) drivers/rtc/class.c (and other) RTC drivers call do_settimeofday

Tomas

#12 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jim Nasby (#10)
Re: Patch: add timing of buffer I/O requests

On 29.11.2011 02:14, Jim Nasby wrote:

On Nov 28, 2011, at 9:29 AM, Tomas Vondra wrote:

I recall a patch similar to this one was submitted by Greg Stark some
time ago. It used the info for different reasons--to try and
figure out whether reads were cached or not--but I believe it
withered rather than being implemented mainly because it ran into
the same fundamental roadblocks here. My memory could be wrong
here, there were also concerns about what the data would be used
for.

The difficulty when distinguishing whether the reads were cached or
not is the price we pay for using filesystem cache instead of
managing our own. Not sure if this can be solved just by measuring
the latency - with spinners it's quite easy, the differences are
rather huge (and it's not difficult to derive that even from
pgbench log). But with SSDs, multiple tablespaces on different
storage, etc. it gets much harder.

True, but every use case for this information I can think of
ultimately only cares about how long it took to perform some kind of
IO; it doesn't *really* care about whether it was cached. So in that
context, we don't really care if SSDs are fast enough that they look
like cache, because that means they're performing (essentially) the
same as cache.

Yup, that's right. The wait times are generally much more interesting
than the cached/not cached ratio.

Tomas

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Fontaine (#9)
Re: Patch: add timing of buffer I/O requests

Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:

That sounds good for other interesting things, which entails being able
to have timing information attached to the XID sequence. If we go this
way, how far are we from having a ticker in PostgreSQL?

Those of us who are trying to get rid of idle-state process wakeups will
protest any such thing with vigor.

regards, tom lane

#14 Ants Aasma
ants.aasma@cybertec.at
In reply to: Tom Lane (#13)
Re: Patch: add timing of buffer I/O requests

Sorry for taking so long to respond, had a pretty busy day at work. Anyway..

On Mon, Nov 28, 2011 at 9:54 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Oh no, it's party pooper time again.  Sorry I have to be the one to do it
this round.  The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.  I've been annoyed enough by that problem to
spend some time digging into why that is--seems to be a bunch of trivia
around the multiple ways to collect time info on x86 systems--and after this
CommitFest is over I was already hoping to dig through my notes and start
quantifying that more.  So you can't really prove the overhead of this
approach is acceptable just by showing two examples; we need to find one of
the really terrible clocks and test there to get a real feel for the
worst-case.

Sure, I know that the timing calls might be awfully slow. That's why I turned
it off by default. I saw that track_functions was already using this, so I
figured it was ok to have it potentially run very slowly.

-Document the underlying problem and known workarounds, provide a way to
test how bad the overhead is, and just throw our hands up and say "sorry,
you just can't instrument like this" if someone has a slow system.

Some documentation about potential problems would definitely be good.
Same goes for a test tool. ISTM that fast accurate timing is just not
possible on all supported platforms. That doesn't seem like a good enough
justification to refuse implementing something useful for the majority of
platforms that do support it, as long as it doesn't cause regressions for
those that don't, or add significant code complexity.

-Have one of the PostgreSQL background processes keep track of a time
estimate on its own, only periodically pausing to sync against the real
time.  Then most calls to gettimeofday() can use that value instead.  I was
thinking of that idea for slightly longer running things though; I doubt
that can be made accurate enough to instrument buffer I/O.

This would limit it to cases where hundreds of milliseconds of jitter or
more don't matter all that much.

And while I hate to kick off massive bike-shedding in your direction, I'm
also afraid this area--collecting stats about how long individual operations
take--will need a much wider ranging approach than just looking at the
buffer cache ones.  If you step back and ask "what do people expect here?",
there's a pretty large number who really want something like Oracle's
v$session_wait  and v$system_event interface for finding the underlying
source of slow things.  There's enough demand for that that EnterpriseDB has
even done some work in this area too; what I've been told about it suggests
the code isn't a great fit for contribution to community PostgreSQL though.
 Like I said, this area is really messy and hard to get right.

Yeah, something like that should probably be something to strive for. I'll
ponder a bit more about resource and latency tracking in general. Maybe the
question here should be about the cost/benefit ratio of having some utility
now vs. maintaining/deprecating the user-visible interface when a more
general framework turns up.

Something more ambitious like the v$ stuff would also take care of what
you're doing here; I'm not sure that what you've done helps build it though.
 Please don't take that personally.  Part of one of my own instrumentation
patches recently was rejected out of hand for the same reason, just not
being general enough.

No problem, I understand that half-way solutions can be more trouble than
they're worth. I actually built this to help with performance testing an
application and thought it would be an interesting experience to try to
give the community back something.

On Mon, Nov 28, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:

I believe on most systems with modern Linux kernels gettimeofday and its
ilk will be a vsyscall and nearly as fast as a regular function call.

clock_gettime() has been implemented in the vDSO since 2.6.23. gettimeofday()
has been user context callable since before git shows any history (2.6.12).

On Mon, Nov 28, 2011 at 5:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The other big problem for a patch of this sort is that it would bloat
the stats file.

Yes.  Which begs the question of why we need to measure this per-table.
I would think per-tablespace would be sufficient.

Yeah, I figured that this is something that should be discussed. I
implemented per-table collection because I thought it might be useful for
tools to pick up and show a quick overview of which tables are causing
the most I/O overhead for queries.

On Mon, Nov 28, 2011 at 8:10 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:

Something good to know: in Linux the file
/sys/devices/system/clocksource/clocksource0/current_clocksource
lists the current clock source, and
/sys/devices/system/clocksource/clocksource0/available_clocksource
lists the available clock sources. By writing a name into the first file
you can switch them. That way you may be able to quantify the effects on
a single machine.

Learned the hard way while tracking clock-skew on a multicore system.
The hpet may not be the fastest (that would be the cpu timer), but it's
the fastest (IME) that gives guaranteed monotonic time.

The Linux kernel seems to go pretty far out of its way to ensure that the
TSC (CPU timestamp counter) based clocksource returns monotonic values,
including actually testing that it does [1]. If the hardware doesn't
support stable and consistent TSC values, the TSC isn't used as a clock
source.

Of course trying to keep it monotonic doesn't mean succeeding. I thought
about inserting a sanity check. But as the current instrumentation doesn't
use one and it would catch errors only in one direction, biasing the long
term average, I decided against it.

Because this is non-essential instrumentation, I don't see an issue with
it returning bogus information when the system clock is broken. At least
it seems that no one has complained about the same issue in
track_functions. The only complaint I found is that it's off by default.

On Mon, Nov 28, 2011 at 5:29 PM, Tomas Vondra <tv@fuzzy.cz> wrote:

Another option would be to reimplement the vsyscall, even on platforms
that don't provide it. The principle is actually quite simple - allocate a
shared memory, store there a current time and update it whenever a clock
interrupt happens. This is basically what Greg suggested in one of the
previous posts, where "regularly" means "on every interrupt". Greg was
worried about the precision, but this should be just fine I guess. It's
the precision you get on Linux, anyway ...

On modern platforms you really do get microsecond precision.
Even more, if you use clock_gettime(CLOCK_MONOTONIC), you get nanosecond
precision and avoid issues with someone changing the system time while
you're timing. This precision does require OS and hardware cooperation,
because of CPU offsets, TSC's changing frequencies, stopping, etc.

--
Ants Aasma

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc_sync.c#L143

#15 Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#4)
Re: Patch: add timing of buffer I/O requests

On 11/28/2011 05:51 AM, Robert Haas wrote:

On Mon, Nov 28, 2011 at 2:54 AM, Greg Smith<greg@2ndquadrant.com> wrote:

The real problem with this whole area is that we know there are
systems floating around where the amount of time taken to grab timestamps
like this is just terrible.

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that.

It's not quite that simple though. Releasing a performance measurement
feature that itself can perform terribly under undocumented conditions
has a wider downside than that.

Consider that people aren't going to turn it on until they are already
overloaded. If that has the potential to completely tank performance,
we better make sure that area is at least explored usefully first; the
minimum diligence should be to document that fact and make suggestions
for avoiding or testing it.

Instrumentation that can itself become a performance problem is an
advocacy problem waiting to happen. As I write this I'm picturing such
an encounter resulting in an angry blog post, about how this proves
PostgreSQL isn't usable for serious systems because someone sees massive
overhead turning this on. Right now the primary exposure to this class
of issue is EXPLAIN ANALYZE. When I was working on my book, I went out
of my way to find a worst case for that [1], and that turned out to be a
query that went from 7.994ms to 69.837ms when instrumented. I've been
meaning to investigate what was up there since finding that one. The
fact that we already have one such problem exposed worries me; I'd
really prefer not to have two.

[1]: (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#15)
Re: Patch: add timing of buffer I/O requests

Greg Smith <greg@2ndQuadrant.com> writes:

On 11/28/2011 05:51 AM, Robert Haas wrote:

Assuming the feature is off by default (and I can't imagine we'd
consider anything else), I don't see why that should be cause for
concern. If the instrumentation creates too much system load, then
don't use it: simple as that.

It's not quite that simple though. Releasing a performance measurement
feature that itself can perform terribly under undocumented conditions
has a wider downside than that.

Yeah, that's a good point, and the machines on which this would suck
are exactly the ones where EXPLAIN ANALYZE creates very large overhead.
We don't seem to see a lot of complaints about that anymore, but we do
still see some ... and yes, it's documented that EXPLAIN ANALYZE can add
significant overhead, but that doesn't stop the questions.

Instrumentation that can itself become a performance problem is an
advocacy problem waiting to happen. As I write this I'm picturing such
an encounter resulting in an angry blog post, about how this proves
PostgreSQL isn't usable for serious systems because someone sees massive
overhead turning this on.

Of course, the rejoinder could be that if you see that, you're not
testing on serious hardware. But still, I take your point.

Right now the primary exposure to this class
of issue is EXPLAIN ANALYZE. When I was working on my book, I went out
of my way to find a worst case for that[1],
[1] (Dell Store 2 schema, query was "SELECT count(*) FROM customers;")

That's pretty meaningless without saying what sort of clock hardware
was on the machine...

regards, tom lane

#17Ants Aasma
ants.aasma@cybertec.at
In reply to: Ants Aasma (#1)
Re: Patch: add timing of buffer I/O requests

Here's the second version of the I/O timings patch. Changes from the
previous version:

* Rebased against master.
* Added the missing pg_stat_statements upgrade script that I
accidentally left out from the previous version.
* Added a tool to test timing overhead under contrib/pg_test_timing

I hope that having a tool to measure the overhead and check the sanity
of clock sources is enough to answer the worries about the potential
performance hit. We could also check that the clock source is fast
enough on start-up/when the guc is changed, but that seems a bit too
much and leaves open the question about what is fast enough.

About issues with stats file bloat - if it really is a blocker, I can
easily rip out the per-table or even per-database stats fields. The
patch is plenty useful without them. It seemed like a useful tool for
overworked DBAs with limited amount of SSD space available to easily
figure out which tables and indexes would benefit most from fast
storage.

--
Ants Aasma

Attachments:

io-stats.v2.patch (text/x-patch, +571 -72)
#18Greg Smith
gsmith@gregsmith.com
In reply to: Ants Aasma (#17)
Re: Patch: add timing of buffer I/O requests

On 01/15/2012 05:14 PM, Ants Aasma wrote:

I hope that having a tool to measure the overhead and check the sanity
of clock sources is enough to answer the worries about the potential
performance hit. We could also check that the clock source is fast
enough on start-up/when the guc is changed, but that seems a bit too
much and leaves open the question about what is fast enough.

I've been thinking along those same lines--check at startup, provide
some guidance on the general range of what's considered fast vs. slow in
both the code and documentation. What I'm hoping to do here is split
your patch in half and work on the pg_test_timing contrib utility
first. That part answers some overdue questions about when EXPLAIN
ANALYZE can be expected to add a lot of overhead, which means it's
useful all on its own. I'd like to see that utility go into 9.2, along
with a new documentation section covering that topic. I'll write the
new documentation bit.

As far as the buffer timing goes, there is a lot of low-level timing
information I'd like to see the database collect. To pick a second
example with very similar mechanics, I'd like to know which queries
spend a lot of their time waiting on locks. The subset of time a
statement spends waiting just for commit related things is a third. The
industry standard term for these is wait events, as seen in Oracle,
MySQL, MS SQL Server, etc. That's so standard I don't see an
intellectual property issue with PostgreSQL using the same term. Talk
with a random person who is converting from Oracle to PostgreSQL and ask
them about their performance concerns; at least 3/4 of those
conversations mention being nervous about not having wait event data.

Right now, I feel the biggest hurdle to performance tuning PostgreSQL is
not having good enough built-in query log analysis tools. If the
pg_stat_statements normalization upgrade in the CF queue is committed,
that's enough to make me bump that to "solved well enough". After
clearing that hurdle, figuring out how to log, analyze, and manage
storage of wait events is the next biggest missing piece. One of my top
goals for 9.3 was to make sure that happened.

I don't think the long-term answer for how to manage wait event data is
to collect it as part of pg_stat_statements though. But I don't have a
good alternate proposal, while you've submitted a patch that actually
does something useful right now. I'm going to think some more about how
to reconcile all that. There is an intermediate point to consider as
well, which is just committing something that adjusts the core code to
make the buffer wait event data available. pg_stat_statements is easy
enough to continue work on outside of core. I could see a path where
that happens in parallel with adding a better core wait event
infrastructure, just to get the initial buffer wait info into people's
hands earlier.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#19Greg Smith
gsmith@gregsmith.com
In reply to: Greg Smith (#18)
Re: Patch: add timing of buffer I/O requests

Attached is the pg_test_timing utility portion of this submission,
broken out into its own patch. It's a contrib module modeled on
pg_test_fsync.

The documentation is still a bit rough; I'm not done with that yet. I
have included an example of good timing results, switching to a bad
clock source, and the resulting bad results. Code review found some
formatting nitpicks that I've already fixed: non-standard brace
locations and missing spaces in expressions were the main two.

This is now referenced by the existing cryptic documentation comment
around EXPLAIN ANALYZE, which says that overhead can be high because
gettimeofday is slow on some systems. Since this utility measures that
directly, I think it's a clear win to include it just for that purpose.
The fact that there are more places coming where timing overhead matters
is also true. But this existing one is already bad enough to justify
shipping something to help measure/manage it in my mind.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachments:

pg_test_timing-v2.patch (text/x-patch, +297 -1)
#20Greg Smith
gsmith@gregsmith.com
In reply to: Ants Aasma (#1)
Re: Patch: add timing of buffer I/O requests

Attached are updated versions of this feature without the pg_test_timing
tool part, since I broke that out into another discussion thread. I've
split the part that updates pg_stat_statistics out from the main feature
too; a separate patch is attached here (but I'm not reviewing that yet).
Lots of bitrot since this was submitted, and yes I noticed that I've
almost recreated earlier versions of this patch--by splitting off the
parts that were developed later.

Earlier discussion of this got sidetracked on a few things, partly my
fault. It's worth taking a look at what this provides before judging it
too much. It can demo well.

The stated purpose is helping figure out what relations are gobbling up
the most access time, presumably to optimize them and/or the storage
they are on. "What do I put onto SSD" is surely a popular request
nowadays. To check suitability for that, I decided to run the standard
pgbench test and see what it reported as consuming the most time.
Any answer other than "pgbench_accounts and to a lesser extent its
index" is a failing grade. I started with a clean database and OS cache
so I'd get real read timings:

$ psql -d pgbench -x -c "select
relname,heap_blks_read,heap_blks_hit,heap_blks_time,
idx_blks_read ,idx_blks_hit,idx_blks_time
from pg_statio_user_tables where idx_blks_read > 0
order by heap_blks_time desc"

relname | pgbench_accounts
heap_blks_read | 7879
heap_blks_hit | 43837
heap_blks_time | 151770
idx_blks_read | 7503
idx_blks_hit | 60484
idx_blks_time | 70968

relname | pgbench_tellers
heap_blks_read | 19
heap_blks_hit | 15856
heap_blks_time | 105
idx_blks_read | 11
idx_blks_hit | 15745
idx_blks_time | 62

relname | pgbench_branches
heap_blks_read | 11
heap_blks_hit | 32333
heap_blks_time | 77
idx_blks_read | 2
idx_blks_hit | 0
idx_blks_time | 9

Now, the first critical question to ask is "what additional information
is this providing above the existing counters?" After all, it's
possible to tell pgbench_accounts is the hotspot just from comparing
heap_blks_read, right? To really be useful, this would need to make it
obvious that reads from pgbench_accounts are slower than the other two,
because it's bigger and requires more seeking around to populate. That
should show up if we compute time per read numbers:

$ psql -d pgbench -x -c "select relname,
1.0 * heap_blks_time / heap_blks_read as time_per_read,
1.0 * idx_blks_time / idx_blks_read as time_per_idx_read
from pg_statio_user_tables where idx_blks_read > 0
order by heap_blks_time desc"

relname | pgbench_accounts
time_per_read | 19.2625967762406397
time_per_idx_read | 9.4586165533786485

relname | pgbench_tellers
time_per_read | 5.5263157894736842
time_per_idx_read | 5.6363636363636364

relname | pgbench_branches
time_per_read | 7.0000000000000000
time_per_idx_read | 4.5000000000000000

This run looks useful at providing the data wished for--that read times
are slower per capita from the accounts table. The first time I tried
this I got a bizarrely high number for pgbench_branches.heap_blks_time;
I'm not sure how reliable this is yet. One problem that might be easy
to fix is that the write timing info doesn't show in any of these system
views, only in EXPLAIN and statement level ones.

I still think a full wait timing interface is the right long-term
direction here. It's hard to reject this idea when it seems to be
working right now though, while more comprehensive wait storage is still
at least a release off. Opinions welcome, I'm still juggling this
around now that I have it working again.

Some implementation notes. This currently fails regression test
create_function_3; I haven't looked into why yet. I've confirmed that on
a system where timing is cheap, this feature is cheap too. On something
touching real data, not just a synthetic test moving memory around, Ants
couldn't measure any overhead on a server similar to mine; I can't
either. Yes, this is going to gobble up more room for statistics.

The track_iotiming GUC seems to work as expected. Off by default, can
turn it on in a session and see that session's work get timed, and it
toggles on a config reload. That's everything needed to turn it on only
selectively; the main penalty you pay all the time is the stats bloat.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachments:

io-stats.v3.patch (text/x-patch, +242 -24)
io-stats-statment.v3.patch (text/x-patch, +105 -48)
#21Ants Aasma
ants.aasma@cybertec.at
In reply to: Greg Smith (#20)
#22Ants Aasma
ants.aasma@cybertec.at
In reply to: Ants Aasma (#21)
#23Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Ants Aasma (#21)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#22)
#25Ants Aasma
ants.aasma@cybertec.at
In reply to: Robert Haas (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#25)
#27Ants Aasma
ants.aasma@cybertec.at
In reply to: Robert Haas (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#26)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#27)
#30Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#26)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#30)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#29)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#33)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#34)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#36)
#38Magnus Hagander
magnus@hagander.net
In reply to: Robert Haas (#36)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#38)
#40Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#37)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#42)
#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#43)
#45Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#44)
In reply to: Robert Haas (#45)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#45)
#48Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#39)
#49Peter Geoghegan
In reply to: Greg Smith (#48)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#49)
#51Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#49)
#52Peter Geoghegan
In reply to: Robert Haas (#51)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#52)
#54Peter Geoghegan
In reply to: Tom Lane (#53)
#55Kenneth Marshall
In reply to: Peter Geoghegan (#54)
#56Robert Haas
robertmhaas@gmail.com
In reply to: Kenneth Marshall (#55)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#56)
#58Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#57)
In reply to: Robert Haas (#56)
#60Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Greg Smith (#48)
#61Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#33)
#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#60)
#63Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#56)
#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#63)
In reply to: Tom Lane (#64)
#66Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#64)
#67Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#62)
#68Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#67)
#69Peter Geoghegan
In reply to: Tom Lane (#47)
In reply to: Bruce Momjian (#63)
#71Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#69)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#71)
#73Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#72)
#74Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#73)
#75Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#74)
#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#75)
#77Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#76)
#78Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#74)
#79Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#78)
#80Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#78)
#81Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#73)
#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#81)
#83Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#82)
#84Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#77)
#85Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#84)
#86Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#84)
#87Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#86)
#88Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#86)
#89Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#88)