RFC: replace pg_stat_activity.waiting with something more descriptive
When a PostgreSQL system wedges, or when it becomes dreadfully slow
for some reason, I often find myself relying on tools like strace,
gdb, or perf to figure out what is happening. This doesn't tend to
instill customers with confidence; they would like (quite
understandably) a process that doesn't require installing developer
tools on their production systems, and doesn't require a developer to
interpret the results, and perhaps even something that they could
connect up to PEM or Nagios or whatever alerting system they are
using.
There are obviously many ways that we might think about improving
things here, but what I'd like to do is try to put some better
information in pg_stat_activity, so that when a process is not
running, users can get some better information about *why* it's not
running. The basic idea is that pg_stat_activity.waiting would be
replaced by a new column pg_stat_activity.wait_event, which would
display the reason why that backend is waiting. This wouldn't be a
free-form text field, because that would be too expensive to populate.
Instead it would contain a "reason code" which would be chosen from a
list of reason codes and translated to text for display. Internally,
pgstat_report_waiting() would be changed to take an integer argument
rather than a Boolean (possibly uint8 would be enough, certainly
uint16 would be), and called from more places. It would continue to
use an ordinary store into shared memory, with no atomic ops or
locking.
Currently, the only time we report a process as waiting is when it is
waiting for a heavyweight lock. I'd like to make that somewhat more
fine-grained, by reporting the type of heavyweight lock it's awaiting
(relation, relation extension, transaction, etc.). Also, I'd like to
report when we're waiting for a lwlock, and report either the specific
fixed lwlock for which we are waiting, or else the type of lock (lock
manager lock, buffer content lock, etc.) for locks of which there is
more than one. I'm less sure about this next part, but I think we
might also want to report ourselves as waiting when we are doing an OS
read or an OS write, because it's pretty common for people to think
that a PostgreSQL bug is to blame when in fact it's the operating
system that isn't servicing our I/O requests very quickly. We could
also invent codes for things like "I'm doing a pg_usleep because I've
exceeded max_spins_per_delay" and "I'm waiting for a cleanup lock on a
buffer" and maybe a few others.
I realize that in many cases these states will be quite transient and
you won't see them in pg_stat_activity for very long before they
vanish; whether you can catch them at all is quite uncertain. It's
not my goal here to create some kind of a performance counter system,
even though that would be valuable and could possibly be based on the
same infrastructure, but rather just to create a very simple system
that lets people know, without any developer tools, what is causing a
backend that has accepted a query and not yet returned a result to be
off-CPU rather than on-CPU. In the cases where there are many
backends, you may be able to see non-NULL results often enough to get
a sense of where the problem is; or in the case where there's one
backend that is persistently stuck, you will hopefully be able to tell
where it's stuck.
Comments?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 06/22/2015 10:37 AM, Robert Haas wrote:
I'm less sure about this next part, but I think we
might also want to report ourselves as waiting when we are doing an OS
read or an OS write, because it's pretty common for people to think
that a PostgreSQL bug is to blame when in fact it's the operating
system that isn't servicing our I/O requests very quickly. We could
also invent codes for things like "I'm doing a pg_usleep because I've
exceeded max_spins_per_delay" and "I'm waiting for a cleanup lock on a
buffer" and maybe a few others.
This would be a great improvement. Many, many times the problem really
has nothing to do with PostgreSQL: it is a relation falling out of
cache, swapping, or a process waiting for IO to be allocated to it. If it
were possible to have a view within PostgreSQL that allows us to see that,
it would be absolutely awesome.
It would be great if we could somehow monitor what the postgresql
processes are doing within PostgreSQL. Imagine if we had pgsar ...
Sincerely,
jD
--
Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.
On Mon, Jun 22, 2015 at 1:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
and doesn't require a developer to
interpret the results,
[...]
We could
also invent codes for things like "I'm doing a pg_usleep because I've
exceeded max_spins_per_delay" and "I'm waiting for a cleanup lock on a
buffer" and maybe a few others.
In addition to the codes themselves, I think it would aid less-experienced
operators if we provided a meta-data categorization of the codes -
something like I/O Sub-System, Storage Maintenance, Concurrency, etc.
There could be a section in the documentation with these topics as section
headings, with a listing and explanation of each of the possible codes
described within.
The meta-information is already embedded within the codes/descriptions, but
explicitly pulling it out would be, IMO, more user-friendly and would likely
also aid in triage and speed of recognition when reading the corresponding
code/description.
David J.
On Mon, Jun 22, 2015 at 1:59 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
In addition to the codes themselves I think it would aid less-experienced
operators if we would provide a meta-data categorization of the codes.
Something like, I/O Sub-System; Storage Maintenance; Concurrency, etc.
There could be a section in the documentation with these topics as section
headings and a listing and explanation of each of the possible code would
then be described within.
The meta-information is already embedded within the code/descriptions but
explicitly pulling them out would be, IMO, more user-friendly and likely
also aid in triage and speed-of-recognition when reading the corresponding
code/description.
I was thinking that the codes would probably be fairly straightforward
renderings of the underlying C identifiers, e.g.:
Lock (Relation)
Lock (Relation Extension)
Lock (Page)
...
...
LWLock (ShmemIndexLock)
LWLock (OidGenLock)
LWLock (XidGenLock)
LWLock (ProcArrayLock)
...
...
Spin Lock Delay
Buffer Cleanup Lock
We'd then have to figure out how to document that stuff.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 22, 2015 at 4:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 22, 2015 at 1:59 PM, David G. Johnston
<david.g.johnston@gmail.com> wrote:
In addition to the codes themselves I think it would aid less-experienced
operators if we would provide a meta-data categorization of the codes.
Something like, I/O Sub-System; Storage Maintenance; Concurrency, etc.
There could be a section in the documentation with these topics as section
headings and a listing and explanation of each of the possible code would
then be described within.
The meta-information is already embedded within the code/descriptions but
explicitly pulling them out would be, IMO, more user-friendly and likely
also aid in triage and speed-of-recognition when reading the corresponding
code/description.
I was thinking that the codes would probably be fairly straightforward
renderings of the underlying C identifiers, e.g.:
Lock (Relation)
Lock (Relation Extension)
Lock (Page)
...
...
LWLock (ShmemIndexLock)
LWLock (OidGenLock)
LWLock (XidGenLock)
LWLock (ProcArrayLock)
...
...
Spin Lock Delay
Buffer Cleanup Lock
We'd then have to figure out how to document that stuff.
Just tossing stuff at the wall...
CREATE TABLE pg_stat_wait_event_info (
    wait_event integer PRIMARY KEY,
    category text,            -- possibly a FK to a description-holding table too
    event_type text,          -- Lock, LWLock, etc. ... or something higher-level; was
                              -- pondering whether ltree could be brought into core and used here...
    event_documentation text  -- asciidoc or something similar
);
Add psql commands:
\dproc [ proc_id] --runs pg_stat_activity more-or-less
\dproc+ proc_id --shows the event information w/ description; and ideally
info from pg_locks among other possibilities
That said, the same documentation should be made available online as well -
but having this allows tools to make use of the info to put the documentation
closer to the user.
David J.
On Mon, Jun 22, 2015 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
When a PostgreSQL system wedges, or when it becomes dreadfully slow
for some reason, I often find myself relying on tools like strace,
gdb, or perf to figure out what is happening. This doesn't tend to
instill customers with confidence; they would like (quite
understandably) a process that doesn't require installing developer
tools on their production systems, and doesn't require a developer to
interpret the results, and perhaps even something that they could
connect up to PEM or Nagios or whatever alerting system they are
using.
There are obviously many ways that we might think about improving
things here, but what I'd like to do is try to put some better
information in pg_stat_activity, so that when a process is not
running, users can get some better information about *why* it's not
running. The basic idea is that pg_stat_activity.waiting would be
replaced by a new column pg_stat_activity.wait_event, which would
display the reason why that backend is waiting. This wouldn't be a
free-form text field, because that would be too expensive to populate.
Instead it would contain a "reason code" which would be chosen from a
list of reason codes and translated to text for display.
Instead of changing the column, can't we add a new one? Adjusting
columns in PSA requires the innumerable queries written against it to
be adjusted along with all the wiki instructions to dev ops for
emergency stuck query detection etc etc. I would also prefer to
query 'waiting' in some cases, especially when in emergency
situations; it's faster to type.
merlin
On Mon, Jun 22, 2015 at 4:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
Instead of changing the column, can't we add a new one? Adjusting
columns in PSA requires the innumerable queries written against it to
be adjusted along with all the wiki instructions to dev ops for
emergency stuck query detection etc etc. I would also prefer to
query 'waiting' in some cases, especially when in emergency
situations; it's faster to type.
If people feel strongly about backward compatibility, yes, we can do
that. However, if waiting continues to mean "on a heavyweight lock"
for backward compatibility, then you could sometimes have waiting =
false but wait_state non-null. That seems confusing enough to be a
bad plan, at least to me.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Merlin Moncure <mmoncure@gmail.com> writes:
On Mon, Jun 22, 2015 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
... The basic idea is that pg_stat_activity.waiting would be
replaced by a new column pg_stat_activity.wait_event, which would
display the reason why that backend is waiting.
Instead of changing the column, can't we add a new one? Adjusting
columns in PSA requires the innumerable queries written against it to
be adjusted along with all the wiki instructions to dev ops for
emergency stuck query detection etc etc.
+1. Removing the boolean column seems like it will arbitrarily break
a whole lot of client-side code, for not-very-adequate reasons.
regards, tom lane
On 6/22/15 12:37 PM, Robert Haas wrote:
It's
not my goal here to create some kind of a performance counter system,
even though that would be valuable and could possibly be based on the
same infrastructure, but rather just to create a very simple system
that lets people know, without any developer tools, what is causing a
backend that has accepted a query and not yet returned a result to be
off-CPU rather than on-CPU.
Ilya Kosmodemiansky presented such a system at pgCon[1], and hopes to
submit an initial patch in the coming weeks. The general idea was to do
something similar to what you're describing (though, I believe even more
granular) and have a bgworker accumulating that information.
[1]: http://www.pgcon.org/2015/schedule/events/809.en.html
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Jun 23, 2015 at 2:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 22, 2015 at 4:40 PM, Merlin Moncure <mmoncure@gmail.com>
wrote:
Instead of changing the column, can't we add a new one? Adjusting
columns in PSA requires the innumerable queries written against it to
be adjusted along with all the wiki instructions to dev ops for
emergency stuck query detection etc etc. I would also prefer to
query 'waiting' in some cases, especially when in emergency
situations; it's faster to type.If people feel strongly about backward compatibility, yes, we can do
that. However, if waiting continues to mean "on a heavyweight lock"
for backward compatibility, then you could sometimes have waiting =
false but wait_state non-null. That seems confusing enough to be a
bad plan, at least to me.
That's right. If we leave 'waiting' as it is for the sake of backward
compatibility, then it will be confusing after we add wait_event to
pg_stat_activity; and if we change it such that waiting will be true for
any kind of wait_event (or remove waiting entirely), then it will break
backward compatibility. So we have the below alternatives here:
1. Remove/Change 'waiting' in pg_stat_activity and break backward
compatibility. I think we should try to avoid going via this route.
2. Add 2 new columns to pg_stat_activity:
waiting_resource - true for waits other than heavyweight locks, false
otherwise
wait_event - description code for the wait event
3. Add new view 'pg_stat_wait_event' with following info:
pid - process id of this backend
waiting - true for any form of wait, false otherwise
wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
wait_event - Lock (Relation), Lock (Relation Extension), etc
Do you think 2nd or 3rd could be viable way to proceed for this feature?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-06-25 16:07:45 +0530, Amit Kapila wrote:
On Tue, Jun 23, 2015 at 2:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
If people feel strongly about backward compatibility, yes, we can do
that. However, if waiting continues to mean "on a heavyweight lock"
for backward compatibility, then you could sometimes have waiting =
false but wait_state non-null. That seems confusing enough to be a
bad plan, at least to me.
That's right if we leave the 'waiting' as it is for the sake of backward
compatibility, then it will be confusing after we add wait_event to
pg_stat_activity and if we change it such that for any kind of wait_event
waiting will be true (or entirely remove waiting), then it will break the
backward compatibility. So we have below alternatives here:
1. Remove/Change 'waiting' in pg_stat_activity and break the backward
compatibility. I think we should try to avoid going via this route.
2. Add 2 new columns to pg_stat_activity
waiting_resource - true for waits other heavy wait locks, false
otherwise
wait_event - description code for the wait event
3. Add new view 'pg_stat_wait_event' with following info:
pid - process id of this backend
waiting - true for any form of wait, false otherwise
wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
wait_event - Lock (Relation), Lock (Relation Extension), etc
Do you think 2nd or 3rd could be viable way to proceed for this feature?
3) sounds best to me. Keeping 'waiting' even makes sense in that case,
because it'll tell whether wait_event_type is currently being blocked
on. We can leave the former contents in until the next thing is being
blocked...
Greetings,
Andres Freund
On Thu, Jun 25, 2015 at 4:16 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-06-25 16:07:45 +0530, Amit Kapila wrote:
On Tue, Jun 23, 2015 at 2:33 AM, Robert Haas <robertmhaas@gmail.com>
wrote:
If people feel strongly about backward compatibility, yes, we can do
that. However, if waiting continues to mean "on a heavyweight lock"
for backward compatibility, then you could sometimes have waiting =
false but wait_state non-null. That seems confusing enough to be a
bad plan, at least to me.
That's right if we leave the 'waiting' as it is for the sake of backward
compatibility, then it will be confusing after we add wait_event to
pg_stat_activity and if we change it such that for any kind of
wait_event
waiting will be true (or entirely remove waiting), then it will break
the
backward compatibility. So we have below alternatives here:
1. Remove/Change 'waiting' in pg_stat_activity and break the backward
compatibility. I think we should try to avoid going via this route.
2. Add 2 new columns to pg_stat_activity
waiting_resource - true for waits other heavy wait locks, false
otherwise
wait_event - description code for the wait event
3. Add new view 'pg_stat_wait_event' with following info:
pid - process id of this backend
waiting - true for any form of wait, false otherwise
wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
wait_event - Lock (Relation), Lock (Relation Extension), etc
Do you think 2nd or 3rd could be viable way to proceed for this feature?
3) sounds best to me. Keeping 'waiting' even makes sense in that case,
because it'll tell whether wait_event_type is currently being blocked
on. We can leave the former contents in until the next thing is being
blocked...
Won't leaving the former contents as they are (until the next thing is
being blocked) give misleading information? Currently we mark 'waiting'
as false as soon as the heavyweight lock wait is over, so following that
way sounds more appropriate; is there any reason why you want it done
differently than what we are doing currently?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi all
On Thu, Jun 25, 2015 at 12:37 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
2. Add 2 new columns to pg_stat_activity
waiting_resource - true for waits other heavy wait locks, false
otherwise
wait_event - description code for the wait event
3. Add new view 'pg_stat_wait_event' with following info:
pid - process id of this backend
waiting - true for any form of wait, false otherwise
wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
wait_event - Lock (Relation), Lock (Relation Extension), etc
Personally I think that tracking waits is not a good idea for
pg_stat_activity (at least in that straightforward manner). One
process can wait for lots of things between 2 samplings of
pg_stat_activity, and that sampling can be pretty useless.
My approach (about which I've given the talk mentioned by Jim, and which
I hope to finalize and submit within a few days) is a bit different and,
I believe, more useful:
1. Some sort of histogram of top waits within the entire database, by pid.
That will be an approximate one, because I hardly believe there is any
possibility of making a precise one without significant overhead.
2. Some cyclic buffer of more precise wait statistics inside each
worker. Sampling may be turned on if we see some issue in the histogram
(1) and want some more details.
--
Ilya Kosmodemiansky,
PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com
On 2015-06-25 16:26:39 +0530, Amit Kapila wrote:
Won't leaving former contents as it is (until the next thing is being
blocked) could give misleading information. Currently we mark 'waiting'
as false as soon as Heavy Weight Lock is over, so following that way
sounds more appropriate, is there any reason why you want it differently
than what we are doing currently?
But we don't do the same for query, so I don't think that says much. I
think it'd be useful because it gives you a bit more chance to see what
you blocked on last, even if the time the backend was blocked was very
short.
Greetings,
Andres Freund
On Thu, Jun 25, 2015 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-06-25 16:26:39 +0530, Amit Kapila wrote:
Won't leaving former contents as it is (until the next thing is being
blocked) could give misleading information. Currently we mark 'waiting'
as false as soon as Heavy Weight Lock is over, so following that way
sounds more appropriate, is there any reason why you want it differently
than what we are doing currently?
But we don't do the same for query, so I don't think that says much. I
think it'd be useful because it gives you a bit more chance to see what
you blocked on last, even if the time the backend was blocked was very
short.
Sure, that's another way to look at it; if you and/or others feel that is
better, then we can follow that way.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 25, 2015 at 4:28 PM, Ilya Kosmodemiansky <
ilya.kosmodemiansky@postgresql-consulting.com> wrote:
On Thu, Jun 25, 2015 at 12:37 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
2. Add 2 new columns to pg_stat_activity
waiting_resource - true for waits other heavy wait locks, false
otherwise
wait_event - description code for the wait event
3. Add new view 'pg_stat_wait_event' with following info:
pid - process id of this backend
waiting - true for any form of wait, false otherwise
wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait, etc
wait_event - Lock (Relation), Lock (Relation Extension), etc
Personally I think, that tracking waits is a not a good idea for
pg_stat_activity (at least in that straight-forward manner).
As mentioned in the initial mail by Robert, sometimes the system becomes
slow (whether due to contention on various kinds of locks, or to I/O, or
to some other such reason), and this kind of handy information via some
view is quite useful then. Recently, while working on one of the
performance/scalability projects, I needed to use gdb to attach to
different processes to see what they were doing (of course one can use
perf or some other utilities as well), and I found most of them were
waiting on some LWLocks; having such information available via a view
could be really useful, because sometimes at customer sites we can't use
gdb or perf to see what's going on.
One
process can wait for lots of things between 2 sampling of
pg_stat_activity and that sampling can be pretty useless.
Yeah, that's right, and I am not sure we should bother about such
scenarios, as the system is generally fine in such situations; however,
there are other cases where we can find most of the backends waiting on
one thing or another.
My approach (about which Ive had a talk mentioned by Jim and which I
hope to finalize and submit within a few days) is a bit different and
I believe is more useful:
1. Some sort of histogram of top waits within entire database by pid.
That will be an approximate one, because I hardly believe there is a
possibility to make a precise one without significant overhead.
2. Some cyclic buffer of more precise wait statistic inside each
worker. Sampling may be turned on if we see some issues in histogram
(1) and want to have some more details.
I think this is a somewhat different kind of utility, one which can give
us aggregated information; it will address a different kind of use case
and will have a somewhat more complex design, but it doesn't look
impossible to reuse part of what will be developed as part of this
proposal.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 25, 2015 at 1:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Personally I think, that tracking waits is a not a good idea for
pg_stat_activity (at least in that straight-forward manner).
As mentioned in the initial mail by Robert, that sometimes system becomes
slow (either due to contention on various kinds of locks or due to I/O or
due
to some other such reasons) that such kind of handy information via some
view is quite useful. Recently while working on one of the
performance/scalability
projects, I need to use gdb to attach to different processes to see what
they
are doing (of course one can use perf or some other utilities as well) and I
found most of them were trying to wait on some LW locks, now having such
an information available via view could be really useful, because sometimes
at customer sites, we can't use gdb or perf to see what's going on.
Yes, I understand such a use-case. But I hardly see how the suggested
design can help in such cases.
Basically, a DBA has two reasons to take a look on waits:
1. Long response time for particular query (or some type of queries).
In that case it is good to know how much time we spend on waiting for
particular resources we need to get query results
2. Overall bad performance of a database. We know, that something goes
wrong and consumes resources, we need to identify which backend, which
query causes the most of waits.
In both cases we need a) some historical data rather than a simple
snapshot, and b) some approach to aggregating it, because there will
certainly be a lot of events.
So my point is, we need a separate interface for waits, instead of
integrating them into pg_stat_activity. And it should be several interfaces:
one for approximate top of waiting sessions (like
active_sessions_history in oracle), one for detailed tracing of a
session, one for waits per resource statistics etc.
One
process can wait for lots of things between 2 sampling of
pg_stat_activity and that sampling can be pretty useless.
Yeah, that's right and I am not sure if we should bother about such
scenario's
as the system is generally fine in such situations, however there are other
cases where we can find most of the backends are waiting on one or other
thing.
I think the top-of-waiting-sessions approach covers both scenarios
(well, with only one exception: if we have billions of very short
waits and high contention is the problem).
However, it may be a good idea to identify the resource we are waiting
for from pg_stat_activity if we have been waiting for a long time.
I think this is some what different kind of utility which can give us
aggregated information and I think this will address different kind of
usecase and will have somewhat more complex design and it doesn't
look impossible to use part of what will be developed as part of this
proposal.
I think it is more than possible to mix both approaches. My proof of
concept currently covers only LWLocks - yours and Robert's is more
general - and certainly some wait event classification will be needed
for both approaches; it's much better to implement one rather than
two different ones.
At the least, I will be interested in reviewing your approach.
--
Ilya Kosmodemiansky,
PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com
On Thu, Jun 25, 2015 at 6:10 PM, Ilya Kosmodemiansky <
ilya.kosmodemiansky@postgresql-consulting.com> wrote:
On Thu, Jun 25, 2015 at 1:49 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Personally I think, that tracking waits is a not a good idea for
pg_stat_activity (at least in that straight-forward manner).
As mentioned in the initial mail by Robert, that sometimes system
becomes
slow (either due to contention on various kinds of locks or due to I/O
or
due
to some other such reasons) that such kind of handy information via some
view is quite useful. Recently while working on one of the
performance/scalability
projects, I need to use gdb to attach to different processes to see what
they
are doing (of course one can use perf or some other utilities as well)
and I
found most of them were trying to wait on some LW locks, now having such
an information available via view could be really useful, because
sometimes
at customer sites, we can't use gdb or perf to see what's going on.
Yes, I understand such a use-case. But I hardly see if suggested
design can help for such cases.
Basically, a DBA has two reasons to take a look on waits:
1. Long response time for particular query (or some type of queries).
In that case it is good to know how much time we spend on waiting for
particular resources we need to get query results
2. Overall bad performance of a database. We know, that something goes
wrong and consumes resources, we need to identify which backend, which
query causes the most of waits.
In both cases we need a) some historical data rather than simple
snapshot b) some approach how to aggregate it because the will be
certainly a lot of events
I think this thread's proposal will help in cases when a user/DBA wants to
see where the database is currently spending most of its time (during
waits). I understand that there is a use for historical information, which
can be helpful for the kinds of cases you have explained above.
I think it is more than possible to mix both approaches. My proof of
concept now is only about LWLocks - yours and Robert's is more
general, and certainly some wait event classification will be needed
for both approaches and its much better to implement one rather than
two different.
And at least, I will be interesting in reviewing your approach.
Okay, I am planning to spend time on this patch in the coming few days,
and when it's ready, maybe we can see if it could be useful
for what you are planning to do.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 6/22/15 1:37 PM, Robert Haas wrote:
Currently, the only time we report a process as waiting is when it is
waiting for a heavyweight lock. I'd like to make that somewhat more
fine-grained, by reporting the type of heavyweight lock it's awaiting
(relation, relation extension, transaction, etc.). Also, I'd like to
report when we're waiting for a lwlock, and report either the specific
fixed lwlock for which we are waiting, or else the type of lock (lock
manager lock, buffer content lock, etc.) for locks of which there is
more than one. I'm less sure about this next part, but I think we
might also want to report ourselves as waiting when we are doing an OS
read or an OS write, because it's pretty common for people to think
that a PostgreSQL bug is to blame when in fact it's the operating
system that isn't servicing our I/O requests very quickly.
Could that also cover waiting on network?
Andres Freund <andres@anarazel.de> writes:
On 2015-06-25 16:26:39 +0530, Amit Kapila wrote:
Won't leaving former contents as it is (until the next thing is being
blocked) could give misleading information. Currently we mark 'waiting'
as false as soon as Heavy Weight Lock is over, so following that way
sounds more appropriate, is there any reason why you want it differently
than what we are doing currently?
But we don't do the same for query, so I don't think that says much. I
think it'd be useful because it gives you a bit more chance to see what
you blocked on last, even if the time the backend was blocked was very
short.
The problem with the query analogy is that it's possible to tell whether
the query is active or not, by looking at the status column. We need to
avoid a situation where you can't tell if the wait status is current or
merely the last thing waited for.
At the moment I'm inclined to think we should put this on the back burner
until we see what Ilya submits. None of the proposals for changing
pg_stat_activity sound terribly clean to me.
regards, tom lane