Expose lock group leader pid in pg_stat_activity
Hello,
Guillaume (in Cc) recently pointed out [1] that it's currently not
possible to retrieve the list of parallel workers for a given backend
at the SQL level. His use case was to develop a plpgsql function to
sample a given query's wait events, but it's not hard to imagine other
useful use cases for this information, for instance doing some
analysis on the average number of workers per parallel query, or the
ratio of parallel queries. IIUC parallel queries are for now the only
user of lock groups, so this should work just fine.
I'm attaching a trivial patch to expose the group leader pid if any
in pg_stat_activity. Quick example of usage:
=# SELECT query, leader_pid,
array_agg(pid) filter(WHERE leader_pid != pid) AS members
FROM pg_stat_activity
WHERE leader_pid IS NOT NULL
GROUP BY query, leader_pid;
       query       | leader_pid |    members
-------------------+------------+---------------
 select * from t1; |      28701 | {28728,28732}
(1 row)
[1]: https://twitter.com/g_lelarge/status/1209486212190343168
Attachments:
pgsa_leader_pid-v1.diff (text/x-patch, +17 −4)
On Wed, Dec 25, 2019 at 7:03 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> Guillaume (in Cc) recently pointed out [1] that it's currently not
> possible to retrieve the list of parallel workers for a given backend
> at the SQL level. [...]
> I'm attaching a trivial patch to expose the group leader pid if any
> in pg_stat_activity.
And I just realized that I forgot to update rule.out, sorry about
that. v2 attached.
Attachments:
pgsa_leader_pid-v2.diff (text/x-patch, +22 −8)
On Wed, Dec 25, 2019 at 7:30 PM, Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Wed, Dec 25, 2019 at 7:03 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > I'm attaching a trivial patch to expose the group leader pid if any
> > in pg_stat_activity. [...]
> And I just realized that I forgot to update rule.out, sorry about
> that. v2 attached.
So I tried your patch this morning, and it works really well.
On a SELECT count(*), I got this:
SELECT leader_pid, pid, wait_event_type, wait_event, state, backend_type
FROM pg_stat_activity WHERE pid=111439 or leader_pid=111439;
┌────────────┬────────┬─────────────────┬──────────────┬────────┬─────────────────┐
│ leader_pid │  pid   │ wait_event_type │  wait_event  │ state  │  backend_type   │
├────────────┼────────┼─────────────────┼──────────────┼────────┼─────────────────┤
│     111439 │ 111439 │ LWLock          │ WALWriteLock │ active │ client backend  │
│     111439 │ 116887 │ LWLock          │ WALWriteLock │ active │ parallel worker │
│     111439 │ 116888 │ IO              │ WALSync      │ active │ parallel worker │
└────────────┴────────┴─────────────────┴──────────────┴────────┴─────────────────┘
(3 rows)
and this from a CREATE INDEX:
┌────────────┬────────┬─────────────────┬────────────┬────────┬─────────────────┐
│ leader_pid │  pid   │ wait_event_type │ wait_event │ state  │  backend_type   │
├────────────┼────────┼─────────────────┼────────────┼────────┼─────────────────┤
│     111439 │ 111439 │                 │            │ active │ client backend  │
│     111439 │ 118775 │                 │            │ active │ parallel worker │
└────────────┴────────┴─────────────────┴────────────┴────────┴─────────────────┘
(2 rows)
Anyway, it applies cleanly, it compiles, and it works. Documentation
is available. So it looks to me like it's good to go :)
--
Guillaume.
On Thu, Dec 26, 2019 at 9:08 AM Guillaume Lelarge
<guillaume@lelarge.info> wrote:
> [...]
> Anyway, it applies cleanly, it compiles, and it works. Documentation
> is available. So it looks to me like it's good to go :)
Thanks for the review Guillaume. Double checking the doc, I see that
I made a copy/pasto mistake in the new field name. Attached v3 should
be all good.
Attachments:
pgsa_leader_pid-v3.diff (text/x-patch, +22 −8)
On Thu, Dec 26, 2019 at 9:49 AM, Julien Rouhaud <rjuju123@gmail.com> wrote:
> [...]
> Thanks for the review Guillaume. Double checking the doc, I see that
> I made a copy/pasto mistake in the new field name. Attached v3 should
> be all good.
Feeling bad I missed this. But, yeah, it's much better with the right
column name.
For me, it looks ready for committer. Should I set it this way in the
Commit Fest app?
--
Guillaume.
On Thu, Dec 26, 2019 at 10:20 AM Guillaume Lelarge
<guillaume@lelarge.info> wrote:
> [...]
> Feeling bad I missed this. But, yeah, it's much better with the right
> column name.
> For me, it looks ready for committer. Should I set it this way in the
> Commit Fest app?
If you don't see any other issue with the patch, I'd say yes. A
committer can still put it back to waiting on author if needed.
On Thu, Dec 26, 2019 at 10:26 AM, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > For me, it looks ready for committer. Should I set it this way in
> > the Commit Fest app?
> If you don't see any other issue with the patch, I'd say yes. A
> committer can still put it back to waiting on author if needed.
That's also what I thought, but as I was the only one commenting on this...
Anyway, done.
--
Guillaume.
Hello
I doubt that "Process ID of the lock group leader" is enough for user documentation. I think we need note:
- this field is related to parallel query execution
- leader_pid = pid if process is parallel leader
- leader_pid would point to pid of the leader if process is parallel worker
- leader_pid will be NULL for non-parallel queries or idle sessions
Also patch has no tests. Possible this is normal, not sure how to write a reliable test for this feature.
Patch applies, compiles, pass tests
regards, Sergei
Hello,
On Thu, Dec 26, 2019 at 12:18 PM Sergei Kornilov <sk@zsrv.org> wrote:
I doubt that "Process ID of the lock group leader" is enough for user documentation. I think we need note:
- this field is related to parallel query execution
- leader_pid = pid if process is parallel leader
- leader_pid would point to pid of the leader if process is parallel worker
- leader_pid will be NULL for non-parallel queries or idle sessions
As I understand it, lock groups are infrastructure used by parallel
queries, but they could be used for something else too. So if more
documentation is needed, we should say something like "For now, only
parallel queries can have a lock group", or something like that.
The fact that leader_pid == pid for the leader and differs for the
other members should be obvious; I'm not sure that it's worth
documenting.
> Also, the patch has no tests. Possibly this is normal; I am not sure
> how to write a reliable test for this feature.
Yes, I was unsure whether some extra testing was required. We could
set force_parallel_mode to on and query "select leader_pid is not null
from pg_stat_activity where pid = pg_backend_pid()", and then run the
opposite test, which should do the trick.
Hello
> As I understand it, lock groups are infrastructure used by parallel
> queries, but they could be used for something else too. So if more
> documentation is needed, we should say something like "For now, only
> parallel queries can have a lock group", or something like that.
If lockGroupLeader ends up being used in some way for non-parallel
queries, then the name leader_pid could be confusing, no?
I treat pg_stat_activity as a view for users. Do we document anywhere
what a "lock group leader" is (except the README in the source tree)?
I mean, a user reading the documentation will wonder: "ok, this field
is the process ID of the lock group leader, but what is that?".
Exposing the leader pid for a parallel worker will be a clear
improvement for users, and it seems lockGroupLeader->pid is exactly
that. Therefore, I would like to see such a description and the
meaning of the field documented.
> The fact that leader_pid == pid for the leader and differs for the
> other members should be obvious; I'm not sure that it's worth
> documenting.
It may not be obvious that leader_pid is not NULL in this case. But
OK, no objections.
regards, Sergei
On Fri, Dec 27, 2019 at 10:01 AM Sergei Kornilov <sk@zsrv.org> wrote:
> > So if more documentation is needed, we should say something like
> > "For now, only parallel queries can have a lock group", or something
> > like that.
> If lockGroupLeader ends up being used in some way for non-parallel
> queries, then the name leader_pid could be confusing, no? [...]
> Therefore, I would like to see such a description and the meaning of
> the field documented.
I think that not using "parallel" to name this field will help to
avoid confusion if the lock group infrastructure is eventually used
for something else, but that's only true if indeed we explain what a
lock group is.
> > The fact that leader_pid == pid for the leader and differs for the
> > other members should be obvious; I'm not sure that it's worth
> > documenting.
> It may not be obvious that leader_pid is not NULL in this case. But
> OK, no objections.
If we adapt lmgr/README to document group locking, it also addresses
this. What do you think of:
The leader_pid is NULL for processes not involved in parallel query.
When a process wants to cooperate with parallel workers, it becomes a
lock group leader, which means that this field will be set to its own
pid. When a parallel worker starts up, this field will be set to the
leader's pid.
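To make the three states concrete, here is a hypothetical query
against the patched view (the CASE labels are just illustrative, not
wording from the patch):

SELECT pid, leader_pid,
       CASE WHEN leader_pid IS NULL THEN 'not involved in parallel query'
            WHEN leader_pid = pid   THEN 'lock group leader'
            ELSE 'parallel worker'
       END AS role
FROM pg_stat_activity;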
On Fri, Dec 27, 2019 at 10:15:33AM +0100, Julien Rouhaud wrote:
> I think that not using "parallel" to name this field will help to
> avoid confusion if the lock group infrastructure is eventually used
> for something else, but that's only true if indeed we explain what a
> lock group is.
As you already pointed out, src/backend/storage/lmgr/README includes a
full description of this stuff under the section "Group Locking". So
I agree that the patch ought to document your new field in a much
better way. Avoiding the term "group locking" is even better, so as
not to confuse the reader, because this term is not present in the
docs at all.
> The leader_pid is NULL for processes not involved in parallel query.
> [...]
The first sentence is good to have. Now instead of "lock group
leader", I think that we had better use "parallel group leader" as in
other parts of the docs (see wait events for example). Then we just
need to say that if leader_pid has the same value as
pg_stat_activity.pid, then we have a group leader. If not, then it is
a parallel worker process initially spawned by the leader whose PID is
leader_pid (when executing Gather, Gather Merge, soon-to-be parallel
vacuum or for a parallel btree build, but this does not need a mention
in the docs). There could be an argument as well to have leader_pid
set to NULL for a leader, but that would be inconsistent with what
the PGPROC entry reports for the backend.
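For example, each worker row can be matched back to its leader's
entry with a simple self-join (a hypothetical query, not something
the patch itself needs to include):

SELECT w.pid AS worker_pid, l.pid AS leader_pid, l.query
FROM pg_stat_activity w
JOIN pg_stat_activity l ON w.leader_pid = l.pid
WHERE w.pid <> w.leader_pid;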
While looking at the code, I think that we could refactor things a bit
for raw_wait_event, wait_event_type and wait_event, which have some
duplicated code for backend and auxiliary processes. What about
filling in the wait event data after fetching the PGPROC entry, and
also filling in leader_pid for auxiliary processes? This does not
matter now, and perhaps it never will (or it will), but that would
make the code much more consistent.
--
Michael
On Thu, Jan 16, 2020 at 04:27:27PM +0900, Michael Paquier wrote:
> While looking at the code, I think that we could refactor things a
> bit for raw_wait_event, wait_event_type and wait_event, which have
> some duplicated code for backend and auxiliary processes. [...]
And actually, the way you are looking at the leader's PID is visibly
incorrect and inconsistent, because the patch takes no shared LWLock
on the leader using LockHashPartitionLockByProc() followed by
LWLockAcquire(), no? That's incorrect because it would be perfectly
possible to crash with this code between the moment you check whether
lockGroupLeader is NULL and the moment you look at
lockGroupLeader->pid, if a process being stopped in-between removes
itself from its lock group in ProcKill(). That's also inconsistent
because it would be perfectly possible to finish with an incorrect
view of the data while scanning all the backend entries, like a
leader set to NULL with workers pointing to the leader, for example,
or even workers marked incorrectly as NULL. The second one may not be
a problem, but the first one could be confusing.
--
Michael
On Thu, Jan 16, 2020 at 8:28 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Dec 27, 2019 at 10:15:33AM +0100, Julien Rouhaud wrote:
> > [...]
> The first sentence is good to have. Now instead of "lock group
> leader", I think that we had better use "parallel group leader" as in
> other parts of the docs (see wait events for example).
Ok, I'll change it this way.
> Then we just need to say that if leader_pid has the same value as
> pg_stat_activity.pid, then we have a group leader. [...] There could
> be an argument as well to have leader_pid set to NULL for a leader,
> but that would be inconsistent with what the PGPROC entry reports for
> the backend.
It would also slightly complicate things to get the full set of
backends involved in a parallel query, while excluding the leader is
entirely trivial.
> While looking at the code, I think that we could refactor things a
> bit for raw_wait_event, wait_event_type and wait_event, which have
> some duplicated code for backend and auxiliary processes. [...]
Yeah, I didn't think that auxiliary processes would be involved any
time soon, but I can include this refactoring.
On Thu, Jan 16, 2020 at 8:49 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Jan 16, 2020 at 04:27:27PM +0900, Michael Paquier wrote:
> > [...]
> And actually, the way you are looking at the leader's PID is visibly
> incorrect and inconsistent, because the patch takes no shared LWLock
> on the leader using LockHashPartitionLockByProc() followed by
> LWLockAcquire(), no? [...]
Oh indeed. But unless we hold some LWLock during the whole function
execution, we cannot guarantee a consistent view, right? And isn't it
already possible to e.g. see a parallel worker in pg_stat_activity
while all other queries are shown as idle, if you're unlucky enough?
Also, LockHashPartitionLockByProc requires the leader PGPROC, and
there's no guarantee that we'll see the leader before any of the
workers, so I'm unsure how to implement what you said. Wouldn't it be
better to simply fetch the leader PGPROC after acquiring a shared
ProcArrayLock, and use that copy to display the pid, after checking
that we retrieved a non-null PGPROC?
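To be concrete, here is a minimal sketch of what I have in mind,
assuming it runs in pg_stat_get_activity() (the placement and the
local variable names are hypothetical, not taken from the patch):

    /* read the group leader's pid while holding ProcArrayLock */
    PGPROC     *proc;
    int         leader_pid = 0;    /* 0 will be reported as SQL NULL */

    LWLockAcquire(ProcArrayLock, LW_SHARED);
    proc = BackendPidGetProcWithLock(beentry->st_procpid);
    if (proc != NULL && proc->lockGroupLeader != NULL)
        leader_pid = proc->lockGroupLeader->pid;
    LWLockRelease(ProcArrayLock);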
On Fri, Jan 17, 2020 at 05:07:55PM +0100, Julien Rouhaud wrote:
> Oh indeed. But unless we hold some LWLock during the whole function
> execution, we cannot guarantee a consistent view, right?
Yep. That's possible.
> And isn't it already possible to e.g. see a parallel worker in
> pg_stat_activity while all other queries are shown as idle, if
> you're unlucky enough?
Yep. That's possible.
> Also, LockHashPartitionLockByProc requires the leader PGPROC, and
> there's no guarantee that we'll see the leader before any of the
> workers, so I'm unsure how to implement what you said. Wouldn't it
> be better to simply fetch the leader PGPROC after acquiring a shared
> ProcArrayLock, and use that copy to display the pid, after checking
> that we retrieved a non-null PGPROC?
Another idea would be to check whether the current PGPROC entry is a
leader and print in an int[] the list of PIDs of all the workers,
while holding a shared LWLock to avoid anything being unregistered.
Less handy, but a bit more consistent. One problem with doing that is
that you may have in your list of PIDs some worker processes that are
already gone by the time you go through all the other backend entries.
An advantage is that an empty array could mean "I am the leader, but
nothing has been registered yet in my lock group" (note that the
leader adds itself to lockGroupMembers).
--
Michael
On Sat, Jan 18, 2020 at 3:51 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Jan 17, 2020 at 05:07:55PM +0100, Julien Rouhaud wrote:
> > [...]
> Another idea would be to check whether the current PGPROC entry is a
> leader and print in an int[] the list of PIDs of all the workers,
> while holding a shared LWLock to avoid anything being unregistered.
> [...]
So, AFAICT LockHashPartitionLockByProc is required when
iterating/modifying lockGroupMembers or lockGroupLink, but just
getting the leader pid should be safe. Since we'll never be able to
get a totally consistent view of the data here, I'm in favor of
avoiding taking extra locks. I agree that outputting an array of the
pids would be more consistent for the leader, but it would have its
own set of corner cases. It seems to me that a new leader_pid column
is easier to handle at the SQL level, so I kept that approach in the
attached v4. If you have strong objections to it, I can still change
it.
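For instance, counting the workers attached to each leader stays a
plain aggregate (a sketch against the patched view):

SELECT leader_pid,
       count(*) FILTER (WHERE pid <> leader_pid) AS n_workers
FROM pg_stat_activity
WHERE leader_pid IS NOT NULL
GROUP BY leader_pid;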
Attachments:
pgsa-leader-pid-v4.diff (text/x-patch, +35 −25)
On Tue, Jan 28, 2020 at 12:36:41PM +0100, Julien Rouhaud wrote:
> [...] It seems to me that a new leader_pid column is easier to
> handle at the SQL level, so I kept that approach in the attached v4.
> If you have strong objections to it, I can still change it.
I agree a separate "leader_id" column is easier to work with, as it does
not require unnesting and so on.
As for the consistency, I agree we probably can't make this perfect, as
we're fetching and processing the PGPROC records one by one. Fixing that
would require acquiring a much stronger lock on PGPROC, and perhaps some
other locks. That's pre-existing behavior, of course, it's just not very
obvious as we don't have any dependencies between the rows, I think.
Adding the leader_id will change, that, of course. But I think it's
still mostly OK, even with the possible inconsistency.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 28, 2020 at 2:09 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Adding leader_pid will change that, of course. But I think it's
> still mostly OK, even with the possible inconsistency.
There have already been some dependencies between the rows since
parallel queries were added, as you could see e.g. a parallel worker
while no query is currently active. This patch will make those corner
cases more obvious. Should I document the possible inconsistencies?
On Tue, Jan 28, 2020 at 02:26:34PM +0100, Julien Rouhaud wrote:
> On Tue, Jan 28, 2020 at 2:09 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > [...]
> There have already been some dependencies between the rows since
> parallel queries were added, as you could see e.g. a parallel worker
> while no query is currently active. This patch will make those
> corner cases more obvious.
Yeah, sure. I mean explicit dependencies, e.g. a column referencing
values from another row, like leader_pid does.
> Should I document the possible inconsistencies?
I think it's worth mentioning that as a comment in the code, say
before the pg_stat_get_activity function. IMO we don't need to
document all possible inconsistencies; a generic explanation is
enough.
Not sure about the user docs. Do they currently say anything about
this topic, i.e. consistency of the stats views?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services