[PATCH v2] Progress command to monitor progression of long running SQL queries
Hello,
This is version 2 of the new command, named PROGRESS, which I wrote in order
to monitor long running SQL queries in a Postgres backend process.
New command justification
====================
The purpose of the command is to monitor long running SQL queries in a
backend process to evaluate when they will complete.
First, the user needs to find the pid of the backend process which is
executing the long running SQL query for which we want to monitor
progression.
This search may be based on the SQL query text, the session, or the terminal,
using the pg_stat_activity view. This view provides the best way to identify
the query to be monitored.
Once the pid of the backend is determined, the user may run:
psql> PROGRESS [VERBOSE|(FORMAT TEXT|JSON|XML|INLINE)] PID_OF_BACKEND
The output shows a table similar to that of the EXPLAIN command, but with
further details about the execution progress of each node which may take a
long time to execute.
Such nodes are SeqScan (sequential scans), IndexScan (index scans), Sort
(long sorts which require the on-disk tape mechanism), TupleStore (on-disk
stores of tuple tables), Limit, and HashJoinTable, which uses disk files to
perform hash joins.
Other nodes build on the previous ones. For instance, Material nodes use
TupleStore nodes.
The PROGRESS command can be used with the psql \watch command. For
instance, the user may run:
psql> \watch PROGRESS PID_OF_BACKEND
This is handy to view the progression of SQL queries.
Use case
=======
A use case is shown in the example below, based on a table named t_10m with
at least 10 million rows.
The table has been created with:
CREATE TABLE T_10M ( id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id,
md5(random()::text) AS md5;
ANALYSE t_10m;
ANALYSE is needed to keep statistics up to date. These statistics are used
to compare the rows fetched with the total number of rows, and to report the
percentage of execution of query nodes.
If ANALYSE is not run against the relations used by the monitored query,
the reported percentages may be inaccurate.
1/ Start a first psql session to run a long SQL query:
[pgadm@rco ~]$ psql -A -d test
psql (10devel)
Type "help" for help.
test=#
The -A option makes rows print unaligned, without formatting work.
Redirect the output to a file in order to let the query run without terminal
interaction:
test=# \o out
Start a long running query:
test=# select * from t_10M order by md5;
2/ In a second psql session, list the backend pid(s) and their matching SQL
queries:
[pgadm@rco ~]$ psql -d test
psql (10devel)
Type "help" for help.
test=# select * from prg;
pid | query
-------+-----------------------------------
26306 |
26309 |
26311 | select * from t_10m order by md5;
26312 | select * from t_10m order by md5;
26313 | select * from t_10m order by md5;
26330 | select * from prg;
26304 |
26303 |
26305 |
(9 rows)
test=#
Choose the pid of the backend running the long SQL query to be monitored.
The example above is a parallel SQL query; the lowest pid is the main
backend of the query.
test=# PROGRESS 26311;
PLAN PROGRESS
--------------------------------------------------------------------------------------------
status: <query running>
query: select * from t_10m order by md5;
time used (s): 20
Gather Merge
worker child: 26312
  -> Sort => rows r/w merge 0/0 sort 0/1895721 0% tape blocks 10903
     Sort Key: md5
     -> Parallel Seq Scan on t_10m => rows 1903441/4166657 45% => blks 49924/83334 59%
worker child: 26313
  -> Sort => rows r/w merge 0/0 sort 0/1967521 0% tape blocks 11314
     Sort Key: md5
     -> Parallel Seq Scan on t_10m => rows 1975561/4166657 47% => blks 49927/83334 59%
worker parent: 26311
  -> Sort on tapes writing => rows r/w merge 0/0 sort 0/2075590 0% tape blocks 11935
     Sort Key: md5
     -> Parallel Seq Scan on t_10m => rows 2111550/4166657 50% => blks 49928/83334 59%
total disk space used (MB) 266
(17 rows)
test=#
The query run by the monitored backend is:
test=# select * from t_10M order by md5;
Because the table has 10 million rows, the sort is done on tapes.
The output shows the progression in terms of:
- rows:
  - rows already fetched,
  - total rows to be fetched,
- blocks:
  - blocks already fetched,
  - total blocks to be fetched.
Each node may have different output types.
Progression is reported in terms of rows, blocks, bytes, and percentages.
Sort nodes show the tape blocks being used and the step phase (merge/sort).
Implementation of the command
========================
The design of the command is:
- the user issues the "PROGRESS pid" command from a psql session.
The pid is that of the backend which runs the SQL query whose progression we
want reported. It can be determined from the pg_stat_activity view.
- upon receiving the "PROGRESS pid" command from the psql utility, the
monitoring backend sends a signal, with reason PROCSIG_PROGRESS (defined in
src/include/storage/procsignal.h), to the backend whose process pid is the
one provided in the PROGRESS command.
- the monitored backend receives the signal together with its reason
(progression request) and notes the request, as for any interrupt. It then
continues executing its SQL query until interrupts can be serviced.
- when the monitored process can service interrupts, it handles the progress
request by collecting its execution tree together with the execution
progress of each long running node. At this point, the SQL query must be in
a consistent state, with no partial nodes being allocated or freed. This is
enforced by flags added in the executor. The progression is only collected
if the backend is within ExecutorRun(). If the execution tree is in a
consistent state, it is dumped into shared memory pages allocated at server
start. The monitored backend then sets a latch on which the monitoring
backend is waiting, and continues executing its SQL query.
- the monitoring backend collects the dumped shared memory data and sends
it to its psql session as a list of rows.
The PROGRESS command does not slow down the running query, because the
execution progression is only computed upon receiving a progress request,
which is expected to be infrequent.
The code heavily reuses that of the EXPLAIN command. In order to share as
much code as possible with EXPLAIN, the parts of the EXPLAIN code which deal
with reporting (quals, for instance) have been moved to a new report.c file
in the src/backend/commands folder. The code in report.c is shared between
the explain.c source code and the PROGRESS command source code, which is in
the progress.c file.
The progression reported by the PROGRESS command is given in terms of rows,
blocks, bytes, and percentages. The values displayed depend on the node type
in the execution plan. The current patch implements all the node types which
could take a lot of time.
Parallel queries can also be monitored. The same mechanism is used to
monitor child workers, with a slight difference: the main worker requests
the child progression directly in order to dump the whole progress tree in
shared memory.
Patch
=====
The patch for version 2 is attached to the current email. It has been
rebased on the latest commits.
The patch is also available at: https://github.com/colinet/progress/tree/progress
The diff stat of the patch is:
[root@rco pg]# git diff --stat master..
contrib/auto_explain/auto_explain.c | 5 +-
contrib/postgres_fdw/postgres_fdw.c | 13 +-
src/backend/access/heap/heapam.c | 2 +
src/backend/access/transam/parallel.c | 2 +-
src/backend/commands/Makefile | 3 +-
src/backend/commands/explain.c | 2834 ++++++++++----------------------------------------------------------------
src/backend/commands/prepare.c | 5 +-
src/backend/commands/progress.c | 2035 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/backend/commands/report.c | 2252 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
src/backend/executor/execMain.c | 8 +
src/backend/executor/execProcnode.c | 31 ++
src/backend/executor/nodeBitmapHeapscan.c | 13 +-
src/backend/executor/nodeIndexonlyscan.c | 13 +-
src/backend/executor/nodeIndexscan.c | 15 +-
src/backend/executor/nodeSamplescan.c | 12 +-
src/backend/executor/nodeSeqscan.c | 16 +-
src/backend/nodes/bitmapset.c | 19 +
src/backend/nodes/outfuncs.c | 245 ++++++++++
src/backend/parser/gram.y | 79 +++-
src/backend/postmaster/postmaster.c | 1 +
src/backend/storage/file/buffile.c | 73 +++
src/backend/storage/file/fd.c | 15 +
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/ipc/procarray.c | 57 +++
src/backend/storage/ipc/procsignal.c | 4 +
src/backend/storage/lmgr/lwlock.c | 7 +-
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/tcop/postgres.c | 65 ++-
src/backend/tcop/pquery.c | 25 +
src/backend/tcop/utility.c | 10 +
src/backend/utils/init/globals.c | 12 +
src/backend/utils/sort/logtape.c | 18 +
src/backend/utils/sort/tuplesort.c | 153 +++++-
src/backend/utils/sort/tuplestore.c | 75 ++-
src/include/access/parallel.h | 1 +
src/include/commands/explain.h | 67 +--
src/include/commands/prepare.h | 2 +-
src/include/commands/report.h | 159 +++++++
src/include/executor/execdesc.h | 3 +
src/include/executor/progress.h | 52 ++
src/include/foreign/fdwapi.h | 10 +-
src/include/miscadmin.h | 2 +
src/include/nodes/bitmapset.h | 1 +
src/include/nodes/execnodes.h | 3 +
src/include/nodes/extensible.h | 6 +-
src/include/nodes/nodes.h | 8 +
src/include/nodes/parsenodes.h | 11 +
src/include/nodes/plannodes.h | 11 +
src/include/parser/kwlist.h | 2 +
src/include/pgstat.h | 3 +-
src/include/storage/buffile.h | 11 +
src/include/storage/fd.h | 2 +
src/include/storage/procarray.h | 3 +
src/include/storage/procsignal.h | 3 +
src/include/tcop/pquery.h | 1 +
src/include/utils/logtape.h | 2 +
src/include/utils/tuplesort.h | 72 ++-
src/include/utils/tuplestore.h | 35 ++
58 files changed, 5972 insertions(+), 2619 deletions(-)
[root@rco pg]#
The PROGRESS command can be used with the \watch command of psql, making it
handier to monitor a long running query.
The default format of the PROGRESS command is inline text. The format can be
set to JSON, XML, or TEXT, as for the EXPLAIN command.
Use cases
========
Some further examples of use are shown in the attached test_v2.txt file,
with parallel queries, sorts, hashes, scans, and JSON format output.
TODO
=====
Monitor the progress of utility commands such as CREATE INDEX, which may
take a long time to complete.
Write documentation.
Changelog
========
v2:
Added JSON, XML, and TEXT INLINE format output
Added the time consumed to run the SQL query
Added the SQL query being monitored
Added support for parallel queries
Added the disk space used by the queries
Rebased on May 9th, 2017 commits
v1:
Initial version with only TEXT format reporting
Any suggestion, comment or feedback is welcome.
Regards
Remi
On Wed, May 10, 2017 at 06:40:31PM +0200, Remi Colinet wrote:
Hello,
This is version 2 of the new command name PROGRESS which I wrote in
order to monitor long running SQL queries in a Postgres backend
process.
Perhaps I missed something important in the discussion, but was there
a good reason that this isn't a function callable from SQL, i.e. not
restricted to the psql client?
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi
2017-05-10 18:40 GMT+02:00 Remi Colinet <remi.colinet@gmail.com>:
Hello,
This is version 2 of the new command name PROGRESS which I wrote in order
to monitor long running SQL queries in a Postgres backend process.
This patch introduces a lot of reserved keywords. Why? It can break a lot of
applications. Can't you enhance the EXPLAIN command instead?
Regards
Pavel
On Thu, May 11, 2017 at 1:40 AM, Remi Colinet <remi.colinet@gmail.com> wrote:
This is version 2 of the new command name PROGRESS which I wrote in order to
monitor long running SQL queries in a Postgres backend process.
It would be a good idea to add this patch to the next commit fest if
you are expecting some feedback:
https://commitfest.postgresql.org/14/
But please don't expect immediate feedback, the primary focus these
days is to stabilize the upcoming release.
New command justification
====================
[root@rco pg]# git diff --stat master..
[...]
58 files changed, 5972 insertions(+), 2619 deletions(-)
For your first patch, you may want something less... Challenging.
Please note as well that it would be nice if you review other patches
to get more familiar with the community process, it is expected that
for each patch submitted, an equivalent amount of review is done.
That's hard to measure but for a patch as large as that you will need
to pick up at least one patch equivalent in difficulty.
Regarding the patch, this is way too close to the progress facility
already in place. So why don't you extend it for the executor? We can
likely come up with something that can help, though I see the part
where the plan run by a backend is shared among all processes as a
major performance bottleneck. You have at least three different
concepts here:
- Store plans run in shared memory to allow access to any other processes.
- Allow a process to look at the plan run by another one with a SQL interface.
- Track the progress of a running plan, via pg_stat_activity.
All in all, I think that a new command is not adapted, and that you
had better split each concept apart before implementing something
over-engineered like the attached patch.
--
Michael
That's a good point.
I will probably convert the new command to a SQL function.
I was fumbling between a table and a SQL function. Oracle uses
v$session_longops to track the progression of long running SQL queries which
have run for more than 6 seconds.
But a function is much better for providing parameters such as the backend
pid and a format specification for the output.
Regards
Remi
2017-05-10 21:52 GMT+02:00 David Fetter <david@fetter.org>:
On Wed, May 10, 2017 at 06:40:31PM +0200, Remi Colinet wrote:
Hello,
This is version 2 of the new command name PROGRESS which I wrote in
order to monitor long running SQL queries in a Postgres backend
process.Perhaps I missed something important in the discussion, but was there
a good reason that this isn't a function callable from SQL, i.e. not
restricted to the psql client?Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
2017-05-11 4:05 GMT+02:00 Michael Paquier <michael.paquier@gmail.com>:
On Thu, May 11, 2017 at 1:40 AM, Remi Colinet <remi.colinet@gmail.com> wrote:

This is version 2 of the new command name PROGRESS which I wrote in order to
monitor long running SQL queries in a Postgres backend process.
It would be a good idea to add this patch to the next commit fest if
you are expecting some feedback:
https://commitfest.postgresql.org/14/
But please don't expect immediate feedback, the primary focus these
days is to stabilize the upcoming release.
Yes, I understand. I will submit the patch to the commit fest once I have
received some feedback.
I have already received some interesting suggestions, such as using a SQL
function instead of a whole new command.
The purpose for the moment is to gather feedback and advice about such a
feature.
New command justification
====================
[root@rco pg]# git diff --stat master..
[...]
58 files changed, 5972 insertions(+), 2619 deletions(-)

For your first patch, you may want something less... Challenging.
Please note as well that it would be nice if you review other patches
to get more familiar with the community process, it is expected that
for each patch submitted, an equivalent amount of review is done.
That's hard to measure but for a patch as large as that you will need
to pick up at least one patch equivalent in difficulty.
Acked
Regarding the patch, this is way too close to the progress facility
already in place. So why don't you extend it for the executor? We can
likely come up with something that can help, though I see the part
where the plan run by a backend is shared among all processes as a
major performance bottleneck. You have at least three different
concepts here:
- Store plans run in shared memory to allow access to any other processes.
- Allow a process to look at the plan run by another one with a SQL
interface.
- Track the progress of a running plan, via pg_stat_activity.
All in all, I think that a new command is not adapted, and that you
had better split each concept before implementing something
over-engineered like the patch attached.
I initially had a look at the progress facility, but it seemed to me very
far from the current goal.
Btw, I will reconsider it. Things may have changed.
The plan and its planstate are not permanently shared with other backends.
This would be useless and very costly.
As long as the PROGRESS command is not used, the patch has a very low
impact on the executor.
A few counters and flags are added/updated by the patch in the executor.
This is all that is needed to track the progress of execution.
For instance, in src/backend/executor/execProcnode.c, I've added:

ExecInitNode():
+       result->plan_rows = 0;

ExecProcNode():
+       node->plan_rows++;

ExecEndNode():
+       node->plan_rows = 0;

This is enough to compute the number of rows fetched by the plan at this
step of the plan.
A few counters have been added to track sorts, scans, hashes, and tuple
stores.
The disk space used by nodes is also tracked. This shows the disk resources
needed to complete a SQL query. This was not initially a purpose of the
patch, but I find it very handy.
Some debugging code will also be removed, making the change in the executor
very small.
The code change in the executor is less than 50 lines; it is about 150 lines
in the tuple sort/store code, mainly to add functions to fetch the sort and
store status.
The execution tree is dumped into shared memory only when requested by a
monitoring backend sending a signal to the monitored backend.
The monitoring backend first sends a signal, which is noted by the backend
running the long SQL query.
Once the long SQL query can deal with the request (interrupts enabled), it
dumps its execution tree into shared memory and then resumes its execution.
So the dump of the execution tree is done only when it is requested.
This does not seem to incur any performance hit, unless I missed something.
Once the concept is mature, I will split the patch into small consistent
parts to allow a smooth and clean merge.
Regards
Remi
--
Michael
Thx for your comment.
Based on other comments, I will probably convert the command into a SQL
function.
This allows passing arguments such as the ones which can be used in the
current command (VERBOSE, FORMAT), and it will avoid keyword collisions.
Regards
Remi
2017-05-10 22:04 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:
Hi
2017-05-10 18:40 GMT+02:00 Remi Colinet <remi.colinet@gmail.com>:
Hello,
This is version 2 of the new command name PROGRESS which I wrote in order
to monitor long running SQL queries in a Postgres backend process.

This patch introduces a lot of reserved keywords. Why? It can break a lot of
applications. Can't you enhance the EXPLAIN command instead?

Regards
Pavel
On Thu, May 11, 2017 at 05:24:16PM +0200, Remi Colinet wrote:
2017-05-10 21:52 GMT+02:00 David Fetter <david@fetter.org>:
On Wed, May 10, 2017 at 06:40:31PM +0200, Remi Colinet wrote:
Hello,
This is version 2 of the new command name PROGRESS which I wrote in
order to monitor long running SQL queries in a Postgres backend
process.

Perhaps I missed something important in the discussion, but was there
a good reason that this isn't a function callable from SQL, i.e. not
restricted to the psql client?
Please don't top post. http://www.caliburn.nl/topposting.html
That's good point.
I will probably convert the new command to a SQL function.
Great!
I was fumbling between a table or a SQL function. Oracle uses
v$session_longops to track progression of long running SQL queries
which have run for more than 6 seconds. But a function is much
better to provide parameters such as the backend pid and a format
specification for the output.
Once it's a function, it can very easily be called in a system (or
user-defined) view.
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Regarding the patch, this is way too close to the progress facility
already in place. So why don't you extend it for the executor?
I don't think that's a good idea. The existing progress facility
advertises a fixed number of counters with a command-type-specific
interpretation, but for *query* progress reporting, we really need a
richer model. A query plan can contain an unbounded number of
executor nodes and Remi's idea is, quite reasonably, to report
something about each one.
From a design point of view, I think a patch like this has to clear a
number of hurdles. It needs to:
1. Decide what data to expose. The sample output looks reasonable in
this case, although the amount of code churn looks really high.
2. Find a way to advertise the data that it exposes. This patch looks
like it allocates a half-MB of shared memory per backend and gives up
if that's not enough.
3. Find a way to synchronize the access to the data that it exposes.
This patch ignores that problem completely, so multiple progress
report commands running at the same time will behave unpredictably.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 10, 2017 at 10:10 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
Parallel queries can also be monitored. The same mechanism is used to monitor
child workers with a slight difference: the main worker requests the child
progression directly in order to dump the whole progress tree in shared
memory.
What if there is any error in the worker (like "out of memory") while
gathering the statistics? It seems that both the workers and the main
backend will just error out. I am not sure it is a good idea to error
out the backend or parallel worker, as it will just end the query
execution. Also, even if that is okay, there doesn't seem to be a way
by which a parallel worker can communicate the error back to the master
backend; rather it will just exit silently, which is not right.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 13, 2017 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Regarding the patch, this is way too close to the progress facility
already in place. So why don't you extend it for the executor?

I don't think that's a good idea. The existing progress facility
advertises a fixed number of counters with a command-type-specific
interpretation, but for *query* progress reporting, we really need a
richer model. A query plan can contain an unbounded number of
executor nodes and Remi's idea is, quite reasonably, to report
something about each one.
I see what you're getting at. As st_progress_param is upper-bounded,
what should actually be used as the report structure is a tree made of
nodes, with one planner node attached to each node of the report tree.
Then the reporting facility would need to make persistent the data
reported, which is in this case a set of planner nodes, and also to
make persistent the changes on the tree reported to the user. Then when
a backend reports its execution status, what it does is just update a
portion of the tree in shared memory. Lock contention may be a problem
though; perhaps things could be made atomic?
From a design point of view, I think a patch like this has to clear a
number of hurdles. It needs to:

1. Decide what data to expose. The sample output looks reasonable in
this case, although the amount of code churn looks really high.

2. Find a way to advertise the data that it exposes. This patch looks
like it allocates a half-MB of shared memory per backend and gives up
if that's not enough.
Perhaps DSM? It is not user-friendly to fail sporadically...
3. Find a way to synchronize the access to the data that it exposes.
This patch ignores that problem completely, so multiple progress
report commands running at the same time will behave unpredictably.
Yeah..
--
Michael
On Tue, May 16, 2017 at 2:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Perhaps DSM? It is not user-friendly to fail sporadically...
Yeah. I've been thinking we might want to give each backend a
backend-lifetime DSA that is created on first use. That could be
useful for some parallel query stuff and maybe for this as well.
Other backends could attach to it to read data from it, and it would
go away on last detach (which would normally be the detach by the
creating backend, but not if somebody else happens to be reading at
the moment the backend exits).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 16, 2017 at 7:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 16, 2017 at 2:17 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Perhaps DSM? It is not user-friendly to fail sporadically...
Yeah. I've been thinking we might want to give each backend a
backend-lifetime DSA that is created on first use. That could be
useful for some parallel query stuff and maybe for this as well.
Other backends could attach to it to read data from it, and it would
go away on last detach (which would normally be the detach by the
creating backend, but not if somebody else happens to be reading at
the moment the backend exits).
That seems like a pretty good idea. I've been considering something similar
for the use case of being able to view the "collected but not sent yet"
statistics inside a backend (which tables have been accessed, number of
reads, etc.), and this seems like it could be used for that as well.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
2017-05-13 3:53 GMT+02:00 Robert Haas <robertmhaas@gmail.com>:
On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Regarding the patch, this is way too close to the progress facility
already in place. So why don't you extend it for the executor?

I don't think that's a good idea. The existing progress facility
advertises a fixed number of counters with a command-type-specific
interpretation, but for *query* progress reporting, we really need a
richer model. A query plan can contain an unbounded number of
executor nodes and Remi's idea is, quite reasonably, to report
something about each one.
Oracle's way to report the progress of long queries is the
v$session_longops view, which collects progress reports for any backend
which has been running a SQL query for more than 6 seconds. But it's
probably even better to use a SQL function, which allows for instance
providing a specific pid parameter in order to limit the report to a given
backend. Users are expected to focus on one SQL query which is long to
complete.
From a design point of view, I think a patch like this has to clear a
number of hurdles. It needs to:

1. Decide what data to expose. The sample output looks reasonable in
this case, although the amount of code churn looks really high.
I will probably remove:
- the VERBOSE option (PROGRESS VERBOSE PID), which exposes very low-level
details of the tapes used by Sort nodes and by HashJoinTable nodes.
- the JSON and XML output formats, because the command will be replaced by a
SQL function returning a row set.
This should make the code much smaller and avoid any change to the EXPLAIN
code (shared so far).
2. Find a way to advertise the data that it exposes. This patch looks
like it allocates a half-MB of shared memory per backend and gives up
if that's not enough.
Maybe it needs dynamic allocation of shared memory instead of static
allocation at server start. This would spare shared memory and avoid
failures when the report is larger than the preallocated shared memory
buffer.
3. Find a way to synchronize the access to the data that it exposes.
This patch ignores that problem completely, so multiple progress
report commands running at the same time will behave unpredictably.

So far, I've been using a single lock in shared memory to serialize the
requests for the SQL progress report:
LWLockAcquire(ProgressLock, LW_EXCLUSIVE);
...
SendProcSignal(stmt->pid, PROCSIG_PROGRESS, bid);
...
LWLockRelease(ProgressLock);
It's expected that only a few DBAs would use such a feature at the same
time. The lock would need to be released if the requesting backend fails.
I've also realized that access control for the progress report needs to be
added.
Thx & regards
Remi
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2017-05-13 14:38 GMT+02:00 Amit Kapila <amit.kapila16@gmail.com>:
On Wed, May 10, 2017 at 10:10 PM, Remi Colinet <remi.colinet@gmail.com>
wrote:

Parallel queries can also be monitored. The same mechanism is used to
monitor child workers with a slight difference: the main worker requests
the child progression directly in order to dump the whole progress tree in
shared memory.

What if there is any error in the worker (like "out of memory") while
gathering the statistics? It seems both for workers as well as for
the main backend it will just error out. I am not sure if it is a
good idea to error out the backend or parallel worker as it will just
end the query execution.
The handling of a progress report starts with the creation of a
MemoryContext attached to CurrentMemoryContext. Then a small amount of
memory (a few KB) is allocated.
The handling of the progress report could indeed exhaust memory and fail
the backend request. But in such a situation, the backend could also have
failed even without any progress request.
Also, even if it is okay, there doesn't seem
to be a way by which a parallel worker can communicate the error back
to master backend, rather it will just exit silently which is not
right.
If a child worker fails, it will not respond to the main backend request.
The main backend will resume its execution after a 5 second timeout (a GUC
parameter may be added for this). In that case, the report will be
partially filled.
If the main backend fails, the requesting backend will have a response such
as:
test=# PROGRESS 14611;
PLAN PROGRESS
----------------
<backend timeout>
(1 row)
test=#
and the child workers will still write their response to shared memory.
This response will not be collected by the main backend, which has
failed.
Thx & regards
Remi
2017-05-16 8:17 GMT+02:00 Michael Paquier <michael.paquier@gmail.com>:
On Sat, May 13, 2017 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 10, 2017 at 10:05 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Regarding the patch, this is way too close to the progress facility
>> already in place. So why don't you extend it for the executor?
>
> I don't think that's a good idea. The existing progress facility
> advertises a fixed number of counters with a command-type-specific
> interpretation, but for *query* progress reporting, we really need a
> richer model. A query plan can contain an unbounded number of
> executor nodes and Remi's idea is, quite reasonably, to report
> something about each one.

I see what you're coming at. As st_progress_param is upper-bounded,
what should actually be used as the report structure is a tree made of
nodes, with one planner node attached to each node of the report tree.
The reporting facility would then need to make the reported data
persistent, which is in this case a set of planner nodes, and also make
persistent the changes to the tree reported to the user. Then when a
backend reports its execution status, what it does is just update a
portion of the tree in shared memory. Lock contention may be a problem
though; perhaps things could be made atomic?
In order to report the progress of a SQL query, the progress request
handler walks the execution tree of the query. No new node is ever
created. The execution tree has been slightly modified to collect the
number of rows/blocks returned during the executor run. This progress
state is only collected when a user requests it, and it is dumped into
shared memory at that moment.
As long as nobody asks for a progress report, there is no overhead
apart from a few counters updated during the executor run.
Regards
> From a design point of view, I think a patch like this has to clear a
> number of hurdles. It needs to:
>
> 1. Decide what data to expose. The sample output looks reasonable in
> this case, although the amount of code churn looks really high.
>
> 2. Find a way to advertise the data that it exposes. This patch looks
> like it allocates a half-MB of shared memory per backend and gives up
> if that's not enough.

Perhaps DSM? It is not user-friendly to fail sporadically...

> 3. Find a way to synchronize the access to the data that it exposes.
> This patch ignores that problem completely, so multiple progress
> report commands running at the same time will behave unpredictably.

Yeah..
--
Michael
On Wed, May 17, 2017 at 9:43 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
> 2017-05-13 14:38 GMT+02:00 Amit Kapila <amit.kapila16@gmail.com>:
>> On Wed, May 10, 2017 at 10:10 PM, Remi Colinet <remi.colinet@gmail.com>
>> wrote:
>>> Parallel queries can also be monitored. The same mechanism is used
>>> to monitor child workers with a slight difference: the main worker
>>> requests the child progression directly in order to dump the whole
>>> progress tree in shared memory.
>>
>> What if there is any error in the worker (like "out of memory") while
>> gathering the statistics? It seems both for workers as well as for
>> the main backend it will just error out. I am not sure if it is a
>> good idea to error out the backend or parallel worker as it will just
>> end the query execution.
>
> The handling of the progress report starts with the creation of a
> MemoryContext attached to CurrentMemoryContext. Then, a small amount
> of memory (a few KB) is allocated. The handling of the progress report
> could indeed exhaust memory and fail the backend request. But in such
> a situation, the backend could also have failed even without any
> progress request.
>
>> Also, even if it is okay, there doesn't seem
>> to be a way by which a parallel worker can communicate the error back
>> to master backend, rather it will just exit silently which is not
>> right.
>
> If a child worker fails, it will not respond to the main backend
> request. The main backend will resume its execution after a 5 second
> timeout (a GUC parameter could be added for this). In that case, the
> report would be partially filled.
>
> If the main backend fails, the requesting backend will have a response
> such as:
>
> test=# PROGRESS 14611;
>  PLAN PROGRESS
> ----------------
>  <backend timeout>
> (1 row)
>
> test=#
>
> and the child workers will still write their response to shared
> memory. This response will not be collected by the main backend, which
> has failed.
If the worker errors out for any reason, we should end the main
query as well; otherwise, it can produce wrong results. See the error
handling of workers in HandleParallelMessage().
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers