On-demand running query plans using auto_explain and signals

Started by Shulgin, Oleksandrover 10 years ago104 messages
#1Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
1 attachment(s)

Hello,

The other day I've rediscovered the exciting idea of using signals to dump
a backend's currently executed query plan, which, it turns out, was already
proposed by Pavel and Simon in these threads:

/messages/by-id/CAFj8pRA-DuzkmDtu52CiUgb0P7TVri_B8LtjMJfWcnr1LPts6w@mail.gmail.com

/messages/by-id/CAFj8pRDEo24joEg4UFRDYeFADFTw-jw_=t=kPwOyDW=v=g1Fhg@mail.gmail.com

Unfortunately, the latest one is missing an actual patch (or the attachment
was scrubbed) and I'm really surprised that the idea didn't take off then.

While evaluating it myself I've decided to simply patch auto_explain module
which is now in contrib, so presumably should be available to a broad
audience. Proof-of-concept patch against master is attached (a patch for
an older branch like 9.0 requires trivial manual effort to adapt).

While I can see the value of in-core support for this, realistically this
is unlikely to be included in 9.5, but an ad hoc patch could be useful long
before that happens.

What this patch does is simply dump the plan of the query being run by the
executor to the log when signaled with SIGUSR1. The auto_explain module
must be loaded to the backend beforehand of course,
session_preload_libraries seems like the place to do that cluster-wide.

Probably using SIGUSR2 would be more appropriate, but I'm not sure if there
are other extensions out there that might be already using it for some
other reason (well, I do not know that for SIGUSR1 either). Looking at the
current state of affairs in procsignal_sigusr1_handler() makes me believe
it should be pretty safe to catch the signal like I do. Or is that not the
case?

The current_query_desc probably needs to be a stack-like structure in order
to keep track of the nested queries correctly, but it works in the simplest
cases.

What would be even more useful is including stats from the running query in
the explain output, if you're a willing to pay for a (hopefully small)
overhead. Unfortunately, that doesn't work out of the box: if you enable
the auto_explain.log_analyze and friends in the .conf-file, you either get
all zero counts, or if you're really unlucky, an error from InstrEndLoop():

ERROR: InstrEndLoop called on running node

The latest patch in this area I could found is this one:
/messages/by-id/87wsn82lda.fsf@oxford.xeocode.com

From what I can see, there's no way around this problem short of hacking
InstrEndLoop... Did anything change in this area since 2008 possibly? I
would really love to have a way to make this work with existing un-patched
servers.

Cheers!
--
Alex

Attachments:

explain-on-signal.patchtext/x-patch; charset=US-ASCII; name=explain-on-signal.patchDownload
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 2a184ed..a64b3fe 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <signal.h>
 
 #include "commands/explain.h"
 #include "executor/instrument.h"
@@ -46,6 +47,9 @@ static ExecutorStart_hook_type prev_ExecutorStart = NULL;
 static ExecutorRun_hook_type prev_ExecutorRun = NULL;
 static ExecutorFinish_hook_type prev_ExecutorFinish = NULL;
 static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;
+static pqsigfunc prev_sigusr1 = NULL;
+
+static QueryDesc *current_query_desc = NULL;
 
 #define auto_explain_enabled() \
 	(auto_explain_log_min_duration >= 0 && \
@@ -54,6 +58,8 @@ static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;
 void		_PG_init(void);
 void		_PG_fini(void);
 
+static void sigusr1_handler(int signum);
+
 static void explain_ExecutorStart(QueryDesc *queryDesc, int eflags);
 static void explain_ExecutorRun(QueryDesc *queryDesc,
 					ScanDirection direction,
@@ -61,6 +67,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void explain_output_plain(QueryDesc *queryDesc, double time_msec, int format);
 
 /*
  * Module load callback
@@ -170,6 +177,8 @@ _PG_init(void)
 	ExecutorFinish_hook = explain_ExecutorFinish;
 	prev_ExecutorEnd = ExecutorEnd_hook;
 	ExecutorEnd_hook = explain_ExecutorEnd;
+
+	prev_sigusr1 = pqsignal(SIGUSR1, sigusr1_handler);
 }
 
 /*
@@ -185,12 +194,34 @@ _PG_fini(void)
 	ExecutorEnd_hook = prev_ExecutorEnd;
 }
 
+static void
+sigusr1_handler(int signum)
+{
+	int			save_errno = errno;
+
+	if (current_query_desc != NULL)
+	{
+		explain_output_plain(current_query_desc, 0.0, EXPLAIN_FORMAT_TEXT);
+	}
+
+	if (prev_sigusr1 != NULL &&
+		prev_sigusr1 != SIG_IGN &&
+		prev_sigusr1 != SIG_DFL)
+	{
+		prev_sigusr1(signum);
+	}
+
+	errno = save_errno;
+}
+
 /*
  * ExecutorStart hook: start up logging if needed
  */
 static void
 explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
+	current_query_desc = queryDesc;
+
 	if (auto_explain_enabled())
 	{
 		/* Enable per-node instrumentation iff log_analyze is required. */
@@ -294,45 +325,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		msec = queryDesc->totaltime->total * 1000.0;
 		if (msec >= auto_explain_log_min_duration)
 		{
-			ExplainState *es = NewExplainState();
-
-			es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
-			es->verbose = auto_explain_log_verbose;
-			es->buffers = (es->analyze && auto_explain_log_buffers);
-			es->timing = (es->analyze && auto_explain_log_timing);
-			es->summary = es->analyze;
-			es->format = auto_explain_log_format;
-
-			ExplainBeginOutput(es);
-			ExplainQueryText(es, queryDesc);
-			ExplainPrintPlan(es, queryDesc);
-			if (es->analyze && auto_explain_log_triggers)
-				ExplainPrintTriggers(es, queryDesc);
-			ExplainEndOutput(es);
-
-			/* Remove last line break */
-			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
-				es->str->data[--es->str->len] = '\0';
-
-			/* Fix JSON to output an object */
-			if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
-			{
-				es->str->data[0] = '{';
-				es->str->data[es->str->len - 1] = '}';
-			}
-
-			/*
-			 * Note: we rely on the existing logging of context or
-			 * debug_query_string to identify just which statement is being
-			 * reported.  This isn't ideal but trying to do it here would
-			 * often result in duplication.
-			 */
-			ereport(LOG,
-					(errmsg("duration: %.3f ms  plan:\n%s",
-							msec, es->str->data),
-					 errhidestmt(true)));
-
-			pfree(es->str->data);
+			explain_output_plain(queryDesc, msec, auto_explain_log_format);
 		}
 	}
 
@@ -340,4 +333,50 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		prev_ExecutorEnd(queryDesc);
 	else
 		standard_ExecutorEnd(queryDesc);
+
+	current_query_desc = NULL;
+}
+
+static void
+explain_output_plain(QueryDesc *queryDesc, double time_msec, int format)
+{
+	ExplainState *es = NewExplainState();
+
+	es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
+	es->verbose = auto_explain_log_verbose;
+	es->buffers = (es->analyze && auto_explain_log_buffers);
+	es->timing = (es->analyze && auto_explain_log_timing);
+	es->summary = es->analyze;
+	es->format = format;
+
+	ExplainBeginOutput(es);
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	if (es->analyze && auto_explain_log_triggers)
+		ExplainPrintTriggers(es, queryDesc);
+	ExplainEndOutput(es);
+
+	/* Remove last line break */
+	if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
+		es->str->data[--es->str->len] = '\0';
+
+	/* Fix JSON to output an object */
+	if (format == EXPLAIN_FORMAT_JSON)
+	{
+		es->str->data[0] = '{';
+		es->str->data[es->str->len - 1] = '}';
+	}
+
+	/*
+	 * Note: we rely on the existing logging of context or
+	 * debug_query_string to identify just which statement is being
+	 * reported.  This isn't ideal but trying to do it here would
+	 * often result in duplication.
+	 */
+	ereport(LOG,
+			(errmsg("duration: %.3f ms  plan:\n%s",
+					time_msec, es->str->data),
+			 errhidestmt(true)));
+
+	pfree(es->str->data);
 }
#2Andres Freund
andres@anarazel.de
In reply to: Shulgin, Oleksandr (#1)
Re: On-demand running query plans using auto_explain and signals

On 2015-08-29 17:33:22 +0200, Shulgin, Oleksandr wrote:

Probably using SIGUSR2 would be more appropriate, but I'm not sure if there
are other extensions out there that might be already using it for some
other reason (well, I do not know that for SIGUSR1 either). Looking at the
current state of affairs in procsignal_sigusr1_handler() makes me believe
it should be pretty safe to catch the signal like I do. Or is that not the
case?

You can catch signals, but you're not allowed to do a lot from
them. Anything allocating memory, acquiring locks, etc. is out - these
functions aren't reentrant. If you can guarantee that you're not
interrupting any relevant code you can bend those rules, but that's
obviously not the case here.

Check out the list of async-signal-safe functions at http://man7.org/linux/man-pages/man7/signal.7.html

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Andres Freund (#2)
Re: On-demand running query plans using auto_explain and signals

On Sat, Aug 29, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-08-29 17:33:22 +0200, Shulgin, Oleksandr wrote:

Probably using SIGUSR2 would be more appropriate, but I'm not sure if

there

are other extensions out there that might be already using it for some
other reason (well, I do not know that for SIGUSR1 either). Looking at

the

current state of affairs in procsignal_sigusr1_handler() makes me believe
it should be pretty safe to catch the signal like I do. Or is that not

the

case?

You can catch signals, but you're not allowed to do a lot from
them. Anything allocating memory, acquiring locks, etc. is out - these
functions aren't reentrant. If you can guarantee that you're not
interrupting any relevant code you can bend those rules, but that's
obviously not the case here.

Check out the list of async-signal-safe functions at
http://man7.org/linux/man-pages/man7/signal.7.html

Good point. There's still hope to set a flag and process it later on.
Will have to check if it's possible to stay in the scope of a loaded module
though.

#4Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#3)
Re: On-demand running query plans using auto_explain and signals

2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Sat, Aug 29, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-08-29 17:33:22 +0200, Shulgin, Oleksandr wrote:

Probably using SIGUSR2 would be more appropriate, but I'm not sure if

there

are other extensions out there that might be already using it for some
other reason (well, I do not know that for SIGUSR1 either). Looking at

the

current state of affairs in procsignal_sigusr1_handler() makes me

believe

it should be pretty safe to catch the signal like I do. Or is that not

the

case?

You can catch signals, but you're not allowed to do a lot from
them. Anything allocating memory, acquiring locks, etc. is out - these
functions aren't reentrant. If you can guarantee that you're not
interrupting any relevant code you can bend those rules, but that's
obviously not the case here.

Check out the list of async-signal-safe functions at
http://man7.org/linux/man-pages/man7/signal.7.html

Good point. There's still hope to set a flag and process it later on.
Will have to check if it's possible to stay in the scope of a loaded module
though.

I had a workable prototype - and It was implemented very similar as
handling CANCEL

Pavel

#5Andres Freund
andres@anarazel.de
In reply to: Pavel Stehule (#4)
Re: On-demand running query plans using auto_explain and signals

On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:

2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>

Good point. There's still hope to set a flag and process it later on.
Will have to check if it's possible to stay in the scope of a loaded module
though.

I had a workable prototype - and It was implemented very similar as
handling CANCEL

Where did you put the handling of that kind of interrupt? Directly into
ProcessInterrupts()?

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andres Freund (#5)
Re: On-demand running query plans using auto_explain and signals

2015-08-29 18:36 GMT+02:00 Andres Freund <andres@anarazel.de>:

On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:

2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <

oleksandr.shulgin@zalando.de>

Good point. There's still hope to set a flag and process it later on.
Will have to check if it's possible to stay in the scope of a loaded

module

though.

I had a workable prototype - and It was implemented very similar as
handling CANCEL

Where did you put the handling of that kind of interrupt? Directly into
ProcessInterrupts()?

Probably. I don't remember it well, but it need hack code - it cannot be
used from extension.

Pavel

Show quoted text

Andres

#7Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#6)
Re: On-demand running query plans using auto_explain and signals

On Aug 29, 2015 7:31 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

2015-08-29 18:36 GMT+02:00 Andres Freund <andres@anarazel.de>:

On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:

2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <

oleksandr.shulgin@zalando.de>

Good point. There's still hope to set a flag and process it later

on.

Will have to check if it's possible to stay in the scope of a loaded

module

though.

I had a workable prototype - and It was implemented very similar as
handling CANCEL

Where did you put the handling of that kind of interrupt? Directly into
ProcessInterrupts()?

Probably. I don't remember it well, but it need hack code - it cannot be

used from extension.

Do you still have the code somewhere around? Did it see production use?

Thanks!
--
Alex

#8Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#7)
Re: On-demand running query plans using auto_explain and signals

2015-08-30 10:30 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Aug 29, 2015 7:31 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

2015-08-29 18:36 GMT+02:00 Andres Freund <andres@anarazel.de>:

On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:

2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <

oleksandr.shulgin@zalando.de>

Good point. There's still hope to set a flag and process it later

on.

Will have to check if it's possible to stay in the scope of a

loaded module

though.

I had a workable prototype - and It was implemented very similar as
handling CANCEL

Where did you put the handling of that kind of interrupt? Directly into
ProcessInterrupts()?

Probably. I don't remember it well, but it need hack code - it cannot be

used from extension.

Do you still have the code somewhere around? Did it see production use?

I am not sure I am able to find it - I'll try. We didn't use it on
production.

Show quoted text

Thanks!
--
Alex

#9Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#8)
Re: On-demand running query plans using auto_explain and signals

2015-08-30 10:34 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:

2015-08-30 10:30 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Aug 29, 2015 7:31 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

2015-08-29 18:36 GMT+02:00 Andres Freund <andres@anarazel.de>:

On 2015-08-29 18:27:59 +0200, Pavel Stehule wrote:

2015-08-29 18:25 GMT+02:00 Shulgin, Oleksandr <

oleksandr.shulgin@zalando.de>

Good point. There's still hope to set a flag and process it later

on.

Will have to check if it's possible to stay in the scope of a

loaded module

though.

I had a workable prototype - and It was implemented very similar as
handling CANCEL

Where did you put the handling of that kind of interrupt? Directly into
ProcessInterrupts()?

Probably. I don't remember it well, but it need hack code - it cannot

be used from extension.

Do you still have the code somewhere around? Did it see production use?

I sent it to mailing list year ago

/messages/by-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com

Regards

Pavel

Show quoted text

I am not sure I am able to find it - I'll try. We didn't use it on
production.

Thanks!
--
Alex

#10Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#9)
Re: On-demand running query plans using auto_explain and signals

Do you still have the code somewhere around? Did it see production use?

I sent it to mailing list year ago

/messages/by-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com

Ah, thanks! Somehow I've missed this mail. You didn't add the patch to a
commitfest back then I think?

--
Alex

#11Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#10)
Re: On-demand running query plans using auto_explain and signals

2015-08-31 11:30 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

Do you still have the code somewhere around? Did it see production use?

I sent it to mailing list year ago

/messages/by-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com

Ah, thanks! Somehow I've missed this mail. You didn't add the patch to a
commitfest back then I think?

I had no time to finish this patch - there is few issues in signal handling
and returning back result - but still I want it :) - and what I know -
almost all other SQL db has similar functionality.

Show quoted text

--
Alex

#12Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#11)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Mon, Aug 31, 2015 at 12:35 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

/messages/by-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com

Ah, thanks! Somehow I've missed this mail. You didn't add the patch to
a commitfest back then I think?

I had no time to finish this patch - there is few issues in signal
handling and returning back result - but still I want it :) - and what I
know - almost all other SQL db has similar functionality.

I've updated the patch for the current master and also added some
unexpected parameters handling, so attached is a v2.

I'd say we should hide the so-designed pg_cmdstatus() interface behind more
friendly calls like pg_explain_backend() and pg_backend_progress() to give
some naming examples, to remove the need for magic numbers in the second
arg.

What I've found missing in this approach is the insight into nested
executor runs, so that if you're running a "SELECT my_func()", you only see
this outer query in the pg_cmdstatus() output. With the auto_explain
approach, by hooking into executor I was able to capture the nested queries
and their plans as well.

It's conceptually trivial to add some code to use the Executor hooks here,
but I don't see any precedent for this except for contrib modules
(auto_explain and pg_stat_statements), I'm just not sure if that would be
OK-ish.

And when we solve that, there is another problem of having a sane interface
to query the nested plans. For a psql user, probably the most interesting
would be the topmost (level=1) and the innermost (e.g. level=-1) plans. We
might also want to provide a full nesting of plans in a structured format
like JSON or... *cough* XML, for programs to consume and display nicely
with folding and stuff.

And the most interesting would be making instrumentation work with all of
the above.

I'm adding this to the next CF.

--
Alex

Attachments:

explain-pid-v2.patchtext/x-patch; charset=US-ASCII; name=explain-pid-v2.patchDownload
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..40db40d 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMDSTATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ce4bdaf..5d5df58 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoRequested)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 0000000..38c1947
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,508 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+#define CMDINFO_SLOTS		100
+#define BUFFER_SIZE			(8 * 1024)
+
+bool CmdStatusInfoRequested = false;
+
+typedef struct {
+	bool	is_valid;
+	bool	is_done;
+	int		target_pid;
+	int		sender_pid;
+	int		request_type;
+	int		result_code;
+} CmdStatusInfoEntry;
+
+typedef struct {
+	LWLock	   *lock;				/* protect slots - search/modification */
+	CmdStatusInfoEntry	*slots;
+	LWLock	   *buffer_lock;		/* protect buffer handling */
+	void	   *buffer;				/* result data */
+	Size		buffer_size;
+	int			target_pid;
+	int			sender_pid;
+	bool		buffer_is_free;		/* buffer is generally available */
+	bool		buffer_holds_data;	/* buffer holds a valid data */
+} CmdStatusInfo;
+
+static CmdStatusInfo *cmd_status_info = NULL;
+
+/*
+ * Prepare explain of query
+ *
+ */
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	ExplainState	*es;
+
+	es = NewExplainState();
+	es->analyze = false;
+	es->verbose = false;
+	es->buffers = false;
+	es->format = EXPLAIN_FORMAT_TEXT;
+
+	ExplainBeginOutput(es);
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	ExplainEndOutput(es);
+
+	/* Remove last line break */
+	if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
+		es->str->data[--es->str->len] = '\0';
+
+	return es->str;
+}
+
+static CmdStatusInfo *
+attach_shmem(CmdStatusInfo *cmd_status_info)
+{
+	bool	found;
+
+	if (cmd_status_info == NULL)
+	{
+		LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+		cmd_status_info = (CmdStatusInfo *) ShmemInitStruct("cmdstatusinfo",
+									sizeof(CmdStatusInfo),
+									&found);
+		if (!found)
+		{
+			int	i;
+
+			cmd_status_info->lock = LWLockAssign();
+			cmd_status_info->slots = ShmemAlloc(CMDINFO_SLOTS * sizeof(CmdStatusInfo));
+
+			for (i = 0; i < CMDINFO_SLOTS; i++)
+			{
+				cmd_status_info->slots[i].is_valid = false;
+			}
+
+			cmd_status_info->buffer_lock = LWLockAssign();
+			cmd_status_info->buffer = ShmemAlloc(BUFFER_SIZE);
+			cmd_status_info->buffer_is_free = true;
+			cmd_status_info->buffer_holds_data = true;
+		}
+
+		LWLockRelease(AddinShmemInitLock);
+	}
+
+	return cmd_status_info;
+}
+
+/*
+ * write data to shm buffer - wait for free buffer
+ *
+ */
+static void
+write_to_shm_buffer(int target_pid, int sender_pid, void *data, Size bytes)
+{
+	int	loop = 0;
+
+	cmd_status_info = attach_shmem(cmd_status_info);
+
+	while (1)
+	{
+		LWLockAcquire(cmd_status_info->buffer_lock, LW_EXCLUSIVE);
+
+		if (cmd_status_info->buffer_is_free)
+		{
+			cmd_status_info->target_pid = target_pid;
+			cmd_status_info->sender_pid = sender_pid;
+			cmd_status_info->buffer_is_free = false;
+			cmd_status_info->buffer_holds_data = true;
+
+			cmd_status_info->buffer_size = bytes;
+			memcpy(cmd_status_info->buffer, data, bytes);
+
+			LWLockRelease(cmd_status_info->buffer_lock);
+			break;
+		}
+		else
+		{
+			LWLockRelease(cmd_status_info->buffer_lock);
+
+			if (loop++ % 100 == 0);
+				CHECK_FOR_INTERRUPTS();
+
+			if (loop > 100000)
+				elog(ERROR, "cannot to take buffer to send data");
+
+			pg_usleep(1000L);
+		}
+	}
+}
+
+/*
+ * It read data from shm buffer, waits for data
+ *
+ */
+static void *
+read_from_shm_buffer(int target_pid, int sender_pid, Size *bytes)
+{
+	void *result = NULL;
+	int	loop = 0;
+
+	cmd_status_info = attach_shmem(cmd_status_info);
+
+	while (1)
+	{
+		LWLockAcquire(cmd_status_info->buffer_lock, LW_EXCLUSIVE);
+
+		if (cmd_status_info->buffer_holds_data &&
+				cmd_status_info->target_pid == target_pid &&
+				cmd_status_info->sender_pid == sender_pid)
+		{
+			result = palloc(cmd_status_info->buffer_size);
+			memcpy(result, cmd_status_info->buffer, cmd_status_info->buffer_size);
+			*bytes = cmd_status_info->buffer_size;
+
+			cmd_status_info->buffer_is_free = true;
+			cmd_status_info->buffer_holds_data = false;
+
+			LWLockRelease(cmd_status_info->buffer_lock);
+
+			break;
+		}
+		else
+		{
+			LWLockRelease(cmd_status_info->buffer_lock);
+
+			if (loop++ % 100 == 0);
+				CHECK_FOR_INTERRUPTS();
+
+			pg_usleep(1000L);
+		}
+	}
+
+	return result;
+}
+
+/* signal handler for PROCSIG_CMDSTATUS_INFO */
+void
+HandleCmdStatusInfoInterrupt(void)
+{
+	InterruptPending = true;
+	CmdStatusInfoRequested = true;
+
+	SetLatch(MyLatch);
+}
+
+void
+ProcessCmdStatusInfoRequest(void)
+{
+	bool	found = true;
+
+	cmd_status_info = attach_shmem(cmd_status_info);
+
+	/* search any request for current process */
+	while (found)
+	{
+		int		i;
+		CmdStatusInfoEntry *csie;
+
+		found = false;
+		csie = NULL;
+
+		/* take lock for slots */
+		LWLockAcquire(cmd_status_info->lock, LW_EXCLUSIVE);
+
+		/* try to find any request in valid slots */
+		for (i = 0; i < CMDINFO_SLOTS; i++)
+		{
+			csie = &(cmd_status_info->slots[i]);
+
+			if (csie->is_valid && !csie->is_done && csie->target_pid == MyProcPid)
+			{
+				found = true;
+				break;
+			}
+		}
+
+		LWLockRelease(cmd_status_info->lock);
+
+		if (found)
+		{
+			/* process request */
+			Assert(csie != NULL);
+
+			csie->is_done = true;
+			csie->result_code = -1;
+
+			if (ActivePortal)
+			{
+				if (csie->request_type == 1)
+				{
+					if (ActivePortal->queryDesc != NULL)
+					{
+						StringInfo str;
+
+						str = explain_query(ActivePortal->queryDesc);
+						write_to_shm_buffer(csie->sender_pid, MyProcPid, (void *) str->data, str->len);
+
+						pfree(str->data);
+
+						csie->result_code = 0;
+					}
+					else
+					{
+#define NO_QUERY_DESC_MSG "(the running query description could not be found)"
+						write_to_shm_buffer(csie->sender_pid, MyProcPid,
+											NO_QUERY_DESC_MSG,
+											sizeof(NO_QUERY_DESC_MSG) - 1);
+#undef NO_QUERY_DESC_MSG
+					}
+				}
+				else if (csie->request_type == 2)
+				{
+					write_to_shm_buffer(csie->sender_pid, MyProcPid,
+											 (void *) ActivePortal->sourceText,
+											 strlen(ActivePortal->sourceText));
+					csie->result_code = 0;
+				}
+				else if (csie->request_type == 3)
+				{
+					if (ActivePortal->commandTag != NULL)
+					{
+						if (ActivePortal->queryDesc != NULL &&
+							ActivePortal->queryDesc->estate != NULL)
+						{
+							char completionTag[COMPLETION_TAG_BUFSIZE];
+
+							snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %u",
+									 ActivePortal->commandTag,
+									 ActivePortal->queryDesc->estate->es_processed);
+
+							write_to_shm_buffer(csie->sender_pid, MyProcPid,
+											 (void *) completionTag,
+											 strlen(completionTag));
+						}
+						else
+						{
+							write_to_shm_buffer(csie->sender_pid, MyProcPid,
+											 (void *) ActivePortal->commandTag,
+											 strlen(ActivePortal->commandTag));
+						}
+
+						csie->result_code = 0;
+					}
+					else
+					{
+#define NO_COMMAND_TAG_MSG "(a command tag is not set for the running query)"
+						write_to_shm_buffer(csie->sender_pid, MyProcPid,
+											NO_COMMAND_TAG_MSG,
+											sizeof(NO_COMMAND_TAG_MSG) - 1);
+#undef NO_COMMAND_TAG_MSG
+					}
+				}
+				else
+				{
+#define UNKNOWN_REQUEST_TYPE_MSG "(unknown request_type)"
+					write_to_shm_buffer(csie->sender_pid, MyProcPid,
+										UNKNOWN_REQUEST_TYPE_MSG,
+										sizeof(UNKNOWN_REQUEST_TYPE_MSG) - 1);
+#undef UNKNOWN_REQUEST_TYPE_MSG
+				}
+			}
+			else
+			{
+#define NO_QUERY_RUNNING_MSG "(no query is currently running)"
+					write_to_shm_buffer(csie->sender_pid, MyProcPid,
+										NO_QUERY_RUNNING_MSG,
+										sizeof(NO_QUERY_RUNNING_MSG) - 1);
+#undef NO_QUERY_RUNNING_MSG
+
+			}
+		}
+	}
+}
+
+
+static CmdStatusInfoEntry *
+NewCmdStatusInfoEntry(int target_pid, int request_type)
+{
+	bool	found = false;
+	CmdStatusInfoEntry *csie = NULL;
+	int			i;
+
+	cmd_status_info = attach_shmem(cmd_status_info);
+
+	/* find unused slot */
+	LWLockAcquire(cmd_status_info->lock, LW_EXCLUSIVE);
+
+	for (i = 0; i < CMDINFO_SLOTS; i++)
+	{
+		csie = &(cmd_status_info->slots[i]);
+
+		if (!csie->is_valid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	LWLockRelease(cmd_status_info->lock);
+
+	if (!found)
+		elog(ERROR, "there are not free slots for cmdstatusinfo now");
+
+	csie->is_valid = true;
+	csie->is_done = false;
+	csie->target_pid = target_pid;
+	csie->sender_pid = MyProcPid;
+	csie->request_type = request_type;
+	csie->result_code = 0;
+
+	return csie;
+}
+
+/*
+ * try to get status of command in another process
+ *
+ * FUNCTION get_cmdstatus(pid, request_type)
+ * RETURNS SETOF text
+ */
+Datum
+pg_cmdstatus(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	int target_pid				= PG_GETARG_INT32(0);
+	int request_type 			= PG_GETARG_INT32(1);
+	PGPROC	   *proc;
+	CmdStatusInfoEntry *csie;
+
+	Size			len;
+	void			*data = NULL;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	csie = NewCmdStatusInfoEntry(target_pid, request_type);
+
+	PG_TRY();
+	{
+		/* verify access to target_pid */
+		proc = BackendPidGetProc(target_pid);
+
+		if (proc == NULL)
+			ereport(ERROR,
+					(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+		if (!(superuser() || proc->roleId == GetUserId()))
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+					 (errmsg("must be superuser or have the same role to cancel queries running in other server processes"))));
+
+		if (SendProcSignal((pid_t) target_pid, PROCSIG_CMDSTATUS_INFO, InvalidBackendId) < 0)
+			elog(ERROR, "could not signal backend with PID %d", target_pid);
+
+		while (1)
+		{
+			data = read_from_shm_buffer(MyProcPid, target_pid, &len);
+			if (len > 0)
+			{
+				Datum		value;
+				HeapTuple	tuple;
+				bool		isnull = false;
+				Size			processed = 0;
+				char			*cursor = data;
+
+				/* parse to rows */
+				while (processed < len)
+				{
+					char *eol = strchr(cursor, '\n');
+
+					if (eol != NULL)
+					{
+						int	line_size = eol - cursor;
+
+						value = PointerGetDatum(cstring_to_text_with_len(cursor, line_size));
+						cursor += line_size + 1;
+						processed += line_size + 1;
+					}
+					else
+					{
+						/* last line */
+						value = PointerGetDatum(cstring_to_text_with_len(cursor, len - processed));
+						processed = len;
+					}
+
+					tuple = heap_form_tuple(tupdesc, &value, &isnull);
+					tuplestore_puttuple(tupstore, tuple);
+				}
+
+				pfree(data);
+			}
+
+			if (csie->is_done)
+				break;
+		}
+
+		csie->is_valid = false;
+
+		/* clean up and return the tuplestore */
+		tuplestore_donestoring(tupstore);
+	}
+	PG_CATCH();
+	{
+		csie->is_valid = false;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..d083aaf 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3130,6 +3130,8 @@ DESCR("get OID of current session's temp schema, if any");
 DATA(insert OID = 2855 (  pg_is_other_temp_schema	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "26" _null_ _null_ _null_ _null_ _null_ pg_is_other_temp_schema _null_ _null_ _null_ ));
 DESCR("is schema another session's temp schema?");
 
+DATA(insert OID = 4099 ( pg_cmdstatus		PGNSP PGUID 12 1 100 0 0 f f f f f t s 2 0 25 "23 23" _null_ _null_  _null_  _null_ _null_ pg_cmdstatus _null_ _null_ _null_ ));
+DESCR("returns information about another process");
 DATA(insert OID = 2171 ( pg_cancel_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_cancel_backend _null_ _null_ _null_ ));
 DESCR("cancel a server process' current query");
 DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index af1a0cd..ab1698c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -41,6 +41,9 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+	/* cmd status info */
+	PROCSIG_CMDSTATUS_INFO,
+
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..605612e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1243,6 +1243,9 @@ extern Datum pg_identify_object_as_address(PG_FUNCTION_ARGS);
 /* catalog/objectaddress.c */
 extern Datum pg_get_object_address(PG_FUNCTION_ARGS);
 
+/* utils/adt/cmdstatus.c */
+extern Datum pg_cmdstatus(PG_FUNCTION_ARGS);
+
 /* commands/constraint.c */
 extern Datum unique_key_recheck(PG_FUNCTION_ARGS);
 
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
new file mode 100644
index 0000000..ff88ead
--- /dev/null
+++ b/src/include/utils/cmdstatus.h
@@ -0,0 +1,21 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.h
+ *	  Declarations for command status interrupt handling.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * src/include/utils/cmdstatus.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CMDSTATUS_H
+#define CMDSTATUS_H
+
+extern bool CmdStatusInfoRequested;
+
+extern void HandleCmdStatusInfoInterrupt(void);
+extern void ProcessCmdStatusInfoRequest(void);
+
+#endif   /* CMDSTATUS_H */
#13Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#12)
Re: On-demand running query plans using auto_explain and signals

Hi

2015-08-31 19:09 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Mon, Aug 31, 2015 at 12:35 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

/messages/by-id/CAFj8pRAXcS9B8ABgiM-zauVgGqDhPZOaRz5YSp1_Nhv9HP8nKw@mail.gmail.com

Ah, thanks! Somehow I've missed this mail. You didn't add the patch to
a commitfest back then I think?

I had no time to finish this patch - there is few issues in signal
handling and returning back result - but still I want it :) - and what I
know - almost all other SQL db has similar functionality.

I've updated the patch for the current master and also added some
unexpected parameters handling, so attached is a v2.

Thank you very much

I'd say we should hide the so-designed pg_cmdstatus() interface behind
more friendly calls like pg_explain_backend() and pg_backend_progress() to
give some naming examples, to remove the need for magic numbers in the
second arg.

I had similar idea - this is good enough for start, but target interface
iis based on integration with EXPLAIN statement

some like EXPLAIN PROCESS or EXPLAIN PID or EXPLAIN VERBOSE PID ..

What I've found missing in this approach is the insight into nested
executor runs, so that if you're running a "SELECT my_func()", you only see
this outer query in the pg_cmdstatus() output. With the auto_explain
approach, by hooking into executor I was able to capture the nested queries
and their plans as well.

I understand - originally I didn't think about nested queries, but it is
good idea and probably not a problem:

Not for XML and JSON where we can describe nesting simply

It is little bit harder for plain text - but we can use similar format that
is used for subplans or some like

top query:
SELECT fx()

nested (1. level) query:
SELECT ....

It's conceptually trivial to add some code to use the Executor hooks here,
but I don't see any precedent for this except for contrib modules
(auto_explain and pg_stat_statements), I'm just not sure if that would be
OK-ish.

And when we solve that, there is another problem of having a sane
interface to query the nested plans. For a psql user, probably the most
interesting would be the topmost (level=1) and the innermost (e.g.
level=-1) plans. We might also want to provide a full nesting of plans in
a structured format like JSON or... *cough* XML, for programs to consume
and display nicely with folding and stuff.

And the most interesting would be making instrumentation work with all of
the above.

the important functionality is drawing complete text of query - it was my
original motivation, because I had not way how to get complete query before
its finishing

Probably the communication between processes should be more complex :( -
the SHM queue should be used there, because some plans can be terrible long.

The using shared write buffer (one for all) is too simply solution probably
- good for prototype, but not good for core.

I have a idea about communication:

1. caller prepare buffer, shm queue and signalize target process -
parameter is pid od caller
2. target process fills a write buffer and close queue
3. caller show data and close buffer, close queue

Now almost all code for communication is in upstream - the missing part is
injection one end of queue to any process dynamicaly.

Regards

Pavel

Show quoted text

I'm adding this to the next CF.

--
Alex

#14Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#13)
Re: On-demand running query plans using auto_explain and signals

I'd say we should hide the so-designed pg_cmdstatus() interface behind

more friendly calls like pg_explain_backend() and pg_backend_progress() to
give some naming examples, to remove the need for magic numbers in the
second arg.

I had similar idea - this is good enough for start, but target interface
iis based on integration with EXPLAIN statement

some like EXPLAIN PROCESS or EXPLAIN PID or EXPLAIN VERBOSE PID ..

Yes, that's another way to do it.

the important functionality is drawing complete text of query - it was my

original motivation, because I had not way how to get complete query before
its finishing

Probably the communication between processes should be more complex :( -
the SHM queue should be used there, because some plans can be terrible long.

The using shared write buffer (one for all) is too simply solution
probably - good for prototype, but not good for core.

I have a idea about communication:

1. caller prepare buffer, shm queue and signalize target process -
parameter is pid od caller
2. target process fills a write buffer and close queue
3. caller show data and close buffer, close queue

Now almost all code for communication is in upstream - the missing part is
injection one end of queue to any process dynamicaly.

I'm not familiar with the shared memory handling, but could we not allocate
just enough shared memory to fit the data we're going to write instead of
the fixed 8k? It's not that we cannot know the length of the resulting
plan text in advance.

I think we can remove buffer_is_free/buffer_holds_data and just use the
buffer != NULL to check if there's some data to be read (and buffer == NULL
to check if we can write).

--
Alex

#15Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#14)
Re: On-demand running query plans using auto_explain and signals

2015-09-01 15:00 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

I'd say we should hide the so-designed pg_cmdstatus() interface behind

more friendly calls like pg_explain_backend() and pg_backend_progress() to
give some naming examples, to remove the need for magic numbers in the
second arg.

I had similar idea - this is good enough for start, but target interface
iis based on integration with EXPLAIN statement

some like EXPLAIN PROCESS or EXPLAIN PID or EXPLAIN VERBOSE PID ..

Yes, that's another way to do it.

the important functionality is drawing complete text of query - it was my

original motivation, because I had not way how to get complete query before
its finishing

Probably the communication between processes should be more complex :( -
the SHM queue should be used there, because some plans can be terrible long.

The using shared write buffer (one for all) is too simply solution
probably - good for prototype, but not good for core.

I have a idea about communication:

1. caller prepare buffer, shm queue and signalize target process -
parameter is pid od caller
2. target process fills a write buffer and close queue
3. caller show data and close buffer, close queue

Now almost all code for communication is in upstream - the missing part
is injection one end of queue to any process dynamicaly.

I'm not familiar with the shared memory handling, but could we not
allocate just enough shared memory to fit the data we're going to write
instead of the fixed 8k? It's not that we cannot know the length of the
resulting plan text in advance.

the shared memory cannot be reused - (released) :(, so allocating enough
memory is not effective. More - in this moment it has not sense. Shared
memory queue can do almost all work.

Show quoted text

I think we can remove buffer_is_free/buffer_holds_data and just use the
buffer != NULL to check if there's some data to be read (and buffer == NULL
to check if we can write).

--
Alex

#16Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#15)
Re: On-demand running query plans using auto_explain and signals

I'm not familiar with the shared memory handling, but could we not

allocate just enough shared memory to fit the data we're going to write
instead of the fixed 8k? It's not that we cannot know the length of the
resulting plan text in advance.

the shared memory cannot be reused - (released) :(, so allocating enough
memory is not effective. More - in this moment it has not sense. Shared
memory queue can do almost all work.

A-ha, I've discovered the shared memory message queue facility and I see
how we can use it.

But do we really need the slots mechanism? Would it not be OK to just let
the LWLock do the sequencing of concurrent requests? Given that we only
going to use one message queue per cluster, there's not much concurrency
you can gain by introducing slots I believe.

--
Alex

#17Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#16)
Re: On-demand running query plans using auto_explain and signals

2015-09-01 17:20 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

I'm not familiar with the shared memory handling, but could we not

allocate just enough shared memory to fit the data we're going to write
instead of the fixed 8k? It's not that we cannot know the length of the
resulting plan text in advance.

the shared memory cannot be reused - (released) :(, so allocating enough
memory is not effective. More - in this moment it has not sense. Shared
memory queue can do almost all work.

A-ha, I've discovered the shared memory message queue facility and I see
how we can use it.

But do we really need the slots mechanism? Would it not be OK to just let
the LWLock do the sequencing of concurrent requests? Given that we only
going to use one message queue per cluster, there's not much concurrency
you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any
process, then all problems should be off after end of processes. One
message queue per cluster needs restart cluster when some pathological
problems are - and you cannot restart cluster in production week, sometimes
weeks. The slots are more robust.

Pavel

Show quoted text

--
Alex

#18Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#17)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

But do we really need the slots mechanism? Would it not be OK to just
let the LWLock do the sequencing of concurrent requests? Given that we
only going to use one message queue per cluster, there's not much
concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any
process, then all problems should be off after end of processes. One
message queue per cluster needs restart cluster when some pathological
problems are - and you cannot restart cluster in production week, sometimes
weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a
queue/buffer. Did you intend to have a message queue per slot?

What sort of pathological problems are you concerned of? The communicating
backends should just detach from the message queue properly and have some
timeout configured to prevent deadlocks. Other than that, I don't see how
having N slots really help the problem: in case of pathological problems
you will just deplete them all sooner or later.

--
Alex

#19Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#18)
Re: On-demand running query plans using auto_explain and signals

2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

But do we really need the slots mechanism? Would it not be OK to just
let the LWLock do the sequencing of concurrent requests? Given that we
only going to use one message queue per cluster, there's not much
concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any
process, then all problems should be off after end of processes. One
message queue per cluster needs restart cluster when some pathological
problems are - and you cannot restart cluster in production week, sometimes
weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a
queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to be
used passing parameters, - message queue will be created/released by demand
by caller.

What sort of pathological problems are you concerned of? The
communicating backends should just detach from the message queue properly
and have some timeout configured to prevent deadlocks. Other than that, I
don't see how having N slots really help the problem: in case of
pathological problems you will just deplete them all sooner or later.

I afraid of unexpected problems :) - any part of signal handling or
multiprocess communication is fragile. Slots are simple and simply attached
to any process without necessity to alloc/free some memory.

Pavel

Show quoted text

--
Alex

#20Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#19)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

But do we really need the slots mechanism? Would it not be OK to just
let the LWLock do the sequencing of concurrent requests? Given that we
only going to use one message queue per cluster, there's not much
concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to any
process, then all problems should be off after end of processes. One
message queue per cluster needs restart cluster when some pathological
problems are - and you cannot restart cluster in production week, sometimes
weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a
queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to be
used passing parameters, - message queue will be created/released by demand
by caller.

I don't believe a message queue cannot really be reused. What would stop
us from calling shm_mq_create() on the queue struct again?

To give you an idea, in my current prototype I have only the following
struct:

typedef struct {
LWLock *lock;
/*CmdStatusInfoSlot slots[CMDINFO_SLOTS];*/
pid_t target_pid;
pid_t sender_pid;
int request_type;
int result_code;
shm_mq buffer;
} CmdStatusInfo;

An instance of this is allocated on shared memory once, using BUFFER_SIZE
of 8k.

In pg_cmdstatus() I lock on the LWLock to check if target_pid is 0, then it
means nobody else is using this communication channel at the moment. If
that's the case, I set the pids and request_type and initialize the mq
buffer. Otherwise I just sleep and retry acquiring the lock (a timeout
should be added here probably).

What sort of pathological problems are you concerned of? The communicating

backends should just detach from the message queue properly and have some
timeout configured to prevent deadlocks. Other than that, I don't see how
having N slots really help the problem: in case of pathological problems
you will just deplete them all sooner or later.

I afraid of unexpected problems :) - any part of signal handling or
multiprocess communication is fragile. Slots are simple and simply attached
to any process without necessity to alloc/free some memory.

Yes, but do slots solve the actual problem? If there is only one message
queue, you still have the same problem regardless of the number of slots
you decide to have.

--
Alex

#21Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#20)
Re: On-demand running query plans using auto_explain and signals

2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

But do we really need the slots mechanism? Would it not be OK to just
let the LWLock do the sequencing of concurrent requests? Given that we
only going to use one message queue per cluster, there's not much
concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to
any process, then all problems should be off after end of processes. One
message queue per cluster needs restart cluster when some pathological
problems are - and you cannot restart cluster in production week, sometimes
weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a
queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to be
used passing parameters, - message queue will be created/released by demand
by caller.

I don't believe a message queue cannot really be reused. What would stop
us from calling shm_mq_create() on the queue struct again?

you cannot to change recipient later

Show quoted text

To give you an idea, in my current prototype I have only the following
struct:

typedef struct {
LWLock *lock;
/*CmdStatusInfoSlot slots[CMDINFO_SLOTS];*/
pid_t target_pid;
pid_t sender_pid;
int request_type;
int result_code;
shm_mq buffer;
} CmdStatusInfo;

An instance of this is allocated on shared memory once, using BUFFER_SIZE
of 8k.

In pg_cmdstatus() I lock on the LWLock to check if target_pid is 0, then
it means nobody else is using this communication channel at the moment. If
that's the case, I set the pids and request_type and initialize the mq
buffer. Otherwise I just sleep and retry acquiring the lock (a timeout
should be added here probably).

What sort of pathological problems are you concerned of? The

communicating backends should just detach from the message queue properly
and have some timeout configured to prevent deadlocks. Other than that, I
don't see how having N slots really help the problem: in case of
pathological problems you will just deplete them all sooner or later.

I afraid of unexpected problems :) - any part of signal handling or
multiprocess communication is fragile. Slots are simple and simply attached
to any process without necessity to alloc/free some memory.

Yes, but do slots solve the actual problem? If there is only one message
queue, you still have the same problem regardless of the number of slots
you decide to have.

--
Alex

#22Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#21)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 2, 2015 at 2:56 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

But do we really need the slots mechanism? Would it not be OK to
just let the LWLock do the sequencing of concurrent requests? Given that
we only going to use one message queue per cluster, there's not much
concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to
any process, then all problems should be off after end of processes. One
message queue per cluster needs restart cluster when some pathological
problems are - and you cannot restart cluster in production week, sometimes
weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a
queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to
be used passing parameters, - message queue will be created/released by
demand by caller.

I don't believe a message queue cannot really be reused. What would stop
us from calling shm_mq_create() on the queue struct again?

you cannot to change recipient later

Well, maybe I'm missing something, but sh_mq_create() will just overwrite
the contents of the struct, so it doesn't care about sender/receiver: only
sh_mq_set_sender/receiver() do.

#23Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#22)
Re: On-demand running query plans using auto_explain and signals

2015-09-02 15:00 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Wed, Sep 2, 2015 at 2:56 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-02 12:36 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Wed, Sep 2, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-02 11:01 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Tue, Sep 1, 2015 at 7:02 PM, Pavel Stehule <pavel.stehule@gmail.com

wrote:

But do we really need the slots mechanism? Would it not be OK to
just let the LWLock do the sequencing of concurrent requests? Given that
we only going to use one message queue per cluster, there's not much
concurrency you can gain by introducing slots I believe.

I afraid of problems on production. When you have a queue related to
any process, then all problems should be off after end of processes. One
message queue per cluster needs restart cluster when some pathological
problems are - and you cannot restart cluster in production week, sometimes
weeks. The slots are more robust.

Yes, but in your implementation the slots themselves don't have a
queue/buffer. Did you intend to have a message queue per slot?

The message queue cannot be reused, so I expect one slot per caller to
be used passing parameters, - message queue will be created/released by
demand by caller.

I don't believe a message queue cannot really be reused. What would
stop us from calling shm_mq_create() on the queue struct again?

you cannot to change recipient later

Well, maybe I'm missing something, but sh_mq_create() will just overwrite
the contents of the struct, so it doesn't care about sender/receiver: only
sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.

Pavel

#24Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#23)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Well, maybe I'm missing something, but sh_mq_create() will just overwrite
the contents of the struct, so it doesn't care about sender/receiver: only
sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.

That was my plan. I'm now close to actually compiling this stuff, we'll
see if it works in the simplest case at least. :-)

#25Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#24)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Well, maybe I'm missing something, but sh_mq_create() will just
overwrite the contents of the struct, so it doesn't care about
sender/receiver: only sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.

Please find attached a v3.

It uses a shared memory queue and also has the ability to capture plans
nested deeply in the call stack. Not sure about using the executor hook,
since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and the
message queue, the IO synchronization is handled by the message queue
itself. After some testing with concurrent pgbench and intentionally deep
recursive plpgsql functions (up to 700 plpgsql stack frames) I think this
approach can work. Unless there's some theoretical problem I'm just not
aware of. :-)

Comments welcome!
--
Alex

Attachments:

explain-pid-v3.patchtext/x-patch; charset=US-ASCII; name=explain-pid-v3.patchDownload
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..40db40d 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMDSTATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ce4bdaf..5d5df58 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoRequested)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 0000000..5f31a2d
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/latch.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_COMMAND_TAG
+} CmdStatusInfoResultCode;
+
+typedef struct {
+	LWLock	   *lock;
+	pid_t		target_pid;
+	pid_t		sender_pid;
+	CmdStatusInfoRequestType	request_type;
+	CmdStatusInfoResultCode		result_code;
+	char		buffer[FLEXIBLE_ARRAY_MEMBER];
+} CmdStatusInfo;
+
+#define BUFFER_SIZE			8192
+
+/*
+ * These structs are allocated on the program stack as local variables in the
+ * ExecutorRun hook.  The top of stack is current_query_stack, see below.
+ */
+typedef struct CmdInfoStack {
+	QueryDesc			   *query_desc;
+	struct CmdInfoStack	   *parent;
+} CmdInfoStack;
+
+
+bool CmdStatusInfoEnabled = true;
+bool CmdStatusInfoRequested = false;
+
+static CmdStatusInfo   *cmd_status_info = NULL;
+static CmdInfoStack	   *current_query_stack = NULL;
+static int				query_stack_size = 0;
+
+static ExecutorRun_hook_type	prev_ExecutorRun = NULL;
+
+static void attach_shmem(void);
+static void cmdstatus_ExecutorRun(QueryDesc *queryDesc,
+								  ScanDirection direction, long count);
+
+
+void
+InstallCmdStatusInfoHooks(void)
+{
+	/*elog(LOG, "InstallCmdStatusInfoHooks");*/
+
+	prev_ExecutorRun = ExecutorRun_hook;
+	ExecutorRun_hook = cmdstatus_ExecutorRun;
+}
+
+
+static void
+attach_shmem(void)
+{
+	bool	found;
+
+	Assert(sizeof(CmdStatusInfo) < BUFFER_SIZE);
+
+	if (cmd_status_info == NULL)
+	{
+		LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+		cmd_status_info = (CmdStatusInfo *) ShmemInitStruct("cmdstatusinfo",
+															BUFFER_SIZE, &found);
+		if (!found)
+		{
+			cmd_status_info->lock = LWLockAssign();
+
+			cmd_status_info->target_pid = 0;
+			cmd_status_info->sender_pid = 0;
+			cmd_status_info->request_type = 0;
+			cmd_status_info->result_code = 0;
+		}
+
+		LWLockRelease(AddinShmemInitLock);
+	}
+}
+
+
+static void
+cmdstatus_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+{
+	CmdInfoStack	current;
+
+	/*ereport(LOG,
+			(errmsg("ExecutorRun   : %p\t%d", queryDesc, query_stack_size),
+			 errhidecontext(true),
+			 errhidestmt(true)));*/
+
+	current.query_desc = queryDesc;
+	current.parent = current_query_stack;
+
+	current_query_stack = &current;
+	query_stack_size++;
+
+	PG_TRY();
+	{
+		if (prev_ExecutorRun)
+			prev_ExecutorRun(queryDesc, direction, count);
+		else
+			standard_ExecutorRun(queryDesc, direction, count);
+
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+	}
+	PG_CATCH();
+	{
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	ExplainState	*es;
+
+	es = NewExplainState();
+	es->analyze = false;
+	es->verbose = false;
+	es->buffers = false;
+	es->format = EXPLAIN_FORMAT_TEXT;
+
+	ExplainBeginOutput(es);
+	/* XXX: appendStringInfo(es->str, "#%d ", depth); ? */
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	ExplainEndOutput(es);
+
+	/* Remove last line break. */
+	if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
+		es->str->data[--es->str->len] = '\0';
+
+	return es->str;
+}
+
+static void
+send_explain_plan(shm_mq_handle *shmq, const char *str)
+{
+	/* Break plan into lines and send each as a separate message. */
+	while (*str)
+	{
+		const char	   *q   = strchr(str, '\n');
+		Size			len = q ? q - str : strlen(str);
+
+		shm_mq_send(shmq, len, str, false);
+
+		if (!q)
+			break;
+
+		str += len + 1;
+	}
+}
+
+static void
+explain_one_query_desc(shm_mq_handle *shmq, QueryDesc *query_desc)
+{
+	StringInfo		str;
+
+	str = explain_query(query_desc);
+
+	send_explain_plan(shmq, str->data);
+
+	pfree(str->data);
+	pfree(str);
+}
+
+
+/* signal handler for PROCSIG_CMDSTATUS_INFO */
+void
+HandleCmdStatusInfoInterrupt(void)
+{
+	if (CmdStatusInfoEnabled && !CmdStatusInfoRequested)
+	{
+		InterruptPending = true;
+		CmdStatusInfoRequested = true;
+
+		SetLatch(MyLatch);
+	}
+}
+
+void
+ProcessCmdStatusInfoRequest(void)
+{
+	shm_mq		   *buffer = NULL;
+	shm_mq_handle  *buffer_handle = NULL;
+
+	if (!(CmdStatusInfoEnabled && CmdStatusInfoRequested))
+	  return;
+
+	CmdStatusInfoEnabled = false;
+	CmdStatusInfoRequested = false;
+
+	PG_TRY();
+	{
+		attach_shmem();
+
+		LWLockAcquire(cmd_status_info->lock, LW_EXCLUSIVE);
+
+		if (cmd_status_info->target_pid == 0)
+		{
+			LWLockRelease(cmd_status_info->lock);
+
+			elog(LOG, "target backend PID is not set");
+			return;
+		}
+		else if (cmd_status_info->target_pid != MyProcPid)
+		{
+			pid_t	target_pid = cmd_status_info->target_pid;
+
+			LWLockRelease(cmd_status_info->lock);
+
+			elog(LOG, "target backend PID doesn't match: expected %d, but seen %d",
+				 MyProcPid, target_pid);
+			return;
+		}
+
+		buffer = (shm_mq *) &cmd_status_info->buffer;
+		shm_mq_set_sender(buffer, MyProc);
+
+		buffer_handle = shm_mq_attach(buffer, NULL, NULL);
+
+		/* Show some optimism. */
+		cmd_status_info->result_code = CMD_STATUS_RESULT_SUCCESS;
+
+		/*
+		 * It should be safe to release the lock now, since the IO
+		 * synchronization is performed on the shared memory queue itself.
+		 */
+		LWLockRelease(cmd_status_info->lock);
+
+		if (ActivePortal)
+		{
+			switch (cmd_status_info->request_type)
+			{
+				case CMD_STATUS_REQUEST_EXPLAIN:
+					explain_one_query_desc(buffer_handle, ActivePortal->queryDesc);
+					break;
+
+				case CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE:
+				{
+					CmdInfoStack	   *query;
+
+					if (current_query_stack != NULL)
+					{
+						for (query = current_query_stack;
+							 query != NULL;
+							 query = query->parent)
+						{
+							explain_one_query_desc(buffer_handle, query->query_desc);
+						}
+					}
+					else
+					{
+						/* XXX */
+						cmd_status_info->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+					}
+					break;
+				}
+
+				case CMD_STATUS_REQUEST_QUERY_TEXT:
+					/* XXX: how useful is this one?  we already have pg_stat_activity */
+					shm_mq_send(buffer_handle, strlen(ActivePortal->sourceText),
+								ActivePortal->sourceText, false);
+					break;
+
+				case CMD_STATUS_REQUEST_PROGRESS_TAG:
+					if (ActivePortal->commandTag != NULL)
+					{
+						if (ActivePortal->queryDesc != NULL &&
+							ActivePortal->queryDesc->estate != NULL)
+						{
+							char completionTag[COMPLETION_TAG_BUFSIZE];
+
+							snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %u",
+									 ActivePortal->commandTag,
+									 ActivePortal->queryDesc->estate->es_processed);
+
+							shm_mq_send(buffer_handle, strlen(completionTag),
+										completionTag, false);
+						}
+						else
+						{
+							/* no progress available, at least show the command tag */
+							shm_mq_send(buffer_handle, strlen(ActivePortal->commandTag),
+										ActivePortal->commandTag, false);
+						}
+					}
+					else
+					{
+						cmd_status_info->result_code = CMD_STATUS_RESULT_NO_COMMAND_TAG;
+					}
+					break;
+			}
+		}
+		else
+		{
+			cmd_status_info->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+		}
+
+		shm_mq_detach(buffer);
+
+		CmdStatusInfoEnabled = true;
+	}
+	PG_CATCH();
+	{
+		if (buffer_handle)
+			shm_mq_detach(buffer);
+
+		CmdStatusInfoEnabled = true;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+static void
+report_result_code(CmdStatusInfoResultCode result, pid_t target_pid)
+{
+	switch (result)
+	{
+		case CMD_STATUS_RESULT_BACKEND_IDLE:
+			elog(INFO, "no command is currently running in backend with PID %d",
+				 target_pid);
+			break;
+
+		case CMD_STATUS_RESULT_NO_COMMAND_TAG:
+			elog(WARNING, "no command tag found for the query in backend with PID %d",
+				 target_pid);
+			break;
+
+		default:
+			elog(ERROR, "general command status request failure");
+			break;
+	}
+}
+
+/*
+ * try to get status of command in another process
+ *
+ * FUNCTION get_cmdstatus(pid, request_type)
+ * RETURNS SETOF text
+ */
+Datum
+pg_cmdstatus(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate *tupstore;
+	TupleDesc	tupdesc;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	pid_t target_pid		= (pid_t) PG_GETARG_INT32(0);
+	int request_type		= PG_GETARG_INT32(1);
+	PGPROC	   *proc;
+
+	shm_mq		   *buffer = NULL;
+	shm_mq_handle  *buffer_handle = NULL;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (request_type < 1 || request_type > CMD_STATUS_MAX_REQUEST)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("unknown command status request")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	PG_TRY();
+	{
+		bool			done;
+
+		if (target_pid == MyProcPid)
+			ereport(ERROR,
+					(errmsg("backend cannot query command status of itself")));
+
+		/* verify access to target_pid */
+		proc = BackendPidGetProc(target_pid);
+
+		if (proc == NULL)
+			ereport(ERROR,
+					(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+		if (!(superuser() || proc->roleId == GetUserId()))
+			ereport(ERROR,
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+					 (errmsg("must be superuser or have the same role to cancel queries running in other server processes"))));
+
+		attach_shmem();
+
+		/* wait before the shared status struct is free to be used */
+		for (;;)
+		{
+			LWLockAcquire(cmd_status_info->lock, LW_EXCLUSIVE);
+
+			if (cmd_status_info->target_pid == 0)
+			{
+				cmd_status_info->target_pid = target_pid;
+				cmd_status_info->sender_pid = MyProcPid;
+				cmd_status_info->request_type = request_type;
+				cmd_status_info->result_code = CMD_STATUS_RESULT_FAILURE;
+
+				buffer = shm_mq_create(&cmd_status_info->buffer,
+									   BUFFER_SIZE - offsetof(CmdStatusInfo, buffer));
+
+				shm_mq_set_receiver(buffer, MyProc);
+
+				buffer_handle = shm_mq_attach(buffer, NULL, NULL);
+
+				LWLockRelease(cmd_status_info->lock);
+				break;
+			}
+			else
+			{
+				LWLockRelease(cmd_status_info->lock);
+				pg_usleep(1000L);
+			}	
+		}
+
+		if (SendProcSignal(target_pid, PROCSIG_CMDSTATUS_INFO, InvalidBackendId) < 0)
+			elog(ERROR, "could not signal backend with PID %d", target_pid);
+
+		done = false;
+		while (!done)
+		{
+			char   *data;
+			Size	len;
+			shm_mq_result	res;
+
+			res = shm_mq_receive(buffer_handle, &len, (void **) &data,
+								 /* nowait = */ false);
+			switch (res)
+			{
+				case SHM_MQ_SUCCESS:
+				{
+					Datum		value;
+					HeapTuple	tuple;
+					bool		isnull = false;
+
+					value = PointerGetDatum(cstring_to_text_with_len(data, len));
+
+					tuple = heap_form_tuple(tupdesc, &value, &isnull);
+					tuplestore_puttuple(tupstore, tuple);
+					break;
+				}
+
+				case SHM_MQ_DETACHED:
+					if (cmd_status_info->result_code != CMD_STATUS_RESULT_SUCCESS)
+					{
+						report_result_code(cmd_status_info->result_code, target_pid);
+					}
+					done = true;
+					break;
+
+				default:
+					Assert(!"unknown shm_mq_result code");
+			}
+		}
+
+		/* clean up and return the tuplestore */
+		tuplestore_donestoring(tupstore);
+
+		shm_mq_detach(buffer);
+
+		/* clean the status struct for the next user */
+		LWLockAcquire(cmd_status_info->lock, LW_EXCLUSIVE);
+
+		cmd_status_info->target_pid = 0;
+		cmd_status_info->sender_pid = 0;
+		cmd_status_info->request_type = 0;
+		cmd_status_info->result_code = 0;
+
+		LWLockRelease(cmd_status_info->lock);
+	}
+	PG_CATCH();
+	{
+		if (buffer_handle)
+			shm_mq_detach(buffer);
+
+		if (cmd_status_info)
+		{
+			LWLockAcquire(cmd_status_info->lock, LW_EXCLUSIVE);
+
+			cmd_status_info->target_pid = 0;
+			cmd_status_info->sender_pid = 0;
+			cmd_status_info->request_type = 0;
+			cmd_status_info->result_code = 0;
+
+			LWLockRelease(cmd_status_info->lock);
+		}
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7b19714..19e7009 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
+#include "utils/cmdstatus.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -1017,6 +1018,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	/* initialize client encoding */
 	InitializeClientEncoding();
 
+	InstallCmdStatusInfoHooks();
+
 	/* report this backend in the PgBackendStatus array */
 	if (!bootstrap)
 		pgstat_bestart();
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..d083aaf 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3130,6 +3130,8 @@ DESCR("get OID of current session's temp schema, if any");
 DATA(insert OID = 2855 (  pg_is_other_temp_schema	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "26" _null_ _null_ _null_ _null_ _null_ pg_is_other_temp_schema _null_ _null_ _null_ ));
 DESCR("is schema another session's temp schema?");
 
+DATA(insert OID = 4099 ( pg_cmdstatus		PGNSP PGUID 12 1 100 0 0 f f f f f t s 2 0 25 "23 23" _null_ _null_  _null_  _null_ _null_ pg_cmdstatus _null_ _null_ _null_ ));
+DESCR("returns information about another process");
 DATA(insert OID = 2171 ( pg_cancel_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_cancel_backend _null_ _null_ _null_ ));
 DESCR("cancel a server process' current query");
 DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index af1a0cd..ab1698c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -41,6 +41,9 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+	/* cmd status info */
+	PROCSIG_CMDSTATUS_INFO,
+
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..605612e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1243,6 +1243,9 @@ extern Datum pg_identify_object_as_address(PG_FUNCTION_ARGS);
 /* catalog/objectaddress.c */
 extern Datum pg_get_object_address(PG_FUNCTION_ARGS);
 
+/* utils/adt/cmdstatus.c */
+extern Datum pg_cmdstatus(PG_FUNCTION_ARGS);
+
 /* commands/constraint.c */
 extern Datum unique_key_recheck(PG_FUNCTION_ARGS);
 
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
new file mode 100644
index 0000000..52ddf6a
--- /dev/null
+++ b/src/include/utils/cmdstatus.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.h
+ *	  Declarations for command status interrupt handling.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * src/include/utils/cmdstatus.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CMDSTATUS_H
+#define CMDSTATUS_H
+
+extern bool CmdStatusInfoRequested;
+
+extern void InstallCmdStatusInfoHooks(void);
+
+extern void HandleCmdStatusInfoInterrupt(void);
+extern void ProcessCmdStatusInfoRequest(void);
+
+#endif   /* CMDSTATUS_H */
#26Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#25)
Re: On-demand running query plans using auto_explain and signals

Hi

2015-09-03 18:30 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Well, maybe I'm missing something, but sh_mq_create() will just
overwrite the contents of the struct, so it doesn't care about
sender/receiver: only sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.

Please find attached a v3.

It uses a shared memory queue and also has the ability to capture plans
nested deeply in the call stack. Not sure about using the executor hook,
since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and the
message queue, the IO synchronization is handled by the message queue
itself. After some testing with concurrent pgbench and intentionally deep
recursive plpgsql functions (up to 700 plpgsql stack frames) I think this
approach can work. Unless there's some theoretical problem I'm just not
aware of. :-)

Comments welcome!

I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS in
one time can be executed per server - I remember lot of queries that
doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be
unfriendly. Cannot to say if it is good enough for first iteration. This is
functionality that can be used for diagnostic when you have overloaded
server and this risk looks too high (for me). The idea of receive slot can
to solve this risk well. The difference from this code should not be too
big - although it is not trivial - needs work with PGPROC.

Other smaller issues:

* probably sending line by line is useless - shm_mq_send can pass bigger
data when nowait = false
* pg_usleep(1000L); - it is related to single point resource

Some ideas:

* this code share some important parts with auto_explain (query stack) -
and because it should be in core (due handling signal if I remember well),
it can be first step of integration auto_explain to core.

Show quoted text

--
Alex

#27Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#26)
Re: On-demand running query plans using auto_explain and signals

2015-09-03 22:06 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:

Show quoted text

Hi

2015-09-03 18:30 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Wed, Sep 2, 2015 at 3:07 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Wed, Sep 2, 2015 at 3:04 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Well, maybe I'm missing something, but sh_mq_create() will just
overwrite the contents of the struct, so it doesn't care about
sender/receiver: only sh_mq_set_sender/receiver() do.

if you create sh_mq from scratch, then you can reuse structure.

Please find attached a v3.

It uses a shared memory queue and also has the ability to capture plans
nested deeply in the call stack. Not sure about using the executor hook,
since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and the
message queue, the IO synchronization is handled by the message queue
itself. After some testing with concurrent pgbench and intentionally deep
recursive plpgsql functions (up to 700 plpgsql stack frames) I think this
approach can work. Unless there's some theoretical problem I'm just not
aware of. :-)

Comments welcome!

I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS in
one time can be executed per server - I remember lot of queries that
doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be
unfriendly. Cannot to say if it is good enough for first iteration. This is
functionality that can be used for diagnostic when you have overloaded
server and this risk looks too high (for me). The idea of receive slot can
to solve this risk well (and can be used elsewhere). The difference from
this code should not be too big - although it is not trivial - needs work
with PGPROC. The opinion of our multiprocess experts can be interesting.
Maybe I am too careful.

Other smaller issues:

* probably sending line by line is useless - shm_mq_send can pass bigger
data when nowait = false
* pg_usleep(1000L); - it is related to single point resource

Some ideas:

* this code share some important parts with auto_explain (query stack) -
and because it should be in core (due handling signal if I remember well),
it can be first step of integration auto_explain to core.

--
Alex

#28Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#27)
Re: On-demand running query plans using auto_explain and signals

On Sep 3, 2015 10:14 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

Please find attached a v3.

It uses a shared memory queue and also has the ability to capture plans

nested deeply in the call stack. Not sure about using the executor hook,
since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and

the message queue, the IO synchronization is handled by the message queue
itself.

I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS

in one time can be executed per server - I remember lot of queries that
doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be
unfriendly. Cannot to say if it is good enough for first iteration. This is
functionality that can be used for diagnostic when you have overloaded
server and this risk looks too high (for me). The idea of receive slot can
to solve this risk well (and can be used elsewhere). The difference from
this code should not be too big - although it is not trivial - needs work
with PGPROC. The opinion of our multiprocess experts can be interesting.
Maybe I am too careful.

Sorry, but I still don't see how the slots help this issue - could you
please elaborate?

Other smaller issues:

* probably sending line by line is useless - shm_mq_send can pass bigger

data when nowait = false

I'm not sending it like that because of the message size - I just find it
more convenient. If you think it can be problematic, its easy to do this as
before, by splitting lines on the receiving side.

* pg_usleep(1000L); - it is related to single point resource

But not a highly concurrent one.

-
Alex

#29Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#28)
Re: On-demand running query plans using auto_explain and signals

2015-09-04 5:50 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>:

On Sep 3, 2015 10:14 PM, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

Please find attached a v3.

It uses a shared memory queue and also has the ability to capture

plans nested deeply in the call stack. Not sure about using the executor
hook, since this is not an extension...

The LWLock is used around initializing/cleaning the shared struct and

the message queue, the IO synchronization is handled by the message queue
itself.

I am not pretty happy from this design. Only one EXPLAIN PID/GET STATUS

in one time can be executed per server - I remember lot of queries that
doesn't handle CANCEL well ~ doesn't handle interrupt well, and this can be
unfriendly. Cannot to say if it is good enough for first iteration. This is
functionality that can be used for diagnostic when you have overloaded
server and this risk looks too high (for me). The idea of receive slot can
to solve this risk well (and can be used elsewhere). The difference from
this code should not be too big - although it is not trivial - needs work
with PGPROC. The opinion of our multiprocess experts can be interesting.
Maybe I am too careful.

Sorry, but I still don't see how the slots help this issue - could you
please elaborate?

with slot (or some similiar) there is not global locked resource. If I'll
have a time at weekend I'll try to write some prototype.

Other smaller issues:

* probably sending line by line is useless - shm_mq_send can pass

bigger data when nowait = false

I'm not sending it like that because of the message size - I just find it
more convenient. If you think it can be problematic, its easy to do this as
before, by splitting lines on the receiving side.

Yes, shm queue sending data immediately - so slicing on sender generates
more interprocess communication

* pg_usleep(1000L); - it is related to single point resource

But not a highly concurrent one.

I believe so it is not becessary - waiting (sleeping) can be deeper in
reading from queue - the code will be cleaner

Regard

Pavel

Show quoted text

-
Alex

#30Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#29)
Re: On-demand running query plans using auto_explain and signals

On Fri, Sep 4, 2015 at 6:11 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Sorry, but I still don't see how the slots help this issue - could you

please elaborate?

with slot (or some similiar) there is not global locked resource. If I'll
have a time at weekend I'll try to write some prototype.

But you will still lock on the slots list to find an unused one. How is
that substantially different from what I'm doing?

Other smaller issues:

* probably sending line by line is useless - shm_mq_send can pass

bigger data when nowait = false

I'm not sending it like that because of the message size - I just find it
more convenient. If you think it can be problematic, its easy to do this as
before, by splitting lines on the receiving side.

Yes, shm queue sending data immediately - so slicing on sender generates
more interprocess communication

Well, we are talking about hundreds to thousands bytes per plan in total.
And if my reading of shm_mq implementation is correct, if the message fits
into the shared memory buffer, the receiver gets the direct pointer to the
shared memory, no extra allocation/copy to process-local memory. So this
can be actually a win.

* pg_usleep(1000L); - it is related to single point resource

But not a highly concurrent one.

I believe so it is not becessary - waiting (sleeping) can be deeper in
reading from queue - the code will be cleaner

The only way I expect this line to be reached is when a concurrent
pg_cmdstatus() call is in progress: the receiving backend has set the
target_pid and has created the queue, released the lock and now waits to
read something from shm_mq. So the backend that's trying to also use this
communication channel can obtain the lwlock, checks if the channel is not
used at the time, fails and then it needs to check again, but that's going
to put a load on the CPU, so there's a small sleep.

The real problem could be if the process that was signaled to connect to
the message queue never handles the interrupt, and we keep waiting forever
in shm_mq_receive(). We could add a timeout parameter or just let the user
cancel the call: send a cancellation request, use pg_cancel_backend() or
set statement_timeout before running this.

--
Alex

#31Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#30)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

2015-09-07 11:55 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Fri, Sep 4, 2015 at 6:11 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Sorry, but I still don't see how the slots help this issue - could you

please elaborate?

with slot (or some similiar) there is not global locked resource. If I'll
have a time at weekend I'll try to write some prototype.

But you will still lock on the slots list to find an unused one. How is
that substantially different from what I'm doing?

It is not necessary - you can use similar technique to what it does PGPROC.
I am sending "lock free" demo.

I don't afraid about locks - short locks, when the range and time are
limited. But there are lot of bugs, and fixes with the name "do
interruptible some", and it is reason, why I prefer typical design for work
with shared memory.

Other smaller issues:

* probably sending line by line is useless - shm_mq_send can pass

bigger data when nowait = false

I'm not sending it like that because of the message size - I just find
it more convenient. If you think it can be problematic, its easy to do this
as before, by splitting lines on the receiving side.

Yes, shm queue sending data immediately - so slicing on sender generates
more interprocess communication

Well, we are talking about hundreds to thousands bytes per plan in total.
And if my reading of shm_mq implementation is correct, if the message fits
into the shared memory buffer, the receiver gets the direct pointer to the
shared memory, no extra allocation/copy to process-local memory. So this
can be actually a win.

you have to calculate with signals and interprocess communication. the cost
of memory allocation is not all.

* pg_usleep(1000L); - it is related to single point resource

But not a highly concurrent one.

I believe so it is not becessary - waiting (sleeping) can be deeper in
reading from queue - the code will be cleaner

The only way I expect this line to be reached is when a concurrent
pg_cmdstatus() call is in progress: the receiving backend has set the
target_pid and has created the queue, released the lock and now waits to
read something from shm_mq. So the backend that's trying to also use this
communication channel can obtain the lwlock, checks if the channel is not
used at the time, fails and then it needs to check again, but that's going
to put a load on the CPU, so there's a small sleep.

The real problem could be if the process that was signaled to connect to
the message queue never handles the interrupt, and we keep waiting forever
in shm_mq_receive(). We could add a timeout parameter or just let the user
cancel the call: send a cancellation request, use pg_cancel_backend() or
set statement_timeout before running this.

This is valid question - for begin we can use a statement_timeout and we
don't need to design some special (if you don't hold some important lock).

My example (the code has prototype quality) is little bit longer, but it
work without global lock - the requester doesn't block any other

Pavel

Show quoted text

--
Alex

Attachments:

signal_demo.patchtext/x-patch; charset=US-ASCII; name=signal_demo.patchDownload
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
new file mode 100644
index 32ac58f..f5db55a
*** a/src/backend/storage/ipc/ipci.c
--- b/src/backend/storage/ipc/ipci.c
***************
*** 43,48 ****
--- 43,49 ----
  #include "storage/procsignal.h"
  #include "storage/sinvaladt.h"
  #include "storage/spin.h"
+ #include "utils/signal_demo.h"
  
  
  shmem_startup_hook_type shmem_startup_hook = NULL;
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 139,144 ****
--- 140,146 ----
  		size = add_size(size, BTreeShmemSize());
  		size = add_size(size, SyncScanShmemSize());
  		size = add_size(size, AsyncShmemSize());
+ 		size = add_size(size, ProcPipeShmemSize());
  #ifdef EXEC_BACKEND
  		size = add_size(size, ShmemBackendArraySize());
  #endif
*************** CreateSharedMemoryAndSemaphores(bool mak
*** 243,248 ****
--- 245,251 ----
  	ReplicationOriginShmemInit();
  	WalSndShmemInit();
  	WalRcvShmemInit();
+ 	ProcPipeShmemInit();
  
  	/*
  	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
new file mode 100644
index 0abde43..c28b44f
*** a/src/backend/storage/ipc/procsignal.c
--- b/src/backend/storage/ipc/procsignal.c
***************
*** 27,32 ****
--- 27,33 ----
  #include "storage/sinval.h"
  #include "tcop/tcopprot.h"
  
+ #include "utils/signal_demo.h"
  
  /*
   * The SIGUSR1 signal is multiplexed to support signalling multiple event
*************** bool		set_latch_on_sigusr1;
*** 69,75 ****
  
  static ProcSignalSlot *ProcSignalSlots = NULL;
  static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
! 
  static bool CheckProcSignal(ProcSignalReason reason);
  static void CleanupProcSignalState(int status, Datum arg);
  
--- 70,76 ----
  
  static ProcSignalSlot *ProcSignalSlots = NULL;
  static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
!  
  static bool CheckProcSignal(ProcSignalReason reason);
  static void CleanupProcSignalState(int status, Datum arg);
  
*************** procsignal_sigusr1_handler(SIGNAL_ARGS)
*** 296,301 ****
--- 297,305 ----
  	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
  		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
  
+ 	if (CheckProcSignal(PROCSIG_SEND_INFO))
+ 		SendProcessInfoInterrupt();
+ 
  	if (set_latch_on_sigusr1)
  		SetLatch(MyLatch);
  
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
new file mode 100644
index d917af3..51701b8
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
*************** ProcessInterrupts(void)
*** 2991,2996 ****
--- 2991,2999 ----
  
  	if (ParallelMessagePending)
  		HandleParallelMessages();
+ 
+ 	if (SendProcessInfoPending)
+ 		HandleSendProcessInfo();
  }
  
  
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
new file mode 100644
index 3ed0b44..864970c
*** a/src/backend/utils/adt/Makefile
--- b/src/backend/utils/adt/Makefile
*************** OBJS = acl.o arrayfuncs.o array_expanded
*** 36,42 ****
  	tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
  	tsvector.o tsvector_op.o tsvector_parser.o \
  	txid.o uuid.o varbit.o varchar.o varlena.o version.o \
! 	windowfuncs.o xid.o xml.o
  
  like.o: like.c like_match.c
  
--- 36,42 ----
  	tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
  	tsvector.o tsvector_op.o tsvector_parser.o \
  	txid.o uuid.o varbit.o varchar.o varlena.o version.o \
! 	windowfuncs.o xid.o xml.o signal_demo.o
  
  like.o: like.c like_match.c
  
diff --git a/src/backend/utils/adt/signal_demo.c b/src/backend/utils/adt/signal_demo.c
new file mode 100644
index ...5de50b5
*** a/src/backend/utils/adt/signal_demo.c
--- b/src/backend/utils/adt/signal_demo.c
***************
*** 0 ****
--- 1,411 ----
+ #include "postgres.h"
+ 
+ #include "miscadmin.h"
+ 
+ #include "storage/backendid.h"
+ #include "storage/ipc.h"
+ #include "storage/procarray.h"
+ #include "storage/shm_mq.h"
+ #include "storage/shm_toc.h"
+ #include "utils/builtins.h"
+ #include "utils/resowner.h"
+ #include "utils/signal_demo.h"
+ 
+ #include "storage/dsm.h"
+ 
+ 
+ typedef struct
+ {
+ 	pid_t			pss_pid;
+ 	pid_t			sender_pss_pid;
+ 	BackendId		sender_backendId;
+ 	dsm_handle		dsm_seg_handle;
+ 	int			data_tag;
+ 	bool		is_valid;
+ 	bool		is_in_process;
+ 	bool		is_processed;
+ } ProcPipeSlot;
+ 
+ 
+ #define NumProcPipeSlots	(MaxBackends + NUM_AUXPROCTYPES)
+ 
+ #define PG_SEND_PROCESS_INFO_MAGIC 0x79fb2449
+ #define SEND_INFO_DEMO_TAG		1
+ 
+ static ProcPipeSlot *ProcPipeSlots = NULL;
+ static volatile ProcPipeSlot *MyProcPipeSlot = NULL;
+ 
+ volatile bool	SendProcessInfoPending = false;
+ 
+ static char *prepare_demo_data(int length, bool is_active_stmt);
+ 
+ 
+ Size
+ ProcPipeShmemSize(void)
+ {
+ 	return NumProcPipeSlots * sizeof(ProcPipeSlot);
+ }
+ 
+ void
+ ProcPipeShmemInit(void)
+ {
+ 	Size		size = ProcPipeShmemSize();
+ 	bool		found;
+ 
+ 	ProcPipeSlots = (ProcPipeSlot *)
+ 		ShmemInitStruct("ProcPipeSlots", size, &found);
+ 
+ 	if (!found)
+ 		MemSet(ProcPipeSlots, 0, size);
+ }
+ 
+ static void
+ CleanupProcPipeSlot(int status, Datum arg)
+ {
+ 	int		pss_idx = DatumGetInt32(arg);
+ 	volatile ProcPipeSlot *slot;
+ 
+ 	slot = &ProcPipeSlots[pss_idx - 1];
+ 	Assert(slot == MyProcPipeSlot);
+ 
+ 	MyProcPipeSlot = NULL;
+ 
+ 	if (slot->pss_pid != MyProcPid)
+ 	{
+ 		elog(LOG, "process %d releasing ProcSignal slot %d, but it contains %d",
+ 			 MyProcPid, pss_idx, (int) slot->pss_pid);
+ 		return;					/* XXX better to zero the slot anyway? */
+ 	}
+ 
+ 	slot->pss_pid = 0;
+ 	slot->is_valid = false;
+ }
+ 
+ void
+ ProcPipeInit(int pss_idx)
+ {
+ 	volatile ProcPipeSlot *slot;
+ 
+ 	slot = &ProcPipeSlots[pss_idx - 1];
+ 
+ 	if (slot->pss_pid != 0)
+ 		elog(LOG, "process %d taking over ProcPipe slot %d, but it's not empty",
+ 				 MyProcPid, pss_idx);
+ 
+ 	slot->pss_pid = MyProcPid;
+ 	slot->is_valid = false;
+ 
+ 	MyProcPipeSlot = slot;
+ 
+ 	on_shmem_exit(CleanupProcPipeSlot, Int32GetDatum(pss_idx));
+ }
+ 
+ /*
+  * Send request to data to targetted process. Result is taken from SendProcSignal
+  */
+ int
+ SendProcessInfoSignal(pid_t data_sender_pid, BackendId data_sender_backendId,
+ 		    dsm_handle dsm_seg_handle, int data_tag)
+ {
+ 	MyProcPipeSlot->sender_pss_pid = data_sender_pid;
+ 	MyProcPipeSlot->sender_backendId = data_sender_backendId;
+ 	MyProcPipeSlot->dsm_seg_handle = dsm_seg_handle;
+ 	MyProcPipeSlot->data_tag = data_tag;
+ 	MyProcPipeSlot->is_valid = true;
+ 	MyProcPipeSlot->is_in_process = false;
+ 	MyProcPipeSlot->is_processed = false;
+ 
+ 	return SendProcSignal(data_sender_pid, PROCSIG_SEND_INFO, data_sender_backendId);
+ }
+ 
+ void
+ InvalidateProcPipeSlot(void)
+ {
+ 	MyProcPipeSlot->is_valid = false;
+ }
+ 
+ bool
+ is_valid_result(void)
+ {
+ 	return MyProcPipeSlot->is_valid;
+ }
+ 
+ 
+ /*
+  * signal is handled quickly - don't do heavy work here
+  */
+ void
+ SendProcessInfoInterrupt(void)
+ {
+ 	volatile ProcPipeSlot *slot;
+ 	int		i;
+ 
+ 	for (i = NumProcPipeSlots - 1; i >= 0; i--)
+ 	{
+ 		slot = &ProcPipeSlots[i];
+ 		if (slot->sender_pss_pid == MyProcPid && slot->is_valid
+ 				 && !(slot->is_processed || slot->is_in_process))
+ 		{
+ 			SendProcessInfoPending = true;
+ 			InterruptPending = true;
+ 
+ 			SetLatch(MyLatch);
+ 
+ 			return;
+ 		}
+ 	}
+ }
+ 
+ static void
+ RestoreResourceOwner(ResourceOwner prevResourceOwner)
+ {
+ 	if (prevResourceOwner != CurrentResourceOwner)
+ 	{
+ 		ResourceOwner deleted_ro = CurrentResourceOwner;
+ 
+ 		CurrentResourceOwner = prevResourceOwner;
+ 		ResourceOwnerDelete(deleted_ro);
+ 	}
+ }
+ 
+ void
+ HandleSendProcessInfo(void)
+ {
+ 	volatile ProcPipeSlot *slot;
+ 
+ 	while (SendProcessInfoPending)
+ 	{
+ 		int		i;
+ 
+ 		SendProcessInfoPending = false;
+ 
+ 		for (i = NumProcPipeSlots - 1; i >= 0; i--)
+ 		{
+ 			slot = &ProcPipeSlots[i];
+ 
+ 			if (slot->sender_pss_pid == MyProcPid && slot->is_valid && !slot->is_processed)
+ 			{
+ 				ResourceOwner prevRO = CurrentResourceOwner;
+ 
+ 				dsm_segment	*seg = NULL;
+ 				shm_toc		*toc = NULL;
+ 				shm_mq_handle		*output;
+ 				shm_mq			*mq;
+ 				int		length, *length_ptr;
+ 
+ 				/* fast leave when tag is unsupported */
+ 				if (slot->data_tag != SEND_INFO_DEMO_TAG)
+ 				{
+ 					slot->is_processed = true;
+ 					slot->is_valid = false;
+ 					continue;
+ 				}
+ 
+ 				/* Ensure valid resource owner for access to dsm */
+ 				if (CurrentResourceOwner == NULL)
+ 					CurrentResourceOwner = ResourceOwnerCreate(NULL, "HandlerSendProcessInfo");
+ 
+ 				seg = dsm_attach(slot->dsm_seg_handle);
+ 				if (seg == NULL)
+ 				{
+ 					elog(LOG, "unable map dynamic memory segment");
+ 					slot->is_processed = true;
+ 					slot->is_valid = false;
+ 
+ 					RestoreResourceOwner(prevRO);
+ 
+ 					continue;
+ 				}
+ 
+ 				toc = shm_toc_attach(PG_SEND_PROCESS_INFO_MAGIC, dsm_segment_address(seg));
+ 				if (toc == NULL)
+ 				{
+ 					elog(LOG, "bad magic number in passed dynamic memory segment");
+ 
+ 					dsm_detach(seg);
+ 					slot->is_processed = true;
+ 					slot->is_valid = false;
+ 
+ 					RestoreResourceOwner(prevRO);
+ 
+ 					continue;
+ 				}
+ 
+ 				mq = shm_toc_lookup(toc, 0);
+ 				output = shm_mq_attach(mq, seg, NULL);
+ 
+ 				length_ptr = shm_toc_lookup(toc, 1);
+ 				length = *length_ptr;
+ 
+ 				/* do potentialy heavy and slow work */
+ 				PG_TRY();
+ 				{
+ 					char *data;
+ 					shm_mq_result		res;
+ 
+ 					data = prepare_demo_data(length, prevRO == CurrentResourceOwner);
+ 					res = shm_mq_send(output, length + 1, data, false);
+ 					if (res != SHM_MQ_SUCCESS)
+ 					{
+ 						elog(LOG, "unable to send data to recipient");
+ 
+ 						dsm_detach(seg);
+ 						slot->is_processed = true;
+ 						slot->is_valid = false;
+ 					}
+ 
+ 					slot->is_processed = true;
+ 					shm_mq_detach(mq);
+ 					dsm_detach(seg);
+ 
+ 					RestoreResourceOwner(prevRO);
+ 				}
+ 				PG_CATCH();
+ 				{
+ 					slot->is_processed = true;
+ 					slot->is_valid = false;
+ 
+ 					if (seg != NULL)
+ 						dsm_detach(seg);
+ 
+ 					RestoreResourceOwner(prevRO);
+ 
+ 					PG_RE_THROW();
+ 				}
+ 				PG_END_TRY();
+ 
+ 			}
+ 		}
+ 	}
+ }
+ 
+ /*
+  * generate some content that will be sended to recipient
+  */
+ static char *
+ prepare_demo_data(int length, bool is_active_stmt)
+ {
+ 	char *buffer, *ptr;
+ 	char *chars = "0123456789";
+ 	int padding = 0;
+ 
+ 	length = length < 100 ? 100 : length;
+ 
+ 	buffer = palloc(length + 1);
+ 
+ 	if (!is_active_stmt)
+ 	{
+ 		sprintf(buffer, "idle sender pid: %d", MyProcPid);
+ 		return buffer;
+ 	}
+ 
+ 	sprintf(buffer, "sender pid: %d\n", MyProcPid);
+ 
+ 	ptr = buffer + strlen(buffer);
+ 	length -= strlen(buffer);
+ 
+ 	while (length > 1)
+ 	{
+ 		*ptr++ = chars[padding++ % 10];
+ 
+ 		length--;
+ 		if (padding % 30 == 0)
+ 		{
+ 			*ptr++ = '\n';
+ 			length--;
+ 		}
+ 	}
+ 
+ 	*ptr++ = '#';
+ 	*ptr = '\0';
+ 
+ 	return buffer;
+ }
+ 
+ /*
+  * Send custom signal and returns string generated by signaled process.
+  */
+ Datum
+ signal_process(PG_FUNCTION_ARGS)
+ {
+ 	int		pid = PG_GETARG_INT32(0);
+ 	int		length = PG_GETARG_INT32(1);
+ 	dsm_segment		*seg;
+ 	shm_toc_estimator	e;
+ 	shm_toc			*toc;
+ 	shm_mq			*mq;
+ 	shm_mq_handle		*input;
+ 	shm_mq_result		res;
+ 	Size		segsize;
+ 	int		*param_ptr;
+ 	text		*result;
+ 	PGPROC	*proc;
+ 	Size		received_data_length;
+ 	char		*data;
+ 
+ 	if (pid == MyProcPid)
+ 		elog(ERROR, "cannot to send signal to self");
+ 
+ 	proc = BackendPidGetProc(pid);
+ 	if (proc == NULL)
+ 		elog(ERROR, "invalid pid");
+ 
+ 	if (length < 30)
+ 		elog(ERROR, "second parameter (length: %d) should be >= 30", length);
+ 
+ 
+ 	/* prepare shared dsm segment */
+ 	shm_toc_initialize_estimator(&e);
+ 	shm_toc_estimate_chunk(&e, 1024);
+ 	shm_toc_estimate_chunk(&e, sizeof(int));
+ 	shm_toc_estimate_keys(&e, 2);
+ 	segsize = shm_toc_estimate(&e);
+ 
+ 	seg = dsm_create(segsize, DSM_CREATE_NULL_IF_MAXSEGMENTS);
+ 	if (seg == NULL)
+ 		elog(ERROR, "cannot to create dsm segment");
+ 
+ 	PG_TRY();
+ 	{
+ 		toc = shm_toc_create(PG_SEND_PROCESS_INFO_MAGIC, dsm_segment_address(seg), segsize);
+ 		if (toc == NULL)
+ 			elog(ERROR, "cannot to create toc");
+ 
+ 		/* prepare basic structures */
+ 		mq = shm_mq_create(shm_toc_allocate(toc, 1024), 1024);
+ 		shm_mq_set_receiver(mq, MyProc);
+ 		shm_mq_set_sender(mq, proc);
+ 		shm_toc_insert(toc, 0, mq);
+ 
+ 		param_ptr = (int *) shm_toc_allocate(toc, sizeof(int));
+ 		*param_ptr = length;
+ 		shm_toc_insert(toc, 1, param_ptr);
+ 
+ 		if (SendProcessInfoSignal(pid, proc->backendId , dsm_segment_handle(seg), SEND_INFO_DEMO_TAG))
+ 			elog(ERROR, "could not send signal to process");
+ 
+ 		/* try to read from queue */
+ 		input = shm_mq_attach(mq, seg, NULL);
+ 		res = shm_mq_receive(input, &received_data_length, (void **) &data, false);
+ 		if (res != SHM_MQ_SUCCESS)
+ 			elog(ERROR, "could not receive message");
+ 
+ 		if (!is_valid_result())
+ 			elog(ERROR, "sender doesn't send valid data");
+ 
+ 		result = cstring_to_text_with_len(data, received_data_length - 1);
+ 
+ 		InvalidateProcPipeSlot();
+ 		shm_mq_detach(mq);
+ 		dsm_detach(seg);
+ 	}
+ 	PG_CATCH();
+ 	{
+ 		InvalidateProcPipeSlot();
+ 		dsm_detach(seg);
+ 		PG_RE_THROW();
+ 	}
+ 	PG_END_TRY();
+ 
+ 	PG_RETURN_TEXT_P(result);
+ }
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
new file mode 100644
index 7b19714..5a36936
*** a/src/backend/utils/init/postinit.c
--- b/src/backend/utils/init/postinit.c
***************
*** 60,66 ****
  #include "utils/syscache.h"
  #include "utils/timeout.h"
  #include "utils/tqual.h"
! 
  
  static HeapTuple GetDatabaseTuple(const char *dbname);
  static HeapTuple GetDatabaseTupleByOid(Oid dboid);
--- 60,66 ----
  #include "utils/syscache.h"
  #include "utils/timeout.h"
  #include "utils/tqual.h"
! #include "utils/signal_demo.h"
  
  static HeapTuple GetDatabaseTuple(const char *dbname);
  static HeapTuple GetDatabaseTupleByOid(Oid dboid);
*************** InitPostgres(const char *in_dbname, Oid
*** 588,593 ****
--- 588,596 ----
  	/* Now that we have a BackendId, we can participate in ProcSignal */
  	ProcSignalInit(MyBackendId);
  
+ 	/* And same for ProcPipe */
+ 	ProcPipeInit(MyBackendId);
+ 
  	/*
  	 * Also set up timeout handlers needed for backend operation.  We need
  	 * these in every case except bootstrap.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
new file mode 100644
index ddf7c67..ae25559
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2171 ( pg_cancel_backe
*** 3134,3139 ****
--- 3134,3141 ----
  DESCR("cancel a server process' current query");
  DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
  DESCR("terminate a server process");
+ DATA(insert OID = 4066 ( signal_process		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 25 "23 23" _null_ _null_ _null_ _null_ _null_ signal_process _null_ _null_ _null_ ));
+ DESCR("cancel a server process' current query");
  DATA(insert OID = 2172 ( pg_start_backup		PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 3220 "25 16" _null_ _null_ _null_ _null_ _null_ pg_start_backup _null_ _null_ _null_ ));
  DESCR("prepare for taking an online backup");
  DATA(insert OID = 2173 ( pg_stop_backup			PGNSP PGUID 12 1 0 0 0 f f f f t f v 0 0 3220 "" _null_ _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
new file mode 100644
index e0cc69f..3b9a7bf
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
***************
*** 80,85 ****
--- 80,86 ----
  extern PGDLLIMPORT volatile bool InterruptPending;
  extern PGDLLIMPORT volatile bool QueryCancelPending;
  extern PGDLLIMPORT volatile bool ProcDiePending;
+ extern PGDLLIMPORT volatile bool SendProcessInfoPending;
  
  extern volatile bool ClientConnectionLost;
  
*************** extern void SwitchBackToLocalLatch(void)
*** 324,329 ****
--- 325,333 ----
  extern bool superuser(void);	/* current user is superuser */
  extern bool superuser_arg(Oid roleid);	/* given user is superuser */
  
+ /* utils/misc/signal_demo.c */
+ extern void HandleSendProcessInfo(void);
+ 
  
  /*****************************************************************************
   *	  pmod.h --																 *
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
new file mode 100644
index af1a0cd..0d692a7
*** a/src/include/storage/procsignal.h
--- b/src/include/storage/procsignal.h
*************** typedef enum
*** 32,37 ****
--- 32,38 ----
  	PROCSIG_CATCHUP_INTERRUPT,	/* sinval catchup interrupt */
  	PROCSIG_NOTIFY_INTERRUPT,	/* listen/notify interrupt */
  	PROCSIG_PARALLEL_MESSAGE,	/* message from cooperating parallel backend */
+ 	PROCSIG_SEND_INFO,		/* internal info request */
  
  	/* Recovery conflict reasons */
  	PROCSIG_RECOVERY_CONFLICT_DATABASE,
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
new file mode 100644
index fc1679e..7310dc0
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum pg_prepared_statement(PG_FU
*** 1264,1267 ****
--- 1264,1270 ----
  /* utils/mmgr/portalmem.c */
  extern Datum pg_cursor(PG_FUNCTION_ARGS);
  
+ /* utils/signal_demo.c */
+ extern Datum signal_process(PG_FUNCTION_ARGS);
+ 
  #endif   /* BUILTINS_H */
diff --git a/src/include/utils/signal_demo.h b/src/include/utils/signal_demo.h
new file mode 100644
index ...0c951c4
*** a/src/include/utils/signal_demo.h
--- b/src/include/utils/signal_demo.h
***************
*** 0 ****
--- 1,14 ----
+ #include "storage/dsm.h"
+ 
+ extern Size ProcPipeShmemSize(void);
+ extern void ProcPipeShmemInit(void);
+ extern void ProcPipeInit(int pss_idx);
+ 
+ extern int SendProcessInfoSignal(pid_t data_sender_pid, BackendId data_sender_backendId,
+ 		    dsm_handle dsm_seg_handle, int data_tag);
+ 
+ extern void SendProcessInfoInterrupt(void);
+ extern void HandleSendProcessInfo(void);
+ 
+ extern void InvalidateProcPipeSlot(void);
+ extern bool is_valid_result(void);
#32Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#31)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 8, 2015 at 6:58 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

But you will still lock on the slots list to find an unused one. How is

that substantially different from what I'm doing?

It is not necessary - you can use similar technique to what it does

PGPROC. I am sending "lock free" demo.

I don't afraid about locks - short locks, when the range and time are

limited. But there are lot of bugs, and fixes with the name "do
interruptible some", and it is reason, why I prefer typical design for work
with shared memory.

Thanks, this is really helpful! The key difference is that every backend
has a dedicated slot, so there's no need to search for a free one, which
would again incur locking.

Well, we are talking about hundreds to thousands bytes per plan in

total. And if my reading of shm_mq implementation is correct, if the
message fits into the shared memory buffer, the receiver gets the direct
pointer to the shared memory, no extra allocation/copy to process-local
memory. So this can be actually a win.

you have to calculate with signals and interprocess communication. the

cost of memory allocation is not all.

Sure. Anyway, we're talking about only kilobytes being sent in this case,
so the whole performance discussion is rather moot.

The real problem could be if the process that was signaled to connect to

the message queue never handles the interrupt, and we keep waiting forever
in shm_mq_receive(). We could add a timeout parameter or just let the user
cancel the call: send a cancellation request, use pg_cancel_backend() or
set statement_timeout before running this.

This is valid question - for begin we can use a statement_timeout and we

don't need to design some special (if you don't hold some important lock).

My example (the code has prototype quality) is little bit longer, but it

work without global lock - the requester doesn't block any other

I'll update the commitfest patch to use this technique.

--
Alex

#33Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#32)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 8, 2015 at 11:49 AM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

The real problem could be if the process that was signaled to connect

to the message queue never handles the interrupt, and we keep waiting
forever in shm_mq_receive(). We could add a timeout parameter or just let
the user cancel the call: send a cancellation request, use
pg_cancel_backend() or set statement_timeout before running this.

This is valid question - for begin we can use a statement_timeout and we

don't need to design some special (if you don't hold some important lock).

My example (the code has prototype quality) is little bit longer, but it

work without global lock - the requester doesn't block any other

I'll update the commitfest patch to use this technique.

Please find attached v4.

--
Alex

Attachments:

explain-pid-v4.patchtext/x-patch; charset=US-ASCII; name=explain-pid-v4.patchDownload
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a2b757c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -139,6 +139,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, CmdStatusShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -243,6 +244,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
+	CmdStatusShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..e637be1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMD_STATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d917af3..1a5b03c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoPending)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 0000000..c17de9d
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,647 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_COMMAND_TAG
+} CmdStatusInfoResultCode;
+
+typedef struct
+{
+	pid_t			css_pid;
+	pid_t			sender_css_pid;
+	/*BackendId		sender_backend_id;*/
+	dsm_handle		dsm_seg_handle;
+	CmdStatusInfoRequestType	request_type;
+	CmdStatusInfoResultCode		result_code;
+	bool			is_valid;
+	bool			is_processed;
+} CmdStatusSlot;
+
+#define NumCmdStatusSlots	(MaxBackends + NUM_AUXPROCTYPES)
+
+#define PG_CMD_STATUS_INFO_MAGIC	0x79fb2449
+#define CMD_STATUS_BUFFER_SIZE		1024
+
+/*
+ * These structs are allocated on the program stack as local variables in the
+ * ExecutorRun hook.  The top of stack is current_query_stack, see below.
+ */
+typedef struct CmdInfoStack {
+	QueryDesc			   *query_desc;
+	struct CmdInfoStack	   *parent;
+} CmdInfoStack;
+
+
+static CmdStatusSlot *CmdStatusSlots = NULL;
+static volatile CmdStatusSlot *MyCmdStatusSlot = NULL;
+
+static CmdInfoStack	   *current_query_stack = NULL;
+static int				query_stack_size = 0;
+
+static ExecutorRun_hook_type	prev_ExecutorRun = NULL;
+
+static void cmdstatus_ExecutorRun(QueryDesc *queryDesc,
+								  ScanDirection direction, long count);
+
+
+Size
+CmdStatusShmemSize(void)
+{
+	return NumCmdStatusSlots * sizeof(CmdStatusSlot);
+}
+
+void
+CmdStatusShmemInit(void)
+{
+	Size		size = CmdStatusShmemSize();
+	bool		found;
+
+	CmdStatusSlots = (CmdStatusSlot *)
+		ShmemInitStruct("CmdStatusSlots", size, &found);
+
+	if (!found)
+		MemSet(CmdStatusSlots, 0, size);
+}
+
+static void
+CleanupCmdStatusSlot(int status, Datum arg)
+{
+	int		css_idx = DatumGetInt32(arg);
+	volatile CmdStatusSlot *slot;
+
+	slot = &CmdStatusSlots[css_idx - 1];
+	Assert(slot == MyCmdStatusSlot);
+
+	MyCmdStatusSlot = NULL;
+
+	if (slot->css_pid != MyProcPid)
+	{
+		elog(LOG, "process %d releasing ProcSignal slot %d, but it contains %d",
+			 MyProcPid, css_idx, (int) slot->css_pid);
+		return;					/* XXX better to zero the slot anyway? */
+	}
+
+	slot->css_pid = 0;
+	slot->is_valid = false;
+}
+
+void
+CmdStatusInit(int css_idx)
+{
+	volatile CmdStatusSlot *slot;
+
+	slot = &CmdStatusSlots[css_idx - 1];
+
+	if (slot->css_pid != 0)
+		elog(LOG, "process %d taking over CmdStatus slot %d, but it's not empty",
+			 MyProcPid, css_idx);
+
+	slot->css_pid = MyProcPid;
+	slot->is_valid = false;
+
+	MyCmdStatusSlot = slot;
+
+	on_shmem_exit(CleanupCmdStatusSlot, Int32GetDatum(css_idx));
+
+	/* also install executor hooks */
+	prev_ExecutorRun = ExecutorRun_hook;
+	ExecutorRun_hook = cmdstatus_ExecutorRun;
+}
+
+static int
+SendCmdStatusInfoSignal(pid_t target_pid, BackendId target_backend_id,
+						dsm_handle dsm_seg_handle,
+						CmdStatusInfoRequestType request_type)
+{
+	MyCmdStatusSlot->sender_css_pid = target_pid;
+	/*MyCmdStatusSlot->sender_backend_id = target_backend_id;*/
+	MyCmdStatusSlot->dsm_seg_handle = dsm_seg_handle;
+	MyCmdStatusSlot->request_type = request_type;
+	MyCmdStatusSlot->is_processed = false;
+	MyCmdStatusSlot->is_valid = true;		/* this should be set the latest */
+
+	return SendProcSignal(target_pid, PROCSIG_CMD_STATUS_INFO, target_backend_id);
+}
+
+static void
+InvalidateCmdStatusSlot(void)
+{
+	MyCmdStatusSlot->is_valid = false;
+	MyCmdStatusSlot->is_processed = false;
+	MyCmdStatusSlot->sender_css_pid = 0;
+	/*MyCmdStatusSlot->sender_backend_id = 0; */
+	MyCmdStatusSlot->dsm_seg_handle = NULL;
+	MyCmdStatusSlot->request_type = 0;
+}
+
+
+
+static void
+cmdstatus_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+{
+	CmdInfoStack	current;
+
+	current.query_desc = queryDesc;
+	current.parent = current_query_stack;
+
+	current_query_stack = &current;
+	query_stack_size++;
+
+	PG_TRY();
+	{
+		if (prev_ExecutorRun)
+			prev_ExecutorRun(queryDesc, direction, count);
+		else
+			standard_ExecutorRun(queryDesc, direction, count);
+
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+	}
+	PG_CATCH();
+	{
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	StringInfo		str;
+	ExplainState   *es;
+
+	es = NewExplainState();
+	es->analyze = false;
+	es->verbose = false;
+	es->buffers = false;
+	es->format = EXPLAIN_FORMAT_TEXT;
+
+	ExplainBeginOutput(es);
+	/* XXX: appendStringInfo(es->str, "#%d ", depth); ? */
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	ExplainEndOutput(es);
+
+	str = es->str;
+
+	pfree(es);
+	return str;
+}
+
+
+/* signal handler for PROCSIG_CMD_STATUS_INFO */
+void
+HandleCmdStatusInfoInterrupt(void)
+{
+	CmdStatusInfoPending = true;
+	InterruptPending = true;
+
+	SetLatch(MyLatch);
+}
+
+
+static StringInfo
+ProcessCmdStatusSlot(volatile CmdStatusSlot *slot)
+{
+	StringInfo		result = NULL;
+
+	/* Show some optimism. */
+	slot->result_code = CMD_STATUS_RESULT_SUCCESS;
+
+	if (ActivePortal)
+	{
+		switch (slot->request_type)
+		{
+			case CMD_STATUS_REQUEST_EXPLAIN:
+				result = explain_query(ActivePortal->queryDesc);
+				break;
+
+			case CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE:
+				{
+					StringInfo		str;
+					CmdInfoStack   *query;
+
+					result = makeStringInfo();
+
+					if (current_query_stack != NULL)
+					{
+						for (query = current_query_stack;
+							 query != NULL;
+							 query = query->parent)
+						{
+							str = explain_query(query->query_desc);
+							appendBinaryStringInfo(result, str->data, str->len);
+
+							pfree(str->data);
+							pfree(str);
+						}
+					}
+					else
+					{
+						/* XXX */
+						slot->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+					}
+					break;
+				}
+
+			case CMD_STATUS_REQUEST_QUERY_TEXT:
+				{
+					result = makeStringInfo();
+					appendStringInfoString(result, ActivePortal->sourceText);
+				}
+				break;
+
+			case CMD_STATUS_REQUEST_PROGRESS_TAG:
+				if (ActivePortal->commandTag != NULL)
+				{
+					result = makeStringInfo();
+
+					if (ActivePortal->queryDesc != NULL &&
+						ActivePortal->queryDesc->estate != NULL)
+					{
+						appendStringInfo(result, "%s %u",
+										 ActivePortal->commandTag,
+										 ActivePortal->queryDesc->estate->es_processed);
+					}
+					else
+					{
+						/* no progress available, at least show the command tag */
+						appendStringInfoString(result, ActivePortal->commandTag);
+					}
+				}
+				else
+				{
+					slot->result_code = CMD_STATUS_RESULT_NO_COMMAND_TAG;
+				}
+				break;
+		}
+	}
+	else
+	{
+		slot->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+	}
+
+	return result;
+}
+
+static void
+RestoreResourceOwner(ResourceOwner prevResourceOwner)
+{
+	if (prevResourceOwner != CurrentResourceOwner)
+	{
+		ResourceOwner	res_owner = CurrentResourceOwner;
+
+		CurrentResourceOwner = prevResourceOwner;
+		ResourceOwnerDelete(res_owner);
+	}
+}
+
+void
+ProcessCmdStatusInfoRequest(void)
+{
+	elog(LOG, "+ProcessCmdStatusInfoRequest: %d", CmdStatusInfoPending);
+
+	while (CmdStatusInfoPending)
+	{
+		int		i;
+
+		CmdStatusInfoPending = false;
+
+		for (i = 0; i < NumCmdStatusSlots; i++)
+		{
+			volatile CmdStatusSlot *slot = &CmdStatusSlots[i];
+
+			if (slot->sender_css_pid == MyProcPid)
+			{
+				elog(LOG, "found receiver slot: PID %d, valid: %d, processed: %d",
+					 slot->css_pid, slot->is_valid, slot->is_processed);
+
+			if (slot->is_valid && !slot->is_processed)
+			{
+				ResourceOwner	prevResourceOwner = CurrentResourceOwner;
+				dsm_segment	   *seg = NULL;
+				shm_toc		   *toc = NULL;
+				shm_mq_handle  *output;
+				shm_mq		   *mq;
+
+				/* Ensure valid resource owner for access to dsm. */
+				if (CurrentResourceOwner == NULL)
+					CurrentResourceOwner = ResourceOwnerCreate(NULL, "ProcessCmdStatusInfoRequest");
+
+				seg = dsm_attach(slot->dsm_seg_handle);
+				if (seg == NULL)
+				{
+					elog(LOG, "unable to map dynamic memory segment for command status");
+
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					RestoreResourceOwner(prevResourceOwner);
+					continue;
+				}
+
+				toc = shm_toc_attach(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg));
+				if (toc == NULL)
+				{
+					elog(LOG, "bad magic in dynamic memory segment for command status");
+
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+					continue;
+				}
+
+				mq = shm_toc_lookup(toc, 0);
+
+				elog(LOG, "sender: attaching mq");
+				output = shm_mq_attach(mq, seg, NULL);
+
+				PG_TRY();
+				{
+					StringInfo		payload;
+					int				res;
+
+					payload = ProcessCmdStatusSlot(slot);
+					if (payload != NULL)
+					{
+						elog(LOG, "sender: sending payload");
+
+						slot->is_processed = true;
+
+						res = shm_mq_send(output, payload->len, payload->data, false);
+						if (res != SHM_MQ_SUCCESS)
+						{
+							elog(LOG, "cannot send command status to backend with PID %d",
+								 slot->css_pid);
+							slot->is_valid = false;
+						}
+						pfree(payload->data);
+						pfree(payload);
+					}
+					else
+					{
+						slot->is_valid = false;
+					}
+
+					elog(LOG, "sender: detaching mq");
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+				}
+				PG_CATCH();
+				{
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					elog(LOG, "sender: detaching mq (ON ERROR)");
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+
+					PG_RE_THROW();
+				}
+				PG_END_TRY();
+			}
+			}
+		}
+	}
+
+	elog(LOG, "-ProcessCmdStatusInfoRequest");
+}
+
+static void
+report_result_code(CmdStatusInfoResultCode result, pid_t target_pid)
+{
+	switch (result)
+	{
+		case CMD_STATUS_RESULT_BACKEND_IDLE:
+			elog(INFO, "no command is currently running in backend with PID %d",
+				 target_pid);
+			break;
+
+		case CMD_STATUS_RESULT_NO_COMMAND_TAG:
+			elog(WARNING, "no command tag found for the query in backend with PID %d",
+				 target_pid);
+			break;
+
+		default:
+			elog(ERROR, "general command status request failure");
+			break;
+	}
+}
+
+/*
+ * try to get status of command in another process
+ *
+ * FUNCTION get_cmdstatus(pid, request_type)
+ * RETURNS SETOF text
+ */
+Datum
+pg_cmdstatus(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	   *tupstore;
+	TupleDesc			tupdesc;
+	MemoryContext		per_query_ctx;
+	MemoryContext		oldcontext;
+	pid_t				target_pid = (pid_t) PG_GETARG_INT32(0);
+	int					request_type = PG_GETARG_INT32(1);
+	PGPROC			   *proc;
+	shm_toc_estimator	estimator;
+	Size				segsize;
+	dsm_segment		   *seg = NULL;
+	shm_toc			   *toc = NULL;
+	shm_mq			   *mq = NULL;
+	shm_mq_handle	   *input = NULL;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (request_type < 1 || request_type > CMD_STATUS_MAX_REQUEST)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("unknown command status request")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	if (target_pid == MyProcPid)
+		ereport(ERROR,
+				(errmsg("backend cannot query command status of itself")));
+
+	/* verify access to target_pid */
+	proc = BackendPidGetProc(target_pid);
+
+	if (proc == NULL)
+		ereport(ERROR,
+				(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+	if (!(superuser() || proc->roleId == GetUserId()))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser or have the same role to cancel queries running in other server processes"))));
+
+	/* prepare shared dsm segment */
+	shm_toc_initialize_estimator(&estimator);
+	shm_toc_estimate_keys(&estimator, 1);
+	shm_toc_estimate_chunk(&estimator, CMD_STATUS_BUFFER_SIZE);
+	segsize = shm_toc_estimate(&estimator);
+
+	seg = dsm_create(segsize, 0);
+
+	toc = shm_toc_create(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* prepare basic structures */
+	mq = shm_mq_create(shm_toc_allocate(toc, CMD_STATUS_BUFFER_SIZE),
+					   CMD_STATUS_BUFFER_SIZE);
+
+	shm_mq_set_receiver(mq, MyProc);
+	shm_mq_set_sender(mq, proc);
+	shm_toc_insert(toc, 0, mq);
+
+	PG_TRY();
+	{
+		shm_mq_result	res;
+		char		   *data = NULL;
+		Size			length;
+
+		elog(LOG, "receiver: attaching mq");
+		input = shm_mq_attach(mq, seg, NULL);
+
+		elog(LOG, "receiver: signalling PID %d", target_pid);
+		if (SendCmdStatusInfoSignal(target_pid, proc->backendId,
+									dsm_segment_handle(seg), request_type))
+		{
+			elog(ERROR, "could not signal backend with PID %d", target_pid);
+		}
+
+		/* try to read from queue */
+		res = shm_mq_receive(input, &length, (void **) &data, false);
+		if (res == SHM_MQ_SUCCESS)
+		{
+			const char	   *p = data;
+
+			/* break into tuples on newline */
+			while (*p)
+			{
+				const char *q = strchr(p, '\n');
+				Size		len = q ? q - p : strlen(p);
+				Datum		value;
+				HeapTuple	tuple;
+				bool		isnull = false;
+
+				value = PointerGetDatum(cstring_to_text_with_len(p, len));
+
+				tuple = heap_form_tuple(tupdesc, &value, &isnull);
+				tuplestore_puttuple(tupstore, tuple);
+
+				if (!q)
+					break;
+
+				p += len + 1;
+			}
+		}
+		else
+		{
+			report_result_code(MyCmdStatusSlot->result_code, target_pid);
+		}
+
+		/* clean up and return the tuplestore */
+		tuplestore_donestoring(tupstore);
+
+		InvalidateCmdStatusSlot();
+
+		elog(LOG, "receiver: detaching mq");
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+	}
+	PG_CATCH();
+	{
+		InvalidateCmdStatusSlot();
+
+		elog(LOG, "receiver: detaching mq (ON ERROR)");
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 23e594e..2269853 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -29,6 +29,7 @@ ProtocolVersion FrontendProtocol;
 volatile bool InterruptPending = false;
 volatile bool QueryCancelPending = false;
 volatile bool ProcDiePending = false;
+volatile bool CmdStatusInfoPending = false;
 volatile bool ClientConnectionLost = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7b19714..7e7ba77 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
+#include "utils/cmdstatus.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -588,6 +589,9 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	/* Now that we have a BackendId, we can participate in ProcSignal */
 	ProcSignalInit(MyBackendId);
 
+	/* Init CmdStatus slot */
+	CmdStatusInit(MyBackendId);
+
 	/*
 	 * Also set up timeout handlers needed for backend operation.  We need
 	 * these in every case except bootstrap.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..d083aaf 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3130,6 +3130,8 @@ DESCR("get OID of current session's temp schema, if any");
 DATA(insert OID = 2855 (  pg_is_other_temp_schema	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "26" _null_ _null_ _null_ _null_ _null_ pg_is_other_temp_schema _null_ _null_ _null_ ));
 DESCR("is schema another session's temp schema?");
 
+DATA(insert OID = 4099 ( pg_cmdstatus		PGNSP PGUID 12 1 100 0 0 f f f f f t s 2 0 25 "23 23" _null_ _null_  _null_  _null_ _null_ pg_cmdstatus _null_ _null_ _null_ ));
+DESCR("returns information about another process");
 DATA(insert OID = 2171 ( pg_cancel_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_cancel_backend _null_ _null_ _null_ ));
 DESCR("cancel a server process' current query");
 DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e0cc69f..5f03f63 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -80,6 +80,7 @@
 extern PGDLLIMPORT volatile bool InterruptPending;
 extern PGDLLIMPORT volatile bool QueryCancelPending;
 extern PGDLLIMPORT volatile bool ProcDiePending;
+extern PGDLLIMPORT volatile bool CmdStatusInfoPending;
 
 extern volatile bool ClientConnectionLost;
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index af1a0cd..cd856e5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -41,6 +41,8 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+	PROCSIG_CMD_STATUS_INFO,
+
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..605612e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1243,6 +1243,9 @@ extern Datum pg_identify_object_as_address(PG_FUNCTION_ARGS);
 /* catalog/objectaddress.c */
 extern Datum pg_get_object_address(PG_FUNCTION_ARGS);
 
+/* utils/adt/cmdstatus.c */
+extern Datum pg_cmdstatus(PG_FUNCTION_ARGS);
+
 /* commands/constraint.c */
 extern Datum unique_key_recheck(PG_FUNCTION_ARGS);
 
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
new file mode 100644
index 0000000..85627a57
--- /dev/null
+++ b/src/include/utils/cmdstatus.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.h
+ *	  Declarations for command status interrupt handling.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * src/include/utils/cmdstatus.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CMDSTATUS_H
+#define CMDSTATUS_H
+
+extern Size CmdStatusShmemSize(void);
+extern void CmdStatusShmemInit(void);
+extern void CmdStatusInit(int css_idx);
+
+extern void HandleCmdStatusInfoInterrupt(void);
+extern void ProcessCmdStatusInfoRequest(void);
+
+#endif   /* CMDSTATUS_H */
#34Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#33)
Re: On-demand running query plans using auto_explain and signals

2015-09-08 18:53 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Tue, Sep 8, 2015 at 11:49 AM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

The real problem could be if the process that was signaled to connect

to the message queue never handles the interrupt, and we keep waiting
forever in shm_mq_receive(). We could add a timeout parameter or just let
the user cancel the call: send a cancellation request, use
pg_cancel_backend() or set statement_timeout before running this.

This is valid question - for begin we can use a statement_timeout and

we don't need to design some special (if you don't hold some important
lock).

My example (the code has prototype quality) is little bit longer, but

it work without global lock - the requester doesn't block any other

I'll update the commitfest patch to use this technique.

Please find attached v4.

It is better

Two notices:

1. The communication mechanism can be used more wide, than only for this
purpose. We can introduce a SendInfoHook - and it can be used for any
customer probes - memory, cpu, ...

2. With your support for explain of nested queries we have all what we need
for integration auto_explain to core.

Regards

Pavel

Show quoted text

--
Alex

#35Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#34)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 9, 2015 at 8:36 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Please find attached v4.

It is better

One important thing to notice, and probably deserves a code comment, that
any modification of the slot fields done by the info sender backend must be
done before actually sending the message via the shared memory queue (or
detaching one, if there's nothing to be sent). Otherwise it is possible
that is_processed flag will be set on the slot that has been already reused
by the receiving backend.

I could hit that problem rather easily while doing

select pg_cmdstatus(pid,$1) from pg_stat_activity where
pid<>pg_backend_pid() \watch 1

and running pgbench with >= 8 connections in the background. The observed
effect is that after a few successful runs, the pg_cmdstatus() never
returns because the next signaled backend loops through the slots and
doesn't find an unprocessed one (because it was marked as processed by the
backend that was responding just before it).

Two notices:

1. The communication mechanism can be used more wide, than only for this
purpose. We can introduce a SendInfoHook - and it can be used for any
customer probes - memory, cpu, ...

Not sure if for CPU you can get any more insight than an external tool like
top(1) will provide. For the memory, it might be indeed useful in some way
(memory context inspection?)

So there's certainly a space for generalization. Rename it as
pg_backend_info()?

2. With your support for explain of nested queries we have all what we need

for integration auto_explain to core.

Well, I don't quite see how that follows. What I'm really after is
bringing instrumentation support to this, so that not only plan could be
extracted, but also the rows/loops/times accumulated thus far during the
query run.

--
Alex

#36Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#35)
Re: On-demand running query plans using auto_explain and signals

Two notices:

1. The communication mechanism can be used more wide, than only for this
purpose. We can introduce a SendInfoHook - and it can be used for any
customer probes - memory, cpu, ...

Not sure if for CPU you can get any more insight than an external tool like
top(1) will provide. For the memory, it might be indeed useful in some way
(memory context inspection?)

CPU is probably nonsense - sorry - but there are more other possibilities,
how to use it - simple checking some internal states of extensions without
log processing or without gdb, etc

So there's certainly a space for generalization. Rename it as
pg_backend_info()?

It is good name, so why not?

2. With your support for explain of nested queries we have all what we

need for integration auto_explain to core.

Well, I don't quite see how that follows. What I'm really after is
bringing instrumentation support to this, so that not only plan could be
extracted, but also the rows/loops/times accumulated thus far during the
query run.

It is possible, but not necessary step - but it is premature question in
this moment

Regards

Pavel

Show quoted text

--
Alex

#37Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#35)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 9, 2015 at 10:22 AM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Wed, Sep 9, 2015 at 8:36 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Please find attached v4.

It is better

One important thing to notice, and probably deserves a code comment, that
any modification of the slot fields done by the info sender backend must be
done before actually sending the message via the shared memory queue (or
detaching one, if there's nothing to be sent). Otherwise it is possible
that is_processed flag will be set on the slot that has been already reused
by the receiving backend.

Sorry, the latest version did contain some debugging messages I didn't
intend to send. I've removed them and also added a comment about the above
synchronization problem.

--
Alex

Attachments:

explain-pid-v4.1.patchtext/x-patch; charset=US-ASCII; name=explain-pid-v4.1.patchDownload
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a2b757c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -139,6 +139,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, CmdStatusShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -243,6 +244,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
+	CmdStatusShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..e637be1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMD_STATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d917af3..1a5b03c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoPending)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 0000000..40df6e1
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,640 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_COMMAND_TAG
+} CmdStatusInfoResultCode;
+
+typedef struct
+{
+	pid_t			css_pid;
+	pid_t			sender_css_pid;
+	/*BackendId		sender_backend_id;*/
+	dsm_handle		dsm_seg_handle;
+	CmdStatusInfoRequestType	request_type;
+	CmdStatusInfoResultCode		result_code;
+	bool			is_valid;
+	bool			is_processed;
+} CmdStatusSlot;
+
+#define NumCmdStatusSlots	(MaxBackends + NUM_AUXPROCTYPES)
+
+#define PG_CMD_STATUS_INFO_MAGIC	0x79fb2449
+#define CMD_STATUS_BUFFER_SIZE		1024
+
+/*
+ * These structs are allocated on the program stack as local variables in the
+ * ExecutorRun hook.  The top of stack is current_query_stack, see below.
+ */
+typedef struct CmdInfoStack {
+	QueryDesc			   *query_desc;
+	struct CmdInfoStack	   *parent;
+} CmdInfoStack;
+
+
+static CmdStatusSlot *CmdStatusSlots = NULL;
+static volatile CmdStatusSlot *MyCmdStatusSlot = NULL;
+
+static CmdInfoStack	   *current_query_stack = NULL;
+static int				query_stack_size = 0;
+
+static ExecutorRun_hook_type	prev_ExecutorRun = NULL;
+
+static void cmdstatus_ExecutorRun(QueryDesc *queryDesc,
+								  ScanDirection direction, long count);
+
+
+Size
+CmdStatusShmemSize(void)
+{
+	return NumCmdStatusSlots * sizeof(CmdStatusSlot);
+}
+
+void
+CmdStatusShmemInit(void)
+{
+	Size		size = CmdStatusShmemSize();
+	bool		found;
+
+	CmdStatusSlots = (CmdStatusSlot *)
+		ShmemInitStruct("CmdStatusSlots", size, &found);
+
+	if (!found)
+		MemSet(CmdStatusSlots, 0, size);
+}
+
+static void
+CleanupCmdStatusSlot(int status, Datum arg)
+{
+	int		css_idx = DatumGetInt32(arg);
+	volatile CmdStatusSlot *slot;
+
+	slot = &CmdStatusSlots[css_idx - 1];
+	Assert(slot == MyCmdStatusSlot);
+
+	MyCmdStatusSlot = NULL;
+
+	if (slot->css_pid != MyProcPid)
+	{
+		elog(LOG, "process %d releasing ProcSignal slot %d, but it contains %d",
+			 MyProcPid, css_idx, (int) slot->css_pid);
+		return;					/* XXX better to zero the slot anyway? */
+	}
+
+	slot->css_pid = 0;
+	slot->is_valid = false;
+}
+
+void
+CmdStatusInit(int css_idx)
+{
+	volatile CmdStatusSlot *slot;
+
+	slot = &CmdStatusSlots[css_idx - 1];
+
+	if (slot->css_pid != 0)
+		elog(LOG, "process %d taking over CmdStatus slot %d, but it's not empty",
+			 MyProcPid, css_idx);
+
+	slot->css_pid = MyProcPid;
+	slot->is_valid = false;
+
+	MyCmdStatusSlot = slot;
+
+	on_shmem_exit(CleanupCmdStatusSlot, Int32GetDatum(css_idx));
+
+	/* also install executor hooks */
+	prev_ExecutorRun = ExecutorRun_hook;
+	ExecutorRun_hook = cmdstatus_ExecutorRun;
+}
+
+static int
+SendCmdStatusInfoSignal(pid_t target_pid, BackendId target_backend_id,
+						dsm_handle dsm_seg_handle,
+						CmdStatusInfoRequestType request_type)
+{
+	MyCmdStatusSlot->sender_css_pid = target_pid;
+	/*MyCmdStatusSlot->sender_backend_id = target_backend_id;*/
+	MyCmdStatusSlot->dsm_seg_handle = dsm_seg_handle;
+	MyCmdStatusSlot->request_type = request_type;
+	MyCmdStatusSlot->is_processed = false;
+	MyCmdStatusSlot->is_valid = true;		/* this should be set the latest */
+
+	return SendProcSignal(target_pid, PROCSIG_CMD_STATUS_INFO, target_backend_id);
+}
+
+static void
+InvalidateCmdStatusSlot(void)
+{
+	MyCmdStatusSlot->is_valid = false;
+	MyCmdStatusSlot->is_processed = false;
+	MyCmdStatusSlot->sender_css_pid = 0;
+	/*MyCmdStatusSlot->sender_backend_id = 0; */
+	MyCmdStatusSlot->dsm_seg_handle = 0;
+	MyCmdStatusSlot->request_type = 0;
+}
+
+
+
+static void
+cmdstatus_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+{
+	CmdInfoStack	current;
+
+	current.query_desc = queryDesc;
+	current.parent = current_query_stack;
+
+	current_query_stack = &current;
+	query_stack_size++;
+
+	PG_TRY();
+	{
+		if (prev_ExecutorRun)
+			prev_ExecutorRun(queryDesc, direction, count);
+		else
+			standard_ExecutorRun(queryDesc, direction, count);
+
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+	}
+	PG_CATCH();
+	{
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	StringInfo		str;
+	ExplainState   *es;
+
+	es = NewExplainState();
+	es->analyze = false;
+	es->verbose = false;
+	es->buffers = false;
+	es->format = EXPLAIN_FORMAT_TEXT;
+
+	ExplainBeginOutput(es);
+	/* XXX: appendStringInfo(es->str, "#%d ", depth); ? */
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	ExplainEndOutput(es);
+
+	str = es->str;
+
+	pfree(es);
+	return str;
+}
+
+
+/* signal handler for PROCSIG_CMD_STATUS_INFO */
+void
+HandleCmdStatusInfoInterrupt(void)
+{
+	CmdStatusInfoPending = true;
+	InterruptPending = true;
+
+	SetLatch(MyLatch);
+}
+
+
+static StringInfo
+ProcessCmdStatusSlot(volatile CmdStatusSlot *slot)
+{
+	StringInfo		result = NULL;
+
+	/* Show some optimism. */
+	slot->result_code = CMD_STATUS_RESULT_SUCCESS;
+
+	if (ActivePortal)
+	{
+		switch (slot->request_type)
+		{
+			case CMD_STATUS_REQUEST_EXPLAIN:
+				result = explain_query(ActivePortal->queryDesc);
+				break;
+
+			case CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE:
+				{
+					StringInfo		str;
+					CmdInfoStack   *query;
+
+					result = makeStringInfo();
+
+					if (current_query_stack != NULL)
+					{
+						for (query = current_query_stack;
+							 query != NULL;
+							 query = query->parent)
+						{
+							str = explain_query(query->query_desc);
+							appendBinaryStringInfo(result, str->data, str->len);
+
+							pfree(str->data);
+							pfree(str);
+						}
+					}
+					else
+					{
+						/* XXX */
+						slot->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+					}
+					break;
+				}
+
+			case CMD_STATUS_REQUEST_QUERY_TEXT:
+				{
+					result = makeStringInfo();
+					appendStringInfoString(result, ActivePortal->sourceText);
+				}
+				break;
+
+			case CMD_STATUS_REQUEST_PROGRESS_TAG:
+				if (ActivePortal->commandTag != NULL)
+				{
+					result = makeStringInfo();
+
+					if (ActivePortal->queryDesc != NULL &&
+						ActivePortal->queryDesc->estate != NULL)
+					{
+						appendStringInfo(result, "%s %u",
+										 ActivePortal->commandTag,
+										 ActivePortal->queryDesc->estate->es_processed);
+					}
+					else
+					{
+						/* no progress available, at least show the command tag */
+						appendStringInfoString(result, ActivePortal->commandTag);
+					}
+				}
+				else
+				{
+					slot->result_code = CMD_STATUS_RESULT_NO_COMMAND_TAG;
+				}
+				break;
+		}
+	}
+	else
+	{
+		slot->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+	}
+
+	return result;
+}
+
+static void
+RestoreResourceOwner(ResourceOwner prevResourceOwner)
+{
+	if (prevResourceOwner != CurrentResourceOwner)
+	{
+		ResourceOwner	res_owner = CurrentResourceOwner;
+
+		CurrentResourceOwner = prevResourceOwner;
+		ResourceOwnerDelete(res_owner);
+	}
+}
+
+void
+ProcessCmdStatusInfoRequest(void)
+{
+	while (CmdStatusInfoPending)
+	{
+		int		i;
+
+		CmdStatusInfoPending = false;
+
+		for (i = 0; i < NumCmdStatusSlots; i++)
+		{
+			volatile CmdStatusSlot *slot = &CmdStatusSlots[i];
+
+			if (slot->sender_css_pid == MyProcPid &&
+				slot->is_valid && !slot->is_processed)
+			{
+				ResourceOwner	prevResourceOwner = CurrentResourceOwner;
+				dsm_segment	   *seg = NULL;
+				shm_toc		   *toc = NULL;
+				shm_mq_handle  *output;
+				shm_mq		   *mq;
+
+				/* Ensure valid resource owner for access to dsm. */
+				if (CurrentResourceOwner == NULL)
+					CurrentResourceOwner = ResourceOwnerCreate(NULL, "ProcessCmdStatusInfoRequest");
+
+				seg = dsm_attach(slot->dsm_seg_handle);
+				if (seg == NULL)
+				{
+					elog(LOG, "unable to map dynamic memory segment for command status");
+
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					RestoreResourceOwner(prevResourceOwner);
+					continue;
+				}
+
+				toc = shm_toc_attach(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg));
+				if (toc == NULL)
+				{
+					elog(LOG, "bad magic in dynamic memory segment for command status");
+
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+					continue;
+				}
+
+				mq = shm_toc_lookup(toc, 0);
+				output = shm_mq_attach(mq, seg, NULL);
+
+				PG_TRY();
+				{
+					StringInfo		payload;
+					int				res;
+
+					payload = ProcessCmdStatusSlot(slot);
+					if (payload != NULL)
+					{
+						/*
+						 * It's important to set the processed flag and result
+						 * code *before* sending the message through the
+						 * queue.  Otherwise it may result that the receiving
+						 * backend will wake up the queue and re-use its slot
+						 * to send another status request (it doesn't matter
+						 * if that would be a request to the same backend or a
+						 * different one).  The message queue serves as a
+						 * barrier here: it should be generally safe to assume
+						 * that before we send anything to the queue (or
+						 * detach from it), the receiving backend is waiting.
+						 */
+						slot->is_processed = true;
+
+						res = shm_mq_send(output, payload->len, payload->data, false);
+						if (res != SHM_MQ_SUCCESS)
+						{
+							elog(LOG, "cannot send command status to backend with PID %d",
+								 slot->css_pid);
+							slot->is_valid = false;
+						}
+						pfree(payload->data);
+						pfree(payload);
+					}
+					else
+					{
+						slot->is_valid = false;
+					}
+
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+				}
+				PG_CATCH();
+				{
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+
+					PG_RE_THROW();
+				}
+				PG_END_TRY();
+			}
+		}
+	}
+}
+
+static void
+report_result_code(CmdStatusInfoResultCode result, pid_t target_pid)
+{
+	switch (result)
+	{
+		case CMD_STATUS_RESULT_BACKEND_IDLE:
+			elog(INFO, "no command is currently running in backend with PID %d",
+				 target_pid);
+			break;
+
+		case CMD_STATUS_RESULT_NO_COMMAND_TAG:
+			elog(WARNING, "no command tag found for the query in backend with PID %d",
+				 target_pid);
+			break;
+
+		default:
+			elog(ERROR, "general command status request failure");
+			break;
+	}
+}
+
+/*
+ * try to get status of command in another process
+ *
+ * FUNCTION get_cmdstatus(pid, request_type)
+ * RETURNS SETOF text
+ */
+Datum
+pg_cmdstatus(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	   *tupstore;
+	TupleDesc			tupdesc;
+	MemoryContext		per_query_ctx;
+	MemoryContext		oldcontext;
+	pid_t				target_pid = (pid_t) PG_GETARG_INT32(0);
+	int					request_type = PG_GETARG_INT32(1);
+	PGPROC			   *proc;
+	shm_toc_estimator	estimator;
+	Size				segsize;
+	dsm_segment		   *seg = NULL;
+	shm_toc			   *toc = NULL;
+	shm_mq			   *mq = NULL;
+	shm_mq_handle	   *input = NULL;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (request_type < 1 || request_type > CMD_STATUS_MAX_REQUEST)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("unknown command status request")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	if (target_pid == MyProcPid)
+		ereport(ERROR,
+				(errmsg("backend cannot query command status of itself")));
+
+	/* verify access to target_pid */
+	proc = BackendPidGetProc(target_pid);
+
+	if (proc == NULL)
+		ereport(ERROR,
+				(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+	if (!(superuser() || proc->roleId == GetUserId()))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser or have the same role to explain queries running in other server processes"))));
+
+	/* prepare shared dsm segment */
+	shm_toc_initialize_estimator(&estimator);
+	shm_toc_estimate_keys(&estimator, 1);
+	shm_toc_estimate_chunk(&estimator, CMD_STATUS_BUFFER_SIZE);
+	segsize = shm_toc_estimate(&estimator);
+
+	seg = dsm_create(segsize, 0);
+
+	toc = shm_toc_create(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* prepare basic structures */
+	mq = shm_mq_create(shm_toc_allocate(toc, CMD_STATUS_BUFFER_SIZE),
+					   CMD_STATUS_BUFFER_SIZE);
+
+	shm_mq_set_receiver(mq, MyProc);
+	shm_mq_set_sender(mq, proc);
+	shm_toc_insert(toc, 0, mq);
+
+	PG_TRY();
+	{
+		shm_mq_result	res;
+		char		   *data = NULL;
+		Size			length;
+
+		input = shm_mq_attach(mq, seg, NULL);
+
+		if (SendCmdStatusInfoSignal(target_pid, proc->backendId,
+									dsm_segment_handle(seg), request_type))
+		{
+			elog(ERROR, "could not signal backend with PID %d", target_pid);
+		}
+
+		/* try to read from queue */
+		res = shm_mq_receive(input, &length, (void **) &data, false);
+		if (res == SHM_MQ_SUCCESS)
+		{
+			const char	   *p = data;
+
+			/* break into tuples on newline */
+			while (*p)
+			{
+				const char *q = strchr(p, '\n');
+				Size		len = q ? q - p : strlen(p);
+				Datum		value;
+				HeapTuple	tuple;
+				bool		isnull = false;
+
+				value = PointerGetDatum(cstring_to_text_with_len(p, len));
+
+				tuple = heap_form_tuple(tupdesc, &value, &isnull);
+				tuplestore_puttuple(tupstore, tuple);
+
+				if (!q)
+					break;
+
+				p += len + 1;
+			}
+		}
+		else
+		{
+			report_result_code(MyCmdStatusSlot->result_code, target_pid);
+		}
+
+		/* clean up and return the tuplestore */
+		tuplestore_donestoring(tupstore);
+
+		InvalidateCmdStatusSlot();
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+	}
+	PG_CATCH();
+	{
+		InvalidateCmdStatusSlot();
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 23e594e..2269853 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -29,6 +29,7 @@ ProtocolVersion FrontendProtocol;
 volatile bool InterruptPending = false;
 volatile bool QueryCancelPending = false;
 volatile bool ProcDiePending = false;
+volatile bool CmdStatusInfoPending = false;
 volatile bool ClientConnectionLost = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7b19714..7e7ba77 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
+#include "utils/cmdstatus.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -588,6 +589,9 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	/* Now that we have a BackendId, we can participate in ProcSignal */
 	ProcSignalInit(MyBackendId);
 
+	/* Init CmdStatus slot */
+	CmdStatusInit(MyBackendId);
+
 	/*
 	 * Also set up timeout handlers needed for backend operation.  We need
 	 * these in every case except bootstrap.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..d083aaf 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3130,6 +3130,8 @@ DESCR("get OID of current session's temp schema, if any");
 DATA(insert OID = 2855 (  pg_is_other_temp_schema	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "26" _null_ _null_ _null_ _null_ _null_ pg_is_other_temp_schema _null_ _null_ _null_ ));
 DESCR("is schema another session's temp schema?");
 
+DATA(insert OID = 4099 ( pg_cmdstatus		PGNSP PGUID 12 1 100 0 0 f f f f f t s 2 0 25 "23 23" _null_ _null_  _null_  _null_ _null_ pg_cmdstatus _null_ _null_ _null_ ));
+DESCR("returns information about another process");
 DATA(insert OID = 2171 ( pg_cancel_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_cancel_backend _null_ _null_ _null_ ));
 DESCR("cancel a server process' current query");
 DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e0cc69f..5f03f63 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -80,6 +80,7 @@
 extern PGDLLIMPORT volatile bool InterruptPending;
 extern PGDLLIMPORT volatile bool QueryCancelPending;
 extern PGDLLIMPORT volatile bool ProcDiePending;
+extern PGDLLIMPORT volatile bool CmdStatusInfoPending;
 
 extern volatile bool ClientConnectionLost;
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index af1a0cd..cd856e5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -41,6 +41,8 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+	PROCSIG_CMD_STATUS_INFO,
+
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..605612e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1243,6 +1243,9 @@ extern Datum pg_identify_object_as_address(PG_FUNCTION_ARGS);
 /* catalog/objectaddress.c */
 extern Datum pg_get_object_address(PG_FUNCTION_ARGS);
 
+/* utils/adt/cmdstatus.c */
+extern Datum pg_cmdstatus(PG_FUNCTION_ARGS);
+
 /* commands/constraint.c */
 extern Datum unique_key_recheck(PG_FUNCTION_ARGS);
 
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
new file mode 100644
index 0000000..85627a57
--- /dev/null
+++ b/src/include/utils/cmdstatus.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.h
+ *	  Declarations for command status interrupt handling.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * src/include/utils/cmdstatus.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CMDSTATUS_H
+#define CMDSTATUS_H
+
+extern Size CmdStatusShmemSize(void);
+extern void CmdStatusShmemInit(void);
+extern void CmdStatusInit(int css_idx);
+
+extern void HandleCmdStatusInfoInterrupt(void);
+extern void ProcessCmdStatusInfoRequest(void);
+
+#endif   /* CMDSTATUS_H */
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#37)
Re: On-demand running query plans using auto_explain and signals

Hi,

I did a quick initial review of this patch today, so here are my
comments so far:

- ipcs.c should include utils/cmdstatus.h (the compiler complains
about implicit declaration of two functions)

- Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is 0x0 for this
query. Needs investigation.

- The lockless approach seems fine to me, although I think the fear
of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.

- The patch contains pretty much no documentation, both comments
at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, although the user-level API will likely change).

The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.

- I agree that pg_cmdstatus() is not the best API. Having something
like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?

- Is there a particular reason why we allocate slots for auxiliary
processes and not just for backends (NumBackends)? Do we expect those
auxiliary processes to ever use this API?

- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
the need for the second argument, or the internal slot variable. Why
not to simply use the MyCmdStatusSlot directly?

- I also don't quite understand why we need to track css_pid for the
slot? In what scenario will this actually matter?

- While being able to get EXPLAIN from the running process is nice,
I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. The are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.

From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?

ISTM we need at least two things for that to work:

(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem as a good match if this is supposed to be in core.
In that case a separate GUC seems appropriate.

(b) Being able to run the InstrEnd* methods repeatedly - the initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.

- And finally, I think we should really support all existing EXPLAIN
formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tomas Vondra (#38)
Re: On-demand running query plans using auto_explain and signals

Hi

Thank you for precious check.

2015-09-12 11:50 GMT+02:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:

Hi,

I did a quick initial review of this patch today, so here are my comments
so far:

- ipcs.c should include utils/cmdstatus.h (the compiler complains
about implicit declaration of two functions)

- Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is 0x0 for this
query. Needs investigation.

- The lockless approach seems fine to me, although I think the fear
of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.

- The patch contains pretty much no documentation, both comments
at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, although the user-level API will likely change).

The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.

- I agree that pg_cmdstatus() is not the best API. Having something
like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?

I used this simple interface because it is faster way how to get a
prototype. In this patch the most complexity is in interprocess
communication and in executor hooking. The UI can be designed later very
quickly.

There is consensus about EXPLAIN PID, for other (status, query) I don't
want to introduce new keyword, so there will be accessed via functions. But
we can introduce psql commands

\qpid - query pid and \spid - status pid

- Is there a particular reason why we allocate slots for auxiliary
processes and not just for backends (NumBackends)? Do we expect those
auxiliary processes to ever use this API?

The introduced interprocess communication (we talked about this possibility
week ago) can be used by any diagnostic extension - so aux processes can be
interesting too.

- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
the need for the second argument, or the internal slot variable. Why
not to simply use the MyCmdStatusSlot directly?

- I also don't quite understand why we need to track css_pid for the
slot? In what scenario will this actually matter?

I have to recheck it - but this is safeguard against to change process with
same backendId.

- While being able to get EXPLAIN from the running process is nice,
I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. The are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.

It can be pretty nice feature. I though about it - as next step of this
feature. But I have not idea how it is complex task - maybe not too much.

From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?

ISTM we need at least two things for that to work:

(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem as a good match if this is supposed to be in core.
In that case a separate GUC seems appropriate.

If described functionality will be implemented, then auto_explain extension
will be very thin envelop - I don't see any reason, why it should not be
integrated to core.

(b) Being able to run the InstrEnd* methods repeatedly - the initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.

- And finally, I think we should really support all existing EXPLAIN
formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.

surely - there is not any reason why not.

Regards

Pavel

Show quoted text

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tomas Vondra (#38)
Re: On-demand running query plans using auto_explain and signals

On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com

wrote:

Hi,

I did a quick initial review of this patch today, so here are my comments
so far:

Hi Tomas,

First of all, thanks for the review!

- ipcs.c should include utils/cmdstatus.h (the compiler complains

about implicit declaration of two functions)

Correct, though the file name is ipci.c.

- Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is 0x0 for this
query. Needs investigation.

Yes, I've hit the same problem after submitting the latest version of the
patch. For now I've just added a check for queryDesc being not NULL, but I
guess the top of the current_query_stack might contains something useful.
Something I need to try.

- The lockless approach seems fine to me, although I think the fear

of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.

I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it doesn't
prevent the rest of them from operating, because there is no shared (and
thus locked) communication channel.

- The patch contains pretty much no documentation, both comments

at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, although the user-level API will likely change).

The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.

I will add the code comments. The user docs could wait before we decide on
the interface, I think.

- I agree that pg_cmdstatus() is not the best API. Having something

like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?

I can think of something like:

EXPLAIN [ ( option [, ...] ) ] PROCESS <PID>;

where option is extended with:

QUERY
PROGRESS
BACKTRACE

in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.

- Is there a particular reason why we allocate slots for auxiliary

processes and not just for backends (NumBackends)? Do we expect those
auxiliary processes to ever use this API?

If we extend the interface to a more general one, there still might be some
space for querying status of checkpointer of bgwriter.

- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see

the need for the second argument, or the internal slot variable. Why
not to simply use the MyCmdStatusSlot directly?

Good point.

- I also don't quite understand why we need to track css_pid for the

slot? In what scenario will this actually matter?

I think it's being only used for error reporting and could help in
debugging, but for now that's it.

- While being able to get EXPLAIN from the running process is nice,

I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. The are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.

From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?

ISTM we need at least two things for that to work:

(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem as a good match if this is supposed to be in core.
In that case a separate GUC seems appropriate.

(b) Being able to run the InstrEnd* methods repeatedly - the initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.

I was able to make this work with a simple change to InstrEndLoop and the
callers. Basically, adding a bool parameter in_explain and passing an
appropriate value. I guess that's not the best approach, but it appears to
work.

Adding a GUC to enable instrumentation sounds reasonable.

Do you believe it makes sense to add instrumentation support in this same
patch or better focus on making the simplest thing work first?

- And finally, I think we should really support all existing EXPLAIN

formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.

Sure, that was in my plans (and see above for possible syntax). What would
be really neat is retrieving the complete backtrace. Not sure what the
good interface would look like, but using JSON format for the output sounds
promising.

--
Alex

#41Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#40)
Re: On-demand running query plans using auto_explain and signals

2015-09-14 10:23 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

I can think of something like:

EXPLAIN [ ( option [, ...] ) ] PROCESS <PID>;

where option is extended with:

QUERY
PROGRESS
BACKTRACE

in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.

It can work

Regards

Pavel

#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#40)
Re: On-demand running query plans using auto_explain and signals

On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:

On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:

...

- Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is 0x0 for this
query. Needs investigation.

Yes, I've hit the same problem after submitting the latest version of
the patch. For now I've just added a check for queryDesc being not
NULL, but I guess the top of the current_query_stack might contains
something useful. Something I need to try.

Well, the thing is we're able to do EXPLAIN on those queries, and IIRC
auto_explain can log them too. So perhaps look into the hooks where they
take the queryDesc in those cases - it has to be available somewhere.

- The lockless approach seems fine to me, although I think the fear
of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.

I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it doesn't
prevent the rest of them from operating, because there is no shared (and
thus locked) communication channel.

Sure, but I think it really deserves a bit more detailed explanation of
the protocol, and discussion of the expected behavior for some basic
failure types.

For example - what happens when a backend requests info, but dies before
receiving it, and the backed is reused for another connection? Doesn't
this introduce a race condition? Perhaps not, I'm just asking.

- The patch contains pretty much no documentation, both comments
at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, although the user-level API will likely change).

The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.

I will add the code comments. The user docs could wait before we decide
on the interface, I think.

Agreed, although I think having rudimentary user documentation would be
useful for the reviewers - a summary of the goals that are a bit
scattered over the e-mail thread.

- I agree that pg_cmdstatus() is not the best API. Having something
like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?

I can think of something like:

EXPLAIN [ ( option [, ...] ) ] PROCESS <PID>;

where option is extended with:

QUERY
PROGRESS
BACKTRACE

in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.

Seems OK. I'm not quite sure what PROGRESS is - I assume it's the same
thing as ANALYZE? Why not to use the keyword, then?

- Is there a particular reason why we allocate slots for auxiliary
processes and not just for backends (NumBackends)? Do we expect those
auxiliary processes to ever use this API?

If we extend the interface to a more general one, there still might be
some space for querying status of checkpointer of bgwriter.

I don't think we should mix this with monitoring of auxiliary processes.
This interface is designed at monitoring SQL queries running in other
backends, effectively "remote" EXPLAIN. But those auxiliary processes
are not processing SQL queries at all, they're not even using regular
executor ...

OTOH the ability to request this info (e.g. auxiliary process looking at
plans running in backends) seems useful, so I'm ok with tuple slots for
auxiliary processes.

- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see
the need for the second argument, or the internal slot variable. Why
not to simply use the MyCmdStatusSlot directly?

Good point.

- I also don't quite understand why we need to track css_pid for the
slot? In what scenario will this actually matter?

I think it's being only used for error reporting and could help in
debugging, but for now that's it.

I recommend getting rid of it, unless we have a clear idea of how to use
it. Otherwise I envision we won't really keep it updated properly
(because it's "just for debugging"), and then one day someone actually
starts using it and get bitten by it.

- While being able to get EXPLAIN from the running process is nice,
I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. The are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.

From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?

ISTM we need at least two things for that to work:

(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem as a good match if this is supposed to be in core.
In that case a separate GUC seems appropriate.

(b) Being able to run the InstrEnd* methods repeatedly - the initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.

I was able to make this work with a simple change to InstrEndLoop and
the callers. Basically, adding a bool parameter in_explain and passing
an appropriate value. I guess that's not the best approach, but it
appears to work.

Adding a GUC to enable instrumentation sounds reasonable.

Nice!

Do you believe it makes sense to add instrumentation support in this
same patch or better focus on making the simplest thing work first?

Let's make it a "patch series" with two patches.

- And finally, I think we should really support all existing EXPLAIN
formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.

Sure, that was in my plans (and see above for possible syntax). What
would be really neat is retrieving the complete backtrace. Not sure
what the good interface would look like, but using JSON format for the
output sounds promising.

I don't quite see the reason to encode everything as JSON, because that
makes it a bit difficult for clients that would prefer YAML, for
example. Why not to just use the requested format? For example in YAML,
we can do something like this:

QUERY PLAN
----------------------------------
- Plan[0]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[1]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[2]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +

and similarly for other formats. We don't really use nesting.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tomas Vondra (#42)
Re: On-demand running query plans using auto_explain and signals

On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com

wrote:

On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:

On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>
wrote:

...

- Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is 0x0 for
this
query. Needs investigation.

Yes, I've hit the same problem after submitting the latest version of
the patch. For now I've just added a check for queryDesc being not
NULL, but I guess the top of the current_query_stack might contains
something useful. Something I need to try.

Well, the thing is we're able to do EXPLAIN on those queries, and IIRC
auto_explain can log them too. So perhaps look into the hooks where they
take the queryDesc in those cases - it has to be available somewhere.

Yes, this seems to work. I was able to successfully query "create table
as" and even "explain analyze" for the plans. In both cases ActivePortal
doesn't have a valid queryDesc, but the executor hook captures one.

- The lockless approach seems fine to me, although I think the fear

of performance issues is a bit moot (I don't think we expect large
number of processes calling pg_cmdstatus at the same time). But
it's not significantly more complex, so why not.

I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it doesn't
prevent the rest of them from operating, because there is no shared (and
thus locked) communication channel.

Sure, but I think it really deserves a bit more detailed explanation of
the protocol, and discussion of the expected behavior for some basic
failure types.

For example - what happens when a backend requests info, but dies before
receiving it, and the backed is reused for another connection? Doesn't this
introduce a race condition? Perhaps not, I'm just asking.

Now explained better in code comments.

The worst thing that could happen in this case I believe is that the
requesting backend will not receive any response from the second use of its
slot due to the slot being marked as processed by the backend that was
asked first:

A:
slot->is_processed = false;
slot->is_valid = true;
/* signals backend B */;
shm_mq_read(); /* blocks */

B: /* finds the slot and prepares the response */

A: /* decides to bail out */
InvalidateCmdStatusSlot();
shm_mq_detach();
/* exits pg_cmdstatus() */

/* pg_cmdstatus() is called again */
/* initializes the slot and signals some backend */
shm_mq_read(); /* blocks */

B: /* completes preparing the response */
slot->is_processed = true;
shm_mq_send() => SHM_MQ_DETACHED
slot->is_valid = false;
/* gives up with this slot */

Now the backend that has been signaled on the second call to pg_cmdstatus
(it can be either some other backend, or the backend B again) will not find
an unprocessed slot, thus it will not try to attach/detach the queue and
the backend A will block forever.

This requires a really bad timing and the user should be able to interrupt
the querying backend A still.

In any case, the backends that are being asked to send the info will be
able to notice the problem (receiver detached early) and handle it
gracefully.

- The patch contains pretty much no documentation, both comments

at the code level and user docs. The lack of user docs is not that
a big deal at this point (although the patch seems to be mature
enough, although the user-level API will likely change).

The lack of code comments is more serious, as it makes the review
somewhat more difficult. For example it'd be very nice to document
the contract for the lock-less interface.

I will add the code comments. The user docs could wait before we decide
on the interface, I think.

Agreed, although I think having rudimentary user documentation would be
useful for the reviewers - a summary of the goals that are a bit scattered
over the e-mail thread.

OK

- I agree that pg_cmdstatus() is not the best API. Having something

like EXPLAIN PID would be nice, but it does not really work for
all the request types (just CMD_STATUS_REQUEST_EXPLAIN). Maybe
there's not a single API for all cases, i.e. we should use EXPLAIN
PID for one case and invent something different for the other?

I can think of something like:

EXPLAIN [ ( option [, ...] ) ] PROCESS <PID>;

where option is extended with:

QUERY
PROGRESS
BACKTRACE

in addition to the usual ANALYZE, VERBOSE, FORMAT, etc.

Seems OK. I'm not quite sure what PROGRESS is - I assume it's the same
thing as ANALYZE? Why not to use the keyword, then?

This is what CMD_INFO_PROGRESS_TAG does: return the "partial completion"
command tag, like "INSERT 12345", reflecting the number of tuples processed
up to this point in time.

- Is there a particular reason why we allocate slots for auxiliary

processes and not just for backends (NumBackends)? Do we expect
those
auxiliary processes to ever use this API?

If we extend the interface to a more general one, there still might be
some space for querying status of checkpointer of bgwriter.

I don't think we should mix this with monitoring of auxiliary processes.
This interface is designed at monitoring SQL queries running in other
backends, effectively "remote" EXPLAIN. But those auxiliary processes are
not processing SQL queries at all, they're not even using regular executor
...

OTOH the ability to request this info (e.g. auxiliary process looking at
plans running in backends) seems useful, so I'm ok with tuple slots for
auxiliary processes.

Now that I think about it, reserving the slots for aux process doesn't let
us query their status, it's the other way round. If we don't reserve them,
then aux process would not be able to query any other process for the
status. Likely this is not a problem at all, so we can remove these extra
slots.

- CleanupCmdStatusSlot seems needlessly complicated. I don't quite see

the need for the second argument, or the internal slot variable.
Why
not to simply use the MyCmdStatusSlot directly?

Good point.

- I also don't quite understand why we need to track css_pid for the
slot? In what scenario will this actually matter?

I think it's being only used for error reporting and could help in
debugging, but for now that's it.

I recommend getting rid of it, unless we have a clear idea of how to use
it. Otherwise I envision we won't really keep it updated properly (because
it's "just for debugging"), and then one day someone actually starts using
it and get bitten by it.

The design was copied over from procsignal.c, where the pss_pid is used to
"claim" the procsignal slot. There it's used for searching the target
process slot when backend id is not known. We don't use that in cmd status
slots because of the inverted slots logic, so we could really remove this
field.

- While being able to get EXPLAIN from the running process is nice,

I'm especially interested in getting EXPLAIN ANALYZE to get insight
into the progress of the execution. The are other ways to get the
EXPLAIN, e.g. by opening a different connection and actually
running
it (sure, the plan might have changed since then), but currently
there's no way to get insight into the progress.

From the thread I get the impression that Oleksandr also finds this
useful - correct? What are the plans in this direction?

ISTM we need at least two things for that to work:

(a) Ability to enable instrumentation on all queries (effectively
what auto_explain allows), otherwise we can't get EXPLAIN
ANALYZE
on the queries later. But auto_explain is an extension, so that
does not seem as a good match if this is supposed to be in
core.
In that case a separate GUC seems appropriate.

(b) Being able to run the InstrEnd* methods repeatedly - the
initial
message in this thread mentions issues with InstrEndLoop for
example. So perhaps this is non-trivial.

I was able to make this work with a simple change to InstrEndLoop and
the callers. Basically, adding a bool parameter in_explain and passing
an appropriate value. I guess that's not the best approach, but it
appears to work.

Adding a GUC to enable instrumentation sounds reasonable.

Nice!

Do you believe it makes sense to add instrumentation support in this
same patch or better focus on making the simplest thing work first?

Let's make it a "patch series" with two patches.

OK, I'll submit them later today after addressing all the comments.

- And finally, I think we should really support all existing EXPLAIN

formats, not just text. We need to support the other formats (yaml,
json, xml) if we want to use the EXPLAIN PID approach, and it also
makes the plans easier to process by additional tools.

Sure, that was in my plans (and see above for possible syntax). What
would be really neat is retrieving the complete backtrace. Not sure
what the good interface would look like, but using JSON format for the
output sounds promising.

I don't quite see the reason to encode everything as JSON, because that
makes it a bit difficult for clients that would prefer YAML, for example.
Why not to just use the requested format? For example in YAML, we can do
something like this:

QUERY PLAN
----------------------------------
- Plan[0]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[1]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[2]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +

and similarly for other formats. We don't really use nesting.

You're right, the nesting structure is too simple to require a certain
format. Do you have ideas about representing stack frames in TEXT format?
Something like what gdb does with #0, #1, etc. markers for frame depth?

--
Alex

#44Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#43)
Re: On-demand running query plans using auto_explain and signals

On 09/14/2015 01:15 PM, Shulgin, Oleksandr wrote:

On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:

On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:

On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>
<mailto:tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>>> wrote:

...

- Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is
0x0 for this
query. Needs investigation.

Yes, I've hit the same problem after submitting the latest
version of
the patch. For now I've just added a check for queryDesc being not
NULL, but I guess the top of the current_query_stack might contains
something useful. Something I need to try.

Well, the thing is we're able to do EXPLAIN on those queries, and
IIRC auto_explain can log them too. So perhaps look into the hooks
where they take the queryDesc in those cases - it has to be
available somewhere.

Yes, this seems to work. I was able to successfully query "create table
as" and even "explain analyze" for the plans. In both cases
ActivePortal doesn't have a valid queryDesc, but the executor hook
captures one.

- The lockless approach seems fine to me, although I think
the fear
of performance issues is a bit moot (I don't think we
expect large
number of processes calling pg_cmdstatus at the same
time). But
it's not significantly more complex, so why not.

I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it
doesn't
prevent the rest of them from operating, because there is no
shared (and
thus locked) communication channel.

Sure, but I think it really deserves a bit more detailed explanation
of the protocol, and discussion of the expected behavior for some
basic failure types.

For example - what happens when a backend requests info, but dies
before receiving it, and the backed is reused for another
connection? Doesn't this introduce a race condition? Perhaps not,
I'm just asking.

Now explained better in code comments.

The worst thing that could happen in this case I believe is that the
requesting backend will not receive any response from the second use of
its slot due to the slot being marked as processed by the backend that
was asked first:

A:
slot->is_processed = false;
slot->is_valid = true;
/* signals backend B */;
shm_mq_read(); /* blocks */

B: /* finds the slot and prepares the response */

A: /* decides to bail out */
InvalidateCmdStatusSlot();
shm_mq_detach();
/* exits pg_cmdstatus() */

/* pg_cmdstatus() is called again */
/* initializes the slot and signals some backend */
shm_mq_read(); /* blocks */

B: /* completes preparing the response */
slot->is_processed = true;
shm_mq_send() => SHM_MQ_DETACHED
slot->is_valid = false;
/* gives up with this slot */

Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.

This requires a really bad timing and the user should be able to
interrupt the querying backend A still.

I think we can't rely on the low probability that this won't happen, and
we should not rely on people interrupting the backend. Being able to
detect the situation and fail gracefully should be possible.

It may be possible to introduce some lock-less protocol preventing such
situations, but it's not there at the moment. If you believe it's
possible, you need to explain and "prove" that it's actually safe.

Otherwise we may need to introduce some basic locking - for example we
may introduce a LWLock for each slot, and lock it with dontWait=true
(and skip it if we couldn't lock it). This should prevent most scenarios
where one corrupted slot blocks many processes.

In any case, the backends that are being asked to send the info will be
able to notice the problem (receiver detached early) and handle it
gracefully.

Ummm, how? Maybe I missed something?

I don't think we should mix this with monitoring of auxiliary
processes. This interface is designed at monitoring SQL queries
running in other backends, effectively "remote" EXPLAIN. But those
auxiliary processes are not processing SQL queries at all, they're
not even using regular executor ...

OTOH the ability to request this info (e.g. auxiliary process
looking at plans running in backends) seems useful, so I'm ok with
tuple slots for auxiliary processes.

Now that I think about it, reserving the slots for aux process doesn't
let us query their status, it's the other way round. If we don't
reserve them, then aux process would not be able to query any other
process for the status. Likely this is not a problem at all, so we can
remove these extra slots.

I don't know. I can imagine using this from background workers, but I
think those are counted as regular backends (not sure though).

- I also don't quite understand why we need to track
css_pid for the
slot? In what scenario will this actually matter?

I think it's being only used for error reporting and could help in
debugging, but for now that's it.

I recommend getting rid of it, unless we have a clear idea of how to
use it. Otherwise I envision we won't really keep it updated
properly (because it's "just for debugging"), and then one day
someone actually starts using it and get bitten by it.

The design was copied over from procsignal.c, where the pss_pid is used
to "claim" the procsignal slot. There it's used for searching the
target process slot when backend id is not known. We don't use that in
cmd status slots because of the inverted slots logic, so we could really
remove this field.

OK.

I don't quite see the reason to encode everything as JSON, because
that makes it a bit difficult for clients that would prefer YAML,
for example. Why not to just use the requested format? For example
in YAML, we can do something like this:

QUERY PLAN
----------------------------------
- Plan[0]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[1]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +
+
- Plan[2]: +
Node Type: "Index Only Scan"+
Scan Direction: "Forward" +
... +

and similarly for other formats. We don't really use nesting.

You're right, the nesting structure is too simple to require a certain
format. Do you have ideas about representing stack frames in TEXT
format? Something like what gdb does with #0, #1, etc. markers for
frame depth?

Yes. The simpler the format, the better.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#45Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tomas Vondra (#44)
Re: On-demand running query plans using auto_explain and signals

2015-09-14 14:11 GMT+02:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:

On 09/14/2015 01:15 PM, Shulgin, Oleksandr wrote:

On Mon, Sep 14, 2015 at 11:53 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>
wrote:

On 09/14/2015 10:23 AM, Shulgin, Oleksandr wrote:

On Sat, Sep 12, 2015 at 11:50 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>
<mailto:tomas.vondra@2ndquadrant.com

<mailto:tomas.vondra@2ndquadrant.com>>> wrote:

...

- Attempts to get plan for simple insert queries like this

INSERT INTO x SELECT * FROM x;

end with a segfault, because ActivePortal->queryDesc is
0x0 for this
query. Needs investigation.

Yes, I've hit the same problem after submitting the latest
version of
the patch. For now I've just added a check for queryDesc being
not
NULL, but I guess the top of the current_query_stack might
contains
something useful. Something I need to try.

Well, the thing is we're able to do EXPLAIN on those queries, and
IIRC auto_explain can log them too. So perhaps look into the hooks
where they take the queryDesc in those cases - it has to be
available somewhere.

Yes, this seems to work. I was able to successfully query "create table
as" and even "explain analyze" for the plans. In both cases
ActivePortal doesn't have a valid queryDesc, but the executor hook
captures one.

- The lockless approach seems fine to me, although I think
the fear
of performance issues is a bit moot (I don't think we
expect large
number of processes calling pg_cmdstatus at the same
time). But
it's not significantly more complex, so why not.

I believe the main benefit of the less-locking approach is that if
something goes wrong when two backends tried to communicate it
doesn't
prevent the rest of them from operating, because there is no
shared (and
thus locked) communication channel.

Sure, but I think it really deserves a bit more detailed explanation
of the protocol, and discussion of the expected behavior for some
basic failure types.

For example - what happens when a backend requests info, but dies
before receiving it, and the backed is reused for another
connection? Doesn't this introduce a race condition? Perhaps not,
I'm just asking.

Now explained better in code comments.

The worst thing that could happen in this case I believe is that the
requesting backend will not receive any response from the second use of
its slot due to the slot being marked as processed by the backend that
was asked first:

A:
slot->is_processed = false;
slot->is_valid = true;
/* signals backend B */;
shm_mq_read(); /* blocks */

B: /* finds the slot and prepares the response */

A: /* decides to bail out */
InvalidateCmdStatusSlot();
shm_mq_detach();
/* exits pg_cmdstatus() */

/* pg_cmdstatus() is called again */
/* initializes the slot and signals some backend */
shm_mq_read(); /* blocks */

B: /* completes preparing the response */
slot->is_processed = true;
shm_mq_send() => SHM_MQ_DETACHED
slot->is_valid = false;
/* gives up with this slot */

hmm .. yes - there have to be check if slot is related still to queue
before to change is_valid attribute.

Regards

Pavel

#46Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tomas Vondra (#44)
Re: On-demand running query plans using auto_explain and signals

On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.

This requires a really bad timing and the user should be able to
interrupt the querying backend A still.

I think we can't rely on the low probability that this won't happen, and
we should not rely on people interrupting the backend. Being able to detect
the situation and fail gracefully should be possible.

It may be possible to introduce some lock-less protocol preventing such
situations, but it's not there at the moment. If you believe it's possible,
you need to explain and "prove" that it's actually safe.

Otherwise we may need to introduce some basic locking - for example we may
introduce a LWLock for each slot, and lock it with dontWait=true (and skip
it if we couldn't lock it). This should prevent most scenarios where one
corrupted slot blocks many processes.

OK, I will revisit this part then.

In any case, the backends that are being asked to send the info will be

able to notice the problem (receiver detached early) and handle it
gracefully.

Ummm, how? Maybe I missed something?

Well, I didn't attach the updated patch (doing that now). The basic idea
is that when the backend that has requested information bails out
prematurely it still detaches from the shared memory queue. This makes it
possible for the backend being asked to detect the situation either before
attaching to the queue or when trying to send the data, so it won't be
blocked forever if the other backend failed to wait.

I don't think we should mix this with monitoring of auxiliary

processes. This interface is designed at monitoring SQL queries
running in other backends, effectively "remote" EXPLAIN. But those
auxiliary processes are not processing SQL queries at all, they're
not even using regular executor ...

OTOH the ability to request this info (e.g. auxiliary process
looking at plans running in backends) seems useful, so I'm ok with
tuple slots for auxiliary processes.

Now that I think about it, reserving the slots for aux process doesn't
let us query their status, it's the other way round. If we don't
reserve them, then aux process would not be able to query any other
process for the status. Likely this is not a problem at all, so we can
remove these extra slots.

I don't know. I can imagine using this from background workers, but I
think those are counted as regular backends (not sure though).

MaxBackends includes the background workers, yes.

--
Alex

#47Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#46)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

Well, I didn't attach the updated patch (doing that now).

This time for real. Sorry, it's Monday :-p

Attachments:

explain-pid-v4.2.patchtext/x-patch; charset=US-ASCII; name=explain-pid-v4.2.patchDownload
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..2e3beaf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -43,6 +43,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "utils/cmdstatus.h"
 
 
 shmem_startup_hook_type shmem_startup_hook = NULL;
@@ -139,6 +140,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, CmdStatusShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
+	CmdStatusShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..e637be1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMD_STATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d917af3..1a5b03c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoPending)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 0000000..5cc6d51
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,691 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/procsignal.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_DATA
+} CmdStatusInfoResultCode;
+
+/*
+ * Each process that wants to be able to query other processes for their
+ * status registers itself in a slot in the shared memory by setting the
+ * slot's process ID.  The slots array is indexed by backend ID, so any slot
+ * can be assigned at max to one backend at any given time.
+ *
+ * In order to actually query some backend for its status, the interested
+ * process will initialize its *own* slot, create a DSM segment and initialize
+ * a shared memory queue in it, then send the SIGUSR1 with the reason
+ * PROCSIG_CMD_STATUS_INFO to the process being queried.  Then the interested
+ * process will wait for the response to be delivered via the shared memory
+ * queue.
+ */
+typedef struct
+{
+	pid_t			css_pid;			/* the ID of process asking for the status */
+	pid_t			sender_css_pid;		/* the ID of process that sends back the result */
+	dsm_handle		dsm_seg_handle;
+	CmdStatusInfoRequestType	request_type;
+	CmdStatusInfoResultCode		result_code;
+	bool			is_valid;
+	bool			is_processed;
+} CmdStatusSlot;
+
+/* TODO: do we really want the aux processes to be able to query other backends? */
+#define NumCmdStatusSlots	(MaxBackends + NUM_AUXPROCTYPES)
+
+#define PG_CMD_STATUS_INFO_MAGIC	0x79fb2449
+#define CMD_STATUS_BUFFER_SIZE		1024
+
+/*
+ * These structs are allocated on the program stack as local variables in the
+ * ExecutorRun hook.  The top of stack is current_query_stack, see below.
+ *
+ * Capturing the execution stack allows us to inspect the inner-most running
+ * query as well as showing the complete backtrace.
+ */
+typedef struct CmdInfoStack {
+	QueryDesc			   *query_desc;
+	struct CmdInfoStack	   *parent;
+} CmdInfoStack;
+
+/* The array of slots pre-allocated on shared memory. */
+static CmdStatusSlot *CmdStatusSlots = NULL;
+
+static volatile CmdStatusSlot *MyCmdStatusSlot = NULL;
+
+static CmdInfoStack	   *current_query_stack = NULL;
+static int				query_stack_size = 0;  /* XXX not really used */
+
+static ExecutorRun_hook_type	prev_ExecutorRun = NULL;
+
+static void cmdstatus_ExecutorRun(QueryDesc *queryDesc,
+								  ScanDirection direction, long count);
+
+static void CleanupCmdStatusSlot(int status, Datum arg);
+static int SendCmdStatusInfoSignal(pid_t target_pid, BackendId target_backend_id,
+						dsm_handle dsm_seg_handle,
+						CmdStatusInfoRequestType request_type);
+static void InvalidateCmdStatusSlot(void);
+static StringInfo ProcessCmdStatusSlot(volatile CmdStatusSlot *slot);
+
+static StringInfo explain_query(QueryDesc *queryDesc);
+static void report_result_code(CmdStatusInfoResultCode result, pid_t target_pid);
+static void RestoreResourceOwner(ResourceOwner prevResourceOwner);
+
+
+Size
+CmdStatusShmemSize(void)
+{
+	return NumCmdStatusSlots * sizeof(CmdStatusSlot);
+}
+
+void
+CmdStatusShmemInit(void)
+{
+	Size		size = CmdStatusShmemSize();
+	bool		found;
+
+	CmdStatusSlots = (CmdStatusSlot *)
+		ShmemInitStruct("CmdStatusSlots", size, &found);
+
+	if (!found)
+		MemSet(CmdStatusSlots, 0, size);
+}
+
+static void
+CleanupCmdStatusSlot(int status, Datum arg)
+{
+	MemSet(MyCmdStatusSlot, 0, sizeof(CmdStatusSlot));
+	MyCmdStatusSlot = NULL;
+}
+
+/*
+ * CmdStatusInit
+ *		Register the current process in the CmdStatusSlots array
+ *
+ * The passed index should be MyBackendId.
+ */
+void
+CmdStatusInit(int css_idx)
+{
+	volatile CmdStatusSlot *slot;
+
+	slot = &CmdStatusSlots[css_idx - 1];
+
+	if (slot->css_pid != 0)
+		elog(LOG, "process %d taking over CmdStatus slot %d, but it's not empty",
+			 MyProcPid, css_idx);
+
+	slot->css_pid = MyProcPid;
+	slot->is_valid = false;
+
+	MyCmdStatusSlot = slot;
+
+	on_shmem_exit(CleanupCmdStatusSlot, (Datum) 0);
+
+	/* also install executor hooks */
+	prev_ExecutorRun = ExecutorRun_hook;
+	ExecutorRun_hook = cmdstatus_ExecutorRun;
+}
+
+/* Prepare the slot on receiving side and signal the target process. */
+static int
+SendCmdStatusInfoSignal(pid_t target_pid, BackendId target_backend_id,
+						dsm_handle dsm_seg_handle,
+						CmdStatusInfoRequestType request_type)
+{
+	MyCmdStatusSlot->sender_css_pid = target_pid;
+	MyCmdStatusSlot->dsm_seg_handle = dsm_seg_handle;
+	MyCmdStatusSlot->request_type = request_type;
+	MyCmdStatusSlot->is_processed = false;
+	MyCmdStatusSlot->is_valid = true;		/* this should be set the latest */
+
+	return SendProcSignal(target_pid, PROCSIG_CMD_STATUS_INFO, target_backend_id);
+}
+
+/* Clean up the slot for next use by this process. */
+static void
+InvalidateCmdStatusSlot(void)
+{
+	MyCmdStatusSlot->is_valid = false;
+	MyCmdStatusSlot->is_processed = false;
+	MyCmdStatusSlot->sender_css_pid = 0;
+	MyCmdStatusSlot->dsm_seg_handle = 0;
+	MyCmdStatusSlot->request_type = 0;
+}
+
+/*
+ * The executor hook.
+ *
+ * Accumulates the query descriptors on the program stack and takes care of
+ * popping the current frame when leaving the function ab-/normally.
+ */
+static void
+cmdstatus_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+{
+	CmdInfoStack	current;
+
+	current.query_desc = queryDesc;
+	current.parent = current_query_stack;
+
+	current_query_stack = &current;
+	query_stack_size++;
+
+	PG_TRY();
+	{
+		if (prev_ExecutorRun)
+			prev_ExecutorRun(queryDesc, direction, count);
+		else
+			standard_ExecutorRun(queryDesc, direction, count);
+
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+	}
+	PG_CATCH();
+	{
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+/* Produce an explain plan of a single query descriptor. */
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	StringInfo		str;
+	ExplainState   *es;
+
+	es = NewExplainState();
+	es->analyze = false;
+	es->verbose = false;
+	es->buffers = false;
+	es->format = EXPLAIN_FORMAT_TEXT;
+
+	ExplainBeginOutput(es);
+	/* XXX: appendStringInfo(es->str, "#%d ", depth); ? */
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	ExplainEndOutput(es);
+
+	str = es->str;
+
+	pfree(es);
+	return str;
+}
+
+
+/* signal handler for PROCSIG_CMD_STATUS_INFO */
+void
+HandleCmdStatusInfoInterrupt(void)
+{
+	CmdStatusInfoPending = true;
+	InterruptPending = true;
+
+	SetLatch(MyLatch);
+}
+
+/* Produce a response and set the result code for a single slot. */
+static StringInfo
+ProcessCmdStatusSlot(volatile CmdStatusSlot *slot)
+{
+	StringInfo		result = NULL;
+
+	/* Show some optimism, overwrite with error code later if needed. */
+	slot->result_code = CMD_STATUS_RESULT_SUCCESS;
+
+	if (ActivePortal)
+	{
+		switch (slot->request_type)
+		{
+			case CMD_STATUS_REQUEST_EXPLAIN:
+				if (ActivePortal->queryDesc != NULL)
+					result = explain_query(ActivePortal->queryDesc);
+				else if (current_query_stack != NULL)
+					result = explain_query(current_query_stack->query_desc);
+				else
+					slot->result_code = CMD_STATUS_RESULT_NO_DATA;
+				break;
+
+			case CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE:
+				{
+					StringInfo		str;
+					CmdInfoStack   *query;
+
+					result = makeStringInfo();
+
+					if (current_query_stack != NULL)
+					{
+						for (query = current_query_stack;
+							 query != NULL;
+							 query = query->parent)
+						{
+							str = explain_query(query->query_desc);
+							appendBinaryStringInfo(result, str->data, str->len);
+
+							pfree(str->data);
+							pfree(str);
+						}
+					}
+					else
+						slot->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+					break;
+				}
+
+			case CMD_STATUS_REQUEST_QUERY_TEXT:
+				result = makeStringInfo();
+				appendStringInfoString(result, ActivePortal->sourceText);
+				break;
+
+			case CMD_STATUS_REQUEST_PROGRESS_TAG:
+				if (ActivePortal->commandTag != NULL)
+				{
+					result = makeStringInfo();
+
+					if (ActivePortal->queryDesc != NULL &&
+						ActivePortal->queryDesc->estate != NULL)
+					{
+						appendStringInfo(result, "%s %u",
+										 ActivePortal->commandTag,
+										 ActivePortal->queryDesc->estate->es_processed);
+					}
+					else
+					{
+						/* no progress available, at least show the command tag */
+						appendStringInfoString(result, ActivePortal->commandTag);
+					}
+				}
+				else
+					slot->result_code = CMD_STATUS_RESULT_NO_DATA;
+				break;
+		}
+	}
+	else
+		slot->result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+
+	return result;
+}
+
+static void
+RestoreResourceOwner(ResourceOwner prevResourceOwner)
+{
+	if (prevResourceOwner != CurrentResourceOwner)
+	{
+		ResourceOwner	res_owner = CurrentResourceOwner;
+
+		CurrentResourceOwner = prevResourceOwner;
+		ResourceOwnerDelete(res_owner);
+	}
+}
+
+/*
+ * Process all command status requests from other backends by scanning the
+ * whole array of slots looking for the requests targeted at us, which were
+ * not invalidated or already processed by us during earlier rounds.
+ */
+void
+ProcessCmdStatusInfoRequest(void)
+{
+	/*
+	 * We might have been signaled again by another process while handling a
+	 * request, so re-check after going through all the slots.
+	 */
+	while (CmdStatusInfoPending)
+	{
+		int		i;
+
+		CmdStatusInfoPending = false;
+
+		for (i = 0; i < NumCmdStatusSlots; i++)
+		{
+			volatile CmdStatusSlot *slot = &CmdStatusSlots[i];
+
+			if (slot->sender_css_pid == MyProcPid &&
+				slot->is_valid && !slot->is_processed)
+			{
+				ResourceOwner	prevResourceOwner = CurrentResourceOwner;
+				dsm_segment	   *seg = NULL;
+				shm_toc		   *toc = NULL;
+				shm_mq_handle  *output;
+				shm_mq		   *mq;
+
+				/* Ensure valid resource owner for access to dsm. */
+				if (CurrentResourceOwner == NULL)
+					CurrentResourceOwner = ResourceOwnerCreate(NULL, "ProcessCmdStatusInfoRequest");
+
+				seg = dsm_attach(slot->dsm_seg_handle);
+				if (seg == NULL)
+				{
+					elog(LOG, "unable to map dynamic memory segment for command status");
+
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					RestoreResourceOwner(prevResourceOwner);
+					continue;
+				}
+
+				toc = shm_toc_attach(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg));
+				if (toc == NULL)
+				{
+					elog(LOG, "bad magic in dynamic memory segment for command status");
+
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+					continue;
+				}
+
+				mq = shm_toc_lookup(toc, 0);
+				output = shm_mq_attach(mq, seg, NULL);
+
+				PG_TRY();
+				{
+					StringInfo		payload;
+					int				res;
+
+					payload = ProcessCmdStatusSlot(slot);
+					if (payload != NULL)
+					{
+						/*
+						 * It's important to set the processed flag and result
+						 * code *before* sending the message through the
+						 * queue.  Otherwise it may result that the receiving
+						 * backend will wake up the queue and re-use its slot
+						 * to send another status request (it doesn't matter
+						 * if that would be a request to the same backend or a
+						 * different one).
+						 *
+						 * The message queue serves as a barrier here: it
+						 * should be generally safe to assume that before we
+						 * send anything to the queue or detach from it, the
+						 * receiving backend is waiting.
+						 */
+						slot->is_processed = true;
+
+						/*
+						 * Send the data, including null terminator.
+						 *
+						 * This blocks until the receiving side completes
+						 * reading or detaches from the queue.
+						 */
+						res = shm_mq_send(output, payload->len + 1, payload->data, false);
+						if (res != SHM_MQ_SUCCESS)
+						{
+							elog(LOG, "cannot send command status to backend with PID %d",
+								 slot->css_pid);
+							slot->is_valid = false;
+						}
+						pfree(payload->data);
+						pfree(payload);
+					}
+					else
+					{
+						slot->is_valid = false;
+
+						/*
+						 * If there was no payload to send to the receiver, we
+						 * just detach from the queue and the other side
+						 * should wake up.
+						 */
+					}
+
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+				}
+				PG_CATCH();
+				{
+					slot->is_processed = true;
+					slot->is_valid = false;
+
+					/*
+					 * Detaching from the queue also wakes up the receiver, be
+					 * sure to update the slot status before detaching.
+					 */
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);
+
+					PG_RE_THROW();	/* XXX is it better to log/suppress the error here? */
+				}
+				PG_END_TRY();
+			}
+		}
+	}
+}
+
+static void
+report_result_code(CmdStatusInfoResultCode result, pid_t target_pid)
+{
+	switch (result)
+	{
+		case CMD_STATUS_RESULT_BACKEND_IDLE:
+			elog(INFO, "no command is currently running in backend with PID %d",
+				 target_pid);
+			break;
+
+	case CMD_STATUS_RESULT_NO_DATA:
+			elog(WARNING, "no suitable data found for the query in backend with PID %d",
+				 target_pid);
+			break;
+
+		default:
+			elog(ERROR, "general command status request failure");
+			break;
+	}
+}
+
+/*
+ * Try to get status of a command running in another process.
+ *
+ * FUNCTION pg_cmdstatus(pid, request_type)
+ * RETURNS SETOF text
+ */
+Datum
+pg_cmdstatus(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	   *tupstore;
+	TupleDesc			tupdesc;
+	MemoryContext		per_query_ctx;
+	MemoryContext		oldcontext;
+	pid_t				target_pid = (pid_t) PG_GETARG_INT32(0);
+	int					request_type = PG_GETARG_INT32(1);
+	PGPROC			   *proc;
+	shm_toc_estimator	estimator;
+	Size				segsize;
+	dsm_segment		   *seg = NULL;
+	shm_toc			   *toc = NULL;
+	shm_mq			   *mq = NULL;
+	shm_mq_handle	   *input = NULL;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (request_type < 1 || request_type > CMD_STATUS_MAX_REQUEST)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("unknown command status request")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	if (target_pid == MyProcPid)
+		ereport(ERROR,
+				(errmsg("backend cannot query command status of itself")));
+
+	/* verify access to target_pid */
+	proc = BackendPidGetProc(target_pid);
+
+	if (proc == NULL)
+		ereport(ERROR,
+				(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+	if (!(superuser() || proc->roleId == GetUserId()))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser or have the same role to explain queries running in other server processes"))));
+
+	/* prepare shared dsm segment */
+	shm_toc_initialize_estimator(&estimator);
+	shm_toc_estimate_keys(&estimator, 1);
+	shm_toc_estimate_chunk(&estimator, CMD_STATUS_BUFFER_SIZE);
+	segsize = shm_toc_estimate(&estimator);
+
+	seg = dsm_create(segsize, 0);
+
+	toc = shm_toc_create(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg),
+						 segsize);
+
+	/* prepare basic structures */
+	mq = shm_mq_create(shm_toc_allocate(toc, CMD_STATUS_BUFFER_SIZE),
+					   CMD_STATUS_BUFFER_SIZE);
+
+	shm_mq_set_receiver(mq, MyProc);
+	shm_mq_set_sender(mq, proc);
+	shm_toc_insert(toc, 0, mq);
+
+	PG_TRY();
+	{
+		shm_mq_result	res;
+		char		   *data = NULL;
+		Size			length;
+
+		input = shm_mq_attach(mq, seg, NULL);
+
+		if (SendCmdStatusInfoSignal(target_pid, proc->backendId,
+									dsm_segment_handle(seg), request_type))
+		{
+			elog(ERROR, "could not signal backend with PID %d", target_pid);
+		}
+
+		/*
+		 * This blocks until the sender process writes a message to the queue
+		 * or detaches from it.  In either case we wake up and process the
+		 * result.
+		 */
+		res = shm_mq_receive(input, &length, (void **) &data, false);
+		if (res == SHM_MQ_SUCCESS)
+		{
+			const char	   *p = data;
+
+			/* break into tuples on newline */
+			while (*p)
+			{
+				const char *q = strchr(p, '\n');
+				Size		len = q ? q - p : strlen(p);
+				Datum		value;
+				HeapTuple	tuple;
+				bool		isnull = false;
+
+				value = PointerGetDatum(cstring_to_text_with_len(p, len));
+
+				tuple = heap_form_tuple(tupdesc, &value, &isnull);
+				tuplestore_puttuple(tupstore, tuple);
+
+				if (!q)
+					break;
+
+				p += len + 1;
+			}
+		}
+		else
+		{
+			report_result_code(MyCmdStatusSlot->result_code, target_pid);
+		}
+
+		/* clean up and return the tuplestore */
+		tuplestore_donestoring(tupstore);
+
+		InvalidateCmdStatusSlot();
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+	}
+	PG_CATCH();
+	{
+		InvalidateCmdStatusSlot();
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 23e594e..2269853 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -29,6 +29,7 @@ ProtocolVersion FrontendProtocol;
 volatile bool InterruptPending = false;
 volatile bool QueryCancelPending = false;
 volatile bool ProcDiePending = false;
+volatile bool CmdStatusInfoPending = false;
 volatile bool ClientConnectionLost = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7b19714..7e7ba77 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
+#include "utils/cmdstatus.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -588,6 +589,9 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	/* Now that we have a BackendId, we can participate in ProcSignal */
 	ProcSignalInit(MyBackendId);
 
+	/* Init CmdStatus slot */
+	CmdStatusInit(MyBackendId);
+
 	/*
 	 * Also set up timeout handlers needed for backend operation.  We need
 	 * these in every case except bootstrap.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..d083aaf 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3130,6 +3130,8 @@ DESCR("get OID of current session's temp schema, if any");
 DATA(insert OID = 2855 (  pg_is_other_temp_schema	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "26" _null_ _null_ _null_ _null_ _null_ pg_is_other_temp_schema _null_ _null_ _null_ ));
 DESCR("is schema another session's temp schema?");
 
+DATA(insert OID = 4099 ( pg_cmdstatus		PGNSP PGUID 12 1 100 0 0 f f f f f t s 2 0 25 "23 23" _null_ _null_  _null_  _null_ _null_ pg_cmdstatus _null_ _null_ _null_ ));
+DESCR("returns information about another process");
 DATA(insert OID = 2171 ( pg_cancel_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_cancel_backend _null_ _null_ _null_ ));
 DESCR("cancel a server process' current query");
 DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e0cc69f..5f03f63 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -80,6 +80,7 @@
 extern PGDLLIMPORT volatile bool InterruptPending;
 extern PGDLLIMPORT volatile bool QueryCancelPending;
 extern PGDLLIMPORT volatile bool ProcDiePending;
+extern PGDLLIMPORT volatile bool CmdStatusInfoPending;
 
 extern volatile bool ClientConnectionLost;
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index af1a0cd..cd856e5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -41,6 +41,8 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+	PROCSIG_CMD_STATUS_INFO,
+
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..605612e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1243,6 +1243,9 @@ extern Datum pg_identify_object_as_address(PG_FUNCTION_ARGS);
 /* catalog/objectaddress.c */
 extern Datum pg_get_object_address(PG_FUNCTION_ARGS);
 
+/* utils/adt/cmdstatus.c */
+extern Datum pg_cmdstatus(PG_FUNCTION_ARGS);
+
 /* commands/constraint.c */
 extern Datum unique_key_recheck(PG_FUNCTION_ARGS);
 
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
new file mode 100644
index 0000000..85627a57
--- /dev/null
+++ b/src/include/utils/cmdstatus.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.h
+ *	  Declarations for command status interrupt handling.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * src/include/utils/cmdstatus.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CMDSTATUS_H
+#define CMDSTATUS_H
+
+extern Size CmdStatusShmemSize(void);
+extern void CmdStatusShmemInit(void);
+extern void CmdStatusInit(int css_idx);
+
+extern void HandleCmdStatusInfoInterrupt(void);
+extern void ProcessCmdStatusInfoRequest(void);
+
+#endif   /* CMDSTATUS_H */
#48Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#46)
Re: On-demand running query plans using auto_explain and signals

On Mon, Sep 14, 2015 at 3:09 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <
tomas.vondra@2ndquadrant.com> wrote:

Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.

This requires a really bad timing and the user should be able to
interrupt the querying backend A still.

I think we can't rely on the low probability that this won't happen, and
we should not rely on people interrupting the backend. Being able to detect
the situation and fail gracefully should be possible.

It may be possible to introduce some lock-less protocol preventing such
situations, but it's not there at the moment. If you believe it's possible,
you need to explain and "prove" that it's actually safe.

Otherwise we may need to introduce some basic locking - for example we
may introduce a LWLock for each slot, and lock it with dontWait=true (and
skip it if we couldn't lock it). This should prevent most scenarios where
one corrupted slot blocks many processes.

OK, I will revisit this part then.

I have a radical proposal to remove the need for locking: make the
CmdStatusSlot struct consist of a mere dsm_handle and move all the required
metadata like sender_pid, request_type, etc. into the shared memory segment
itself.

If we allow the only the requesting process to update the slot (that is the
handle value itself) this removes the need for locking between sender and
receiver.

The sender will walk through the slots looking for a non-zero dsm handle
(according to dsm_create() implementation 0 is considered an invalid
handle), and if it finds a valid one, it will attach and look inside, to
check if it's destined for this process ID. At first that might sound
strange, but I would expect 99% of the time that the only valid slot would
be for the process that has been just signaled.

The sender process will then calculate the response message, update the
result_code in the shared memory segment and finally send the message
through the queue. If the receiver has since detached we get a detached
result code and bail out.

Clearing the slot after receiving the message should be the requesting
process' responsibility. This way the receiver only writes to the slot and
the sender only reads from it.

By the way, is it safe to assume atomic read/writes of dsm_handle
(uint32)? I would be surprised if not.

--
Alex

#49Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#48)
Re: On-demand running query plans using auto_explain and signals

2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Mon, Sep 14, 2015 at 3:09 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Mon, Sep 14, 2015 at 2:11 PM, Tomas Vondra <
tomas.vondra@2ndquadrant.com> wrote:

Now the backend that has been signaled on the second call to
pg_cmdstatus (it can be either some other backend, or the backend B
again) will not find an unprocessed slot, thus it will not try to
attach/detach the queue and the backend A will block forever.

This requires a really bad timing and the user should be able to
interrupt the querying backend A still.

I think we can't rely on the low probability that this won't happen, and
we should not rely on people interrupting the backend. Being able to detect
the situation and fail gracefully should be possible.

It may be possible to introduce some lock-less protocol preventing such
situations, but it's not there at the moment. If you believe it's possible,
you need to explain and "prove" that it's actually safe.

Otherwise we may need to introduce some basic locking - for example we
may introduce a LWLock for each slot, and lock it with dontWait=true (and
skip it if we couldn't lock it). This should prevent most scenarios where
one corrupted slot blocks many processes.

OK, I will revisit this part then.

I have a radical proposal to remove the need for locking: make the
CmdStatusSlot struct consist of a mere dsm_handle and move all the required
metadata like sender_pid, request_type, etc. into the shared memory segment
itself.

If we allow the only the requesting process to update the slot (that is
the handle value itself) this removes the need for locking between sender
and receiver.

The sender will walk through the slots looking for a non-zero dsm handle
(according to dsm_create() implementation 0 is considered an invalid
handle), and if it finds a valid one, it will attach and look inside, to
check if it's destined for this process ID. At first that might sound
strange, but I would expect 99% of the time that the only valid slot would
be for the process that has been just signaled.

The sender process will then calculate the response message, update the
result_code in the shared memory segment and finally send the message
through the queue. If the receiver has since detached we get a detached
result code and bail out.

Clearing the slot after receiving the message should be the requesting
process' responsibility. This way the receiver only writes to the slot and
the sender only reads from it.

By the way, is it safe to assume atomic read/writes of dsm_handle
(uint32)? I would be surprised if not.

I don't see any reason why it should not to work - only few processes will
wait for data - so lost attach/detach shm operations will not be too much.

Pavel

Show quoted text

--
Alex

#50Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#49)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Mon, Sep 14, 2015 at 7:27 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

I have a radical proposal to remove the need for locking: make the
CmdStatusSlot struct consist of a mere dsm_handle and move all the required
metadata like sender_pid, request_type, etc. into the shared memory segment
itself.

If we allow the only the requesting process to update the slot (that is
the handle value itself) this removes the need for locking between sender
and receiver.

The sender will walk through the slots looking for a non-zero dsm handle
(according to dsm_create() implementation 0 is considered an invalid
handle), and if it finds a valid one, it will attach and look inside, to
check if it's destined for this process ID. At first that might sound
strange, but I would expect 99% of the time that the only valid slot would
be for the process that has been just signaled.

The sender process will then calculate the response message, update the
result_code in the shared memory segment and finally send the message
through the queue. If the receiver has since detached we get a detached
result code and bail out.

Clearing the slot after receiving the message should be the requesting
process' responsibility. This way the receiver only writes to the slot and
the sender only reads from it.

By the way, is it safe to assume atomic read/writes of dsm_handle
(uint32)? I would be surprised if not.

I don't see any reason why it should not to work - only few processes will
wait for data - so lost attach/detach shm operations will not be too much.

Please see attached for implementation of this approach. The most
surprising thing is that it actually works :)

One problem still remains with the process requesting information when the
target process exits before it can have a chance to handle the signal. The
requesting process then waits forever, because nobody attaches/detaches the
queue. We've discussed this before and looks like we need to introduce a
timeout parameter to pg_cmdstatus()...

--
Alex

Attachments:

explain-pid-v5.patchtext/x-patch; charset=US-ASCII; name=explain-pid-v5.patchDownload
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..2e3beaf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -43,6 +43,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "utils/cmdstatus.h"
 
 
 shmem_startup_hook_type shmem_startup_hook = NULL;
@@ -139,6 +140,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, CmdStatusShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
+	CmdStatusShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..e637be1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMD_STATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d917af3..1a5b03c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoPending)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 0000000..6cb04f4
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,675 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/procsignal.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_DATA
+} CmdStatusInfoResultCode;
+
+/*
+ * We allocate an array the size of NumCmdStatusSlots of DSM handles in the
+ * shared memory.  This slots array is indexed by backend ID, so each slot can
+ * corresponds to at max. one backend at any given time.
+ *
+ * In order to actually query some backend for its status, the interested
+ * process will create a DSM segment, initialize a table of contents in it,
+ * fill the target process id and request type and create a shared memory
+ * queue in the segment.
+ *
+ * The interested process will then put the such initialized DSM segment's
+ * handle in its own slot and send the SIGUSR1 with the reason
+ * PROCSIG_CMD_STATUS_INFO to the process being queried.  The interested
+ * process will wait for the response to be delivered via the shared memory
+ * queue and the result code via a field in the DSM segment.
+ *
+ * The layout of a DSM segment is given by the following TOC keys enumeration:
+ */
+typedef enum {
+	CmdStatusSlotKeyTargetProcessID,
+	CmdStatusSlotKeyIsProcessedFlag,
+	CmdStatusSlotKeyRequestType,
+	CmdStatusSlotKeyResultCode,
+	CmdStatusSlotKeyMemoryQueue,
+
+	CmdStatusTotalSlotKeys
+} CmdStatusSlotKey;
+
+#define NumCmdStatusSlots	MaxBackends
+
+#define PG_CMD_STATUS_INFO_MAGIC	0x79fb2449
+#define CMD_STATUS_BUFFER_SIZE		1024
+
+/*
+ * These structs are allocated on the program stack as local variables in the
+ * ExecutorRun hook.  The top of stack is current_query_stack, see below.
+ *
+ * Capturing the execution stack allows us to inspect the inner-most running
+ * query as well as showing the complete backtrace.
+ */
+typedef struct CmdInfoStack {
+	QueryDesc			   *query_desc;
+	struct CmdInfoStack	   *parent;
+} CmdInfoStack;
+
+/* The array of slots pre-allocated on shared memory. */
+static volatile dsm_handle *CmdStatusSlots = NULL;
+
+/* Pointer into the slots array, to a slot assigned for this backend. */
+static volatile dsm_handle *MyCmdStatusSlot = NULL;
+
+static CmdInfoStack	   *current_query_stack = NULL;
+static int				query_stack_size = 0;  /* XXX not really used */
+
+static ExecutorRun_hook_type	prev_ExecutorRun = NULL;
+
+static void cmdstatus_ExecutorRun(QueryDesc *queryDesc,
+								  ScanDirection direction, long count);
+
+static void CleanupCmdStatusSlot(int status, Datum arg);
+static StringInfo ProcessCmdStatusSlot(CmdStatusInfoRequestType request_type,
+					 CmdStatusInfoResultCode *result_code);
+
+static StringInfo explain_query(QueryDesc *queryDesc);
+static void report_result_code(CmdStatusInfoResultCode result, pid_t target_pid);
+static void RestoreResourceOwner(ResourceOwner prevResourceOwner);
+
+
+Size
+CmdStatusShmemSize(void)
+{
+	return NumCmdStatusSlots * sizeof(dsm_handle);
+}
+
+void
+CmdStatusShmemInit(void)
+{
+	Size		size = CmdStatusShmemSize();
+	bool		found;
+
+	CmdStatusSlots = (dsm_handle *) ShmemInitStruct("CmdStatusSlots", size, &found);
+
+	if (!found)
+		MemSet(CmdStatusSlots, 0, size);
+}
+
+static void
+CleanupCmdStatusSlot(int status, Datum arg)
+{
+	Assert(MyCmdStatusSlot != NULL);
+
+	*MyCmdStatusSlot = 0;	/* just in case */
+
+	MyCmdStatusSlot = NULL;
+}
+
+/*
+ * CmdStatusInit
+ *		Initialize MyCmdStatusSlot
+ *
+ * The passed index should be MyBackendId.
+ */
+void
+CmdStatusInit(int css_idx)
+{
+	MyCmdStatusSlot = &CmdStatusSlots[css_idx - 1];
+
+	on_shmem_exit(CleanupCmdStatusSlot, (Datum) 0);
+
+	/* also install executor hooks */
+	prev_ExecutorRun = ExecutorRun_hook;
+	ExecutorRun_hook = cmdstatus_ExecutorRun;
+}
+
+/*
+ * The executor hook.
+ *
+ * Accumulates the query descriptors on the program stack and takes care of
+ * popping the current frame when leaving the function ab-/normally.
+ */
+static void
+cmdstatus_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+{
+	CmdInfoStack	current;
+
+	current.query_desc = queryDesc;
+	current.parent = current_query_stack;
+
+	current_query_stack = &current;
+	query_stack_size++;
+
+	PG_TRY();
+	{
+		if (prev_ExecutorRun)
+			prev_ExecutorRun(queryDesc, direction, count);
+		else
+			standard_ExecutorRun(queryDesc, direction, count);
+
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+	}
+	PG_CATCH();
+	{
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+/* Produce an explain plan of a single query descriptor. */
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	StringInfo		str;
+	ExplainState   *es;
+
+	es = NewExplainState();
+	es->analyze = false;
+	es->verbose = false;
+	es->buffers = false;
+	es->format = EXPLAIN_FORMAT_TEXT;
+
+	ExplainBeginOutput(es);
+	/* XXX: appendStringInfo(es->str, "#%d ", depth); ? */
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	ExplainEndOutput(es);
+
+	str = es->str;
+
+	pfree(es);
+	return str;
+}
+
+
+/* signal handler for PROCSIG_CMD_STATUS_INFO */
+void
+HandleCmdStatusInfoInterrupt(void)
+{
+	CmdStatusInfoPending = true;
+	InterruptPending = true;
+
+	SetLatch(MyLatch);
+}
+
+/* Produce a response and set the result code for a single slot. */
+static StringInfo
+ProcessCmdStatusSlot(CmdStatusInfoRequestType request_type,
+					 CmdStatusInfoResultCode *result_code)
+{
+	StringInfo		result = NULL;
+
+	/* Show some optimism, overwrite with error code later if needed. */
+	*result_code = CMD_STATUS_RESULT_SUCCESS;
+
+	if (ActivePortal)
+	{
+		switch (request_type)
+		{
+			case CMD_STATUS_REQUEST_EXPLAIN:
+				if (ActivePortal->queryDesc != NULL)
+					result = explain_query(ActivePortal->queryDesc);
+				else if (current_query_stack != NULL)
+					result = explain_query(current_query_stack->query_desc);
+				else
+					*result_code = CMD_STATUS_RESULT_NO_DATA;
+				break;
+
+			case CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE:
+				{
+					StringInfo		str;
+					CmdInfoStack   *query;
+
+					result = makeStringInfo();
+
+					if (current_query_stack != NULL)
+					{
+						for (query = current_query_stack;
+							 query != NULL;
+							 query = query->parent)
+						{
+							str = explain_query(query->query_desc);
+							appendBinaryStringInfo(result, str->data, str->len);
+
+							pfree(str->data);
+							pfree(str);
+						}
+					}
+					else
+						*result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+					break;
+				}
+
+			case CMD_STATUS_REQUEST_QUERY_TEXT:
+				result = makeStringInfo();
+				appendStringInfoString(result, ActivePortal->sourceText);
+				break;
+
+			case CMD_STATUS_REQUEST_PROGRESS_TAG:
+				if (ActivePortal->commandTag != NULL)
+				{
+					result = makeStringInfo();
+
+					if (ActivePortal->queryDesc != NULL &&
+						ActivePortal->queryDesc->estate != NULL)
+					{
+						appendStringInfo(result, "%s %u",
+										 ActivePortal->commandTag,
+										 ActivePortal->queryDesc->estate->es_processed);
+					}
+					else
+					{
+						/* no progress available, at least show the command tag */
+						appendStringInfoString(result, ActivePortal->commandTag);
+					}
+				}
+				else
+					*result_code = CMD_STATUS_RESULT_NO_DATA;
+				break;
+		}
+	}
+	else
+		*result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+
+	return result;
+}
+
+static void
+RestoreResourceOwner(ResourceOwner prevResourceOwner)
+{
+	if (prevResourceOwner != CurrentResourceOwner)
+	{
+		ResourceOwner	res_owner = CurrentResourceOwner;
+
+		CurrentResourceOwner = prevResourceOwner;
+		ResourceOwnerDelete(res_owner);
+	}
+}
+
+/*
+ * Process all command status requests from other backends by scanning the
+ * whole array of slots looking for the requests targeted at us, which were
+ * not already processed by us during earlier rounds.
+ *
+ * Most of the times it is expected that the first valid slot we find is for
+ * us, so there's no penalty for actually attaching it and looking inside.
+ */
+void
+ProcessCmdStatusInfoRequest(void)
+{
+	ResourceOwner	prevResourceOwner = CurrentResourceOwner;
+
+	/* Ensure valid resource owner for access to dsm. */
+	if (CurrentResourceOwner == NULL)
+		CurrentResourceOwner = ResourceOwnerCreate(NULL, "ProcessCmdStatusInfoRequest");
+
+	/*
+	 * We might have been signaled again by another process while handling a
+	 * request, so re-check after going through all the slots.
+	 */
+	while (CmdStatusInfoPending)
+	{
+		int		i;
+
+		CmdStatusInfoPending = false;
+
+		for (i = 0; i < NumCmdStatusSlots; i++)
+		{
+			dsm_handle slot = CmdStatusSlots[i];
+
+			if (slot != 0)
+			{
+				dsm_segment	   *seg = NULL;
+				shm_toc		   *toc = NULL;
+				pid_t		   *target_pid_ptr;
+				bool		   *is_processed_flag_ptr;
+				CmdStatusInfoRequestType   *request_type_ptr;
+				CmdStatusInfoResultCode	   *result_code_ptr;
+				shm_mq		   *mq;
+				shm_mq_handle  *output;
+
+				seg = dsm_attach(slot);
+				if (seg == NULL)
+				{
+					elog(LOG, "unable to map dynamic memory segment for command status");
+					continue;
+				}
+
+				toc = shm_toc_attach(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg));
+				if (toc == NULL)
+				{
+					elog(LOG, "bad magic in dynamic memory segment for command status");
+
+					dsm_detach(seg);
+					continue;
+				}
+
+				target_pid_ptr = (pid_t *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyTargetProcessID);
+				is_processed_flag_ptr = (bool *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyIsProcessedFlag);
+
+				if (*target_pid_ptr != MyProcPid || *is_processed_flag_ptr)
+				{
+					dsm_detach(seg);
+					continue;
+				}
+
+				/* Set the flag early to avoid re-processing during next round. */
+				*is_processed_flag_ptr = true;
+
+				request_type_ptr = (CmdStatusInfoRequestType *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyRequestType);
+
+				result_code_ptr = (CmdStatusInfoResultCode *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyResultCode);
+
+				mq = shm_toc_lookup(toc, CmdStatusSlotKeyMemoryQueue);
+				output = shm_mq_attach(mq, seg, NULL);
+
+				PG_TRY();
+				{
+					StringInfo		payload;
+					int				res;
+
+					payload = ProcessCmdStatusSlot(*request_type_ptr, result_code_ptr);
+					if (payload != NULL)
+					{
+						/*
+						 * Send the data, including null terminator.
+						 *
+						 * This blocks until the receiving side completes
+						 * reading or detaches from the queue.
+						 */
+						res = shm_mq_send(output, payload->len + 1, payload->data, false);
+						if (res != SHM_MQ_SUCCESS)
+							elog(LOG, "cannot send command status via shared memory queue");
+
+						pfree(payload->data);
+						pfree(payload);
+					}
+					/*
+					 * If there was no payload to send, we just detach from
+					 * the queue and the receiver should wake up.
+					 */
+
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+				}
+				PG_CATCH();
+				{
+					/*
+					 * Detaching from the queue also wakes up the receiver, be
+					 * sure to update the slot status before detaching.
+					 */
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					RestoreResourceOwner(prevResourceOwner);	/* XXX remove if not re-throwing */
+
+					PG_RE_THROW();	/* XXX is it better to log/suppress the error here? */
+				}
+				PG_END_TRY();
+			}
+		}
+	}
+
+	RestoreResourceOwner(prevResourceOwner);
+}
+
+static void
+report_result_code(CmdStatusInfoResultCode result, pid_t target_pid)
+{
+	switch (result)
+	{
+		case CMD_STATUS_RESULT_BACKEND_IDLE:
+			elog(INFO, "no command is currently running in backend with PID %d",
+				 target_pid);
+			break;
+
+		case CMD_STATUS_RESULT_NO_DATA:
+			elog(WARNING, "no suitable data found for the query in backend with PID %d",
+				 target_pid);
+			break;
+
+		default:
+			elog(ERROR, "general command status request failure when querying backend with PID %d",
+				 target_pid);
+			break;
+	}
+}
+
+/*
+ * Try to get status of a command running in another process.
+ *
+ * FUNCTION pg_cmdstatus(pid, request_type)
+ * RETURNS SETOF text
+ */
+Datum
+pg_cmdstatus(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	   *tupstore;
+	TupleDesc			tupdesc;
+	MemoryContext		per_query_ctx;
+	MemoryContext		oldcontext;
+	pid_t				target_pid = (pid_t) PG_GETARG_INT32(0);
+	int					request_type = PG_GETARG_INT32(1);
+	PGPROC			   *target_proc;
+	shm_toc_estimator	estimator;
+	Size				segsize;
+	dsm_segment		   *seg;
+	shm_toc			   *toc;
+	pid_t			   *target_pid_ptr;
+	bool			   *is_processed_flag_ptr;
+	CmdStatusInfoRequestType   *request_type_ptr;
+	CmdStatusInfoResultCode	   *result_code_ptr;
+	shm_mq			   *mq;
+	shm_mq_handle	   *input;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (request_type < 1 || request_type > CMD_STATUS_MAX_REQUEST)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("unknown command status request")));
+
+	if (target_pid == MyProcPid)
+		ereport(ERROR,
+				(errmsg("backend cannot query command status of itself")));
+
+	/* verify access to target_pid */
+	target_proc = BackendPidGetProc(target_pid);
+
+	if (target_proc == NULL)
+		ereport(ERROR,
+				(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+	if (!(superuser() || target_proc->roleId == GetUserId()))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser or have the same role to explain queries running in other server processes"))));
+
+	if (*MyCmdStatusSlot != 0)
+		ereport(ERROR,
+				(errmsg("pg_cmdstatus() call is already in progress")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	/* prepare shared dsm segment */
+	shm_toc_initialize_estimator(&estimator);
+	shm_toc_estimate_keys(&estimator, CmdStatusTotalSlotKeys);
+	shm_toc_estimate_chunk(&estimator, sizeof(pid_t));
+	shm_toc_estimate_chunk(&estimator, sizeof(bool));
+	shm_toc_estimate_chunk(&estimator, sizeof(CmdStatusInfoRequestType));
+	shm_toc_estimate_chunk(&estimator, sizeof(CmdStatusInfoResultCode));
+	shm_toc_estimate_chunk(&estimator, CMD_STATUS_BUFFER_SIZE);
+	segsize = shm_toc_estimate(&estimator);
+
+	seg = dsm_create(segsize, 0);
+
+	toc = shm_toc_create(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg), segsize);
+
+	target_pid_ptr = (pid_t *) shm_toc_allocate(toc, sizeof(pid_t));
+	*target_pid_ptr = target_pid;
+	shm_toc_insert(toc, CmdStatusSlotKeyTargetProcessID, target_pid_ptr);
+
+	is_processed_flag_ptr = (bool *) shm_toc_allocate(toc, sizeof(bool));
+	*is_processed_flag_ptr = false;
+	shm_toc_insert(toc, CmdStatusSlotKeyIsProcessedFlag, is_processed_flag_ptr);
+
+	request_type_ptr = (CmdStatusInfoRequestType *)
+		shm_toc_allocate(toc, sizeof(CmdStatusInfoRequestType));
+	*request_type_ptr = request_type;
+	shm_toc_insert(toc, CmdStatusSlotKeyRequestType, request_type_ptr);
+
+	result_code_ptr = (CmdStatusInfoResultCode *)
+		shm_toc_allocate(toc, sizeof(CmdStatusInfoResultCode));
+	*result_code_ptr = CMD_STATUS_RESULT_FAILURE;
+	shm_toc_insert(toc, CmdStatusSlotKeyResultCode, result_code_ptr);
+
+	mq = shm_mq_create(shm_toc_allocate(toc, CMD_STATUS_BUFFER_SIZE),
+					   CMD_STATUS_BUFFER_SIZE);
+	shm_toc_insert(toc, CmdStatusSlotKeyMemoryQueue, mq);
+
+	shm_mq_set_receiver(mq, MyProc);
+	shm_mq_set_sender(mq, target_proc);
+
+	PG_TRY();
+	{
+		shm_mq_result	res;
+		char		   *data = NULL;
+		Size			length;
+
+		input = shm_mq_attach(mq, seg, NULL);
+
+		*MyCmdStatusSlot = dsm_segment_handle(seg);
+
+		if (SendProcSignal(target_pid, PROCSIG_CMD_STATUS_INFO, target_proc->backendId))
+			elog(ERROR, "could not signal backend with PID %d", target_pid);
+
+		/*
+		 * This blocks until the sender process writes a message to the queue
+		 * or detaches from it.  In either case we wake up and process the
+		 * result.
+		 */
+		res = shm_mq_receive(input, &length, (void **) &data, false);
+		if (res == SHM_MQ_SUCCESS)
+		{
+			const char	   *p = data;
+
+			/* break into tuples on newline */
+			while (*p)
+			{
+				const char *q = strchr(p, '\n');
+				Size		len = q ? q - p : strlen(p);
+				Datum		value;
+				HeapTuple	tuple;
+				bool		isnull = false;
+
+				value = PointerGetDatum(cstring_to_text_with_len(p, len));
+
+				tuple = heap_form_tuple(tupdesc, &value, &isnull);
+				tuplestore_puttuple(tupstore, tuple);
+
+				if (!q)
+					break;
+
+				p += len + 1;
+			}
+		}
+		else
+		{
+			report_result_code(*result_code_ptr, target_pid);
+		}
+
+		/* clean up and return the tuplestore */
+		tuplestore_donestoring(tupstore);
+
+		*MyCmdStatusSlot = 0;
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+	}
+	PG_CATCH();
+	{
+		*MyCmdStatusSlot = 0;
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 23e594e..2269853 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -29,6 +29,7 @@ ProtocolVersion FrontendProtocol;
 volatile bool InterruptPending = false;
 volatile bool QueryCancelPending = false;
 volatile bool ProcDiePending = false;
+volatile bool CmdStatusInfoPending = false;
 volatile bool ClientConnectionLost = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7b19714..7e7ba77 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
+#include "utils/cmdstatus.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -588,6 +589,9 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	/* Now that we have a BackendId, we can participate in ProcSignal */
 	ProcSignalInit(MyBackendId);
 
+	/* Init CmdStatus slot */
+	CmdStatusInit(MyBackendId);
+
 	/*
 	 * Also set up timeout handlers needed for backend operation.  We need
 	 * these in every case except bootstrap.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..d083aaf 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3130,6 +3130,8 @@ DESCR("get OID of current session's temp schema, if any");
 DATA(insert OID = 2855 (  pg_is_other_temp_schema	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "26" _null_ _null_ _null_ _null_ _null_ pg_is_other_temp_schema _null_ _null_ _null_ ));
 DESCR("is schema another session's temp schema?");
 
+DATA(insert OID = 4099 ( pg_cmdstatus		PGNSP PGUID 12 1 100 0 0 f f f f f t s 2 0 25 "23 23" _null_ _null_  _null_  _null_ _null_ pg_cmdstatus _null_ _null_ _null_ ));
+DESCR("returns information about another process");
 DATA(insert OID = 2171 ( pg_cancel_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_cancel_backend _null_ _null_ _null_ ));
 DESCR("cancel a server process' current query");
 DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e0cc69f..5f03f63 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -80,6 +80,7 @@
 extern PGDLLIMPORT volatile bool InterruptPending;
 extern PGDLLIMPORT volatile bool QueryCancelPending;
 extern PGDLLIMPORT volatile bool ProcDiePending;
+extern PGDLLIMPORT volatile bool CmdStatusInfoPending;
 
 extern volatile bool ClientConnectionLost;
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index af1a0cd..cd856e5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -41,6 +41,8 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+	PROCSIG_CMD_STATUS_INFO,
+
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..605612e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1243,6 +1243,9 @@ extern Datum pg_identify_object_as_address(PG_FUNCTION_ARGS);
 /* catalog/objectaddress.c */
 extern Datum pg_get_object_address(PG_FUNCTION_ARGS);
 
+/* utils/adt/cmdstatus.c */
+extern Datum pg_cmdstatus(PG_FUNCTION_ARGS);
+
 /* commands/constraint.c */
 extern Datum unique_key_recheck(PG_FUNCTION_ARGS);
 
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
new file mode 100644
index 0000000..85627a57
--- /dev/null
+++ b/src/include/utils/cmdstatus.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.h
+ *	  Declarations for command status interrupt handling.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * src/include/utils/cmdstatus.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CMDSTATUS_H
+#define CMDSTATUS_H
+
+extern Size CmdStatusShmemSize(void);
+extern void CmdStatusShmemInit(void);
+extern void CmdStatusInit(int css_idx);
+
+extern void HandleCmdStatusInfoInterrupt(void);
+extern void ProcessCmdStatusInfoRequest(void);
+
+#endif   /* CMDSTATUS_H */
#51Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#50)
2 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 15, 2015 at 11:00 AM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Mon, Sep 14, 2015 at 7:27 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

... This way the receiver only writes to the slot and the sender only
reads from it.

By the way, is it safe to assume atomic read/writes of dsm_handle
(uint32)? I would be surprised if not.

I don't see any reason why it should not to work - only few processes
will wait for data - so lost attach/detach shm operations will not be too
much.

Please see attached for implementation of this approach. The most
surprising thing is that it actually works :)

One problem still remains with the process requesting information when the
target process exits before it can have a chance to handle the signal. The
requesting process then waits forever, because nobody attaches/detaches the
queue. We've discussed this before and looks like we need to introduce a
timeout parameter to pg_cmdstatus()...

I've added the timeout parameter to the pg_cmdstatus call, and more
importantly to the sending side of the queue, so that one can limit the
potential effect of handling the interrupt in case something goes really
wrong.

I've tested a number of possible scenarios with artificial delays in reply
and sending cancel request or SIGTERM to the receiving side and this is all
seems to work nicely due to proper message queue detach event
notification. Still I think it helps to know that there is a timeout in
case the receiving side is really slow to read the message.

I've also decided we really ought to suppress any possible ERROR level
messages generated during the course of processing the status request in
order not to prevent the originally running transaction to complete. The
errors so caught are just logged using LOG level and ignored in this new
version of the patch.

I'm also submitting the instrumentation support as a separate patch on top
of this. I'm not really fond of adding bool parameter to InstrEndLoop, but
for now I didn't find any better way.

What I'm now thinking about is probably we can extend this backend
communication mechanism in order to query GUC values effective in a
different backend or even setting the values. The obvious candidate to
check when you see some query misbehaving would be work_mem, for example.
Or we could provide a report of all settings that were overridden in the
backend's session, to the effect of running something like this:

select * from pg_settings where context = 'user' and setting != reset_val;

The obvious candidates to be set externally are the
cmdstatus_analyze/instrument_*: when you decided you want to turn them on,
you'd rather do that carefully for a single backend than globally
per-cluster. One can still modify the postgresql.conf and then send SIGHUP
to just a single backend, but I think a more direct way to alter the
settings would be better.

In this light should we rename the API to something like "backend control"
instead of "command status"? The SHOW/SET syntax could be extended to
support the remote setting retrieval/update.

--
Alex

Attachments:

0002-explain-instrument-pid-v6.patchtext/x-patch; charset=US-ASCII; name=0002-explain-instrument-pid-v6.patchDownload
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 2a184ed..d2f46bb 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -288,7 +288,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
 		 * levels of hook all do this.)
 		 */
-		InstrEndLoop(queryDesc->totaltime);
+		InstrEndLoop(queryDesc->totaltime, false);
 
 		/* Log plan if duration is exceeded. */
 		msec = queryDesc->totaltime->total * 1000.0;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..d90fc6d 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -926,7 +926,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
 		 * levels of hook all do this.)
 		 */
-		InstrEndLoop(queryDesc->totaltime);
+		InstrEndLoop(queryDesc->totaltime, false);
 
 		pgss_store(queryDesc->sourceText,
 				   queryId,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d06fa4..e5a28a9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -654,7 +654,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
 		char	   *conname = NULL;
 
 		/* Must clean up instrumentation state */
-		InstrEndLoop(instr);
+		InstrEndLoop(instr, false);
 
 		/*
 		 * We ignore triggers that were never invoked; they likely aren't
@@ -1233,62 +1233,75 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 	}
 
-	/*
-	 * We have to forcibly clean up the instrumentation state because we
-	 * haven't done ExecutorEnd yet.  This is pretty grotty ...
-	 *
-	 * Note: contrib/auto_explain could cause instrumentation to be set up
-	 * even though we didn't ask for it here.  Be careful not to print any
-	 * instrumentation results the user didn't ask for.  But we do the
-	 * InstrEndLoop call anyway, if possible, to reduce the number of cases
-	 * auto_explain has to contend with.
-	 */
 	if (planstate->instrument)
-		InstrEndLoop(planstate->instrument);
-
-	if (es->analyze &&
-		planstate->instrument && planstate->instrument->nloops > 0)
 	{
-		double		nloops = planstate->instrument->nloops;
-		double		startup_sec = 1000.0 * planstate->instrument->startup / nloops;
-		double		total_sec = 1000.0 * planstate->instrument->total / nloops;
-		double		rows = planstate->instrument->ntuples / nloops;
+		/*
+		 * We have to forcibly clean up the instrumentation state because we
+		 * haven't done ExecutorEnd yet.  This is pretty grotty ...
+		 *
+		 * Note: contrib/auto_explain could cause instrumentation to be set up
+		 * even though we didn't ask for it here.  Be careful not to print any
+		 * instrumentation results the user didn't ask for.  But we do the
+		 * InstrEndLoop call anyway, if possible, to reduce the number of cases
+		 * auto_explain has to contend with.
+		 */
+		InstrEndLoop(planstate->instrument, es->running);
 
-		if (es->format == EXPLAIN_FORMAT_TEXT)
-		{
-			if (es->timing)
-				appendStringInfo(es->str,
-							" (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
-								 startup_sec, total_sec, rows, nloops);
-			else
-				appendStringInfo(es->str,
-								 " (actual rows=%.0f loops=%.0f)",
-								 rows, nloops);
-		}
-		else
+		if (es->analyze)
 		{
-			if (es->timing)
+			double		nloops = planstate->instrument->nloops;
+
+			if (es->running && planstate->instrument->running)
+				nloops += 1; /* so we can display something */
+
+			if (nloops > 0)
 			{
-				ExplainPropertyFloat("Actual Startup Time", startup_sec, 3, es);
-				ExplainPropertyFloat("Actual Total Time", total_sec, 3, es);
+				double		startup_sec = 1000.0 * planstate->instrument->startup / nloops;
+				double		total_sec = 1000.0 * planstate->instrument->total / nloops;
+				double		rows = planstate->instrument->ntuples / nloops;
+				double		percent_done = 100.0 * rows / plan->plan_rows;
+
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+				{
+					if (es->timing)
+						appendStringInfo(es->str,
+										 " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
+										 startup_sec, total_sec, rows,
+										 planstate->instrument->nloops);
+					else
+						appendStringInfo(es->str,
+										 " (actual rows=%.0f loops=%.0f)",
+										 rows, planstate->instrument->nloops);
+
+					if (es->running)
+						appendStringInfo(es->str, " %.1f%%", percent_done);
+				}
+				else
+				{
+					if (es->timing)
+					{
+						ExplainPropertyFloat("Actual Startup Time", startup_sec, 3, es);
+						ExplainPropertyFloat("Actual Total Time", total_sec, 3, es);
+					}
+					ExplainPropertyFloat("Actual Rows", rows, 0, es);
+					ExplainPropertyFloat("Actual Loops", planstate->instrument->nloops, 0, es);
+				}
 			}
-			ExplainPropertyFloat("Actual Rows", rows, 0, es);
-			ExplainPropertyFloat("Actual Loops", nloops, 0, es);
-		}
-	}
-	else if (es->analyze)
-	{
-		if (es->format == EXPLAIN_FORMAT_TEXT)
-			appendStringInfoString(es->str, " (never executed)");
-		else
-		{
-			if (es->timing)
+			else
 			{
-				ExplainPropertyFloat("Actual Startup Time", 0.0, 3, es);
-				ExplainPropertyFloat("Actual Total Time", 0.0, 3, es);
+				if (es->format == EXPLAIN_FORMAT_TEXT)
+					appendStringInfoString(es->str, " (never executed)");
+				else
+				{
+					if (es->timing)
+					{
+						ExplainPropertyFloat("Actual Startup Time", 0.0, 3, es);
+						ExplainPropertyFloat("Actual Total Time", 0.0, 3, es);
+					}
+					ExplainPropertyFloat("Actual Rows", 0.0, 0, es);
+					ExplainPropertyFloat("Actual Loops", 0.0, 0, es);
+				}
 			}
-			ExplainPropertyFloat("Actual Rows", 0.0, 0, es);
-			ExplainPropertyFloat("Actual Loops", 0.0, 0, es);
 		}
 	}
 
@@ -2650,7 +2663,7 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
 			double		insert_path;
 			double		other_path;
 
-			InstrEndLoop(mtstate->mt_plans[0]->instrument);
+			InstrEndLoop(mtstate->mt_plans[0]->instrument, es->running);
 
 			/* count the number of source rows */
 			total = mtstate->mt_plans[0]->instrument->ntuples;
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 93e1e9a..33162f2 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -72,7 +72,7 @@ ExecReScan(PlanState *node)
 {
 	/* If collecting timing stats, update them */
 	if (node->instrument)
-		InstrEndLoop(node->instrument);
+		InstrEndLoop(node->instrument, false);
 
 	/*
 	 * If we have changed parameters, propagate that info.
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f5351eb..132153e 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -100,7 +100,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 
 /* Finish a run cycle for a plan node */
 void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(Instrumentation *instr, bool partial_run)
 {
 	double		totaltime;
 
@@ -108,7 +108,7 @@ InstrEndLoop(Instrumentation *instr)
 	if (!instr->running)
 		return;
 
-	if (!INSTR_TIME_IS_ZERO(instr->starttime))
+	if (!INSTR_TIME_IS_ZERO(instr->starttime) && !partial_run)
 		elog(ERROR, "InstrEndLoop called on running node");
 
 	/* Accumulate per-cycle statistics into totals */
@@ -117,11 +117,15 @@ InstrEndLoop(Instrumentation *instr)
 	instr->startup += instr->firsttuple;
 	instr->total += totaltime;
 	instr->ntuples += instr->tuplecount;
-	instr->nloops += 1;
 
 	/* Reset for next cycle (if any) */
-	instr->running = false;
-	INSTR_TIME_SET_ZERO(instr->starttime);
+	if (!partial_run)
+	{
+		instr->nloops += 1;
+
+		instr->running = false;
+		INSTR_TIME_SET_ZERO(instr->starttime);
+	}
 	INSTR_TIME_SET_ZERO(instr->counter);
 	instr->firsttuple = 0;
 	instr->tuplecount = 0;
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
index 5078100..b969907 100644
--- a/src/backend/utils/adt/cmdstatus.c
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -95,7 +95,12 @@ typedef struct CmdInfoStack {
 } CmdInfoStack;
 
 
+/* GUCs */
 int cmdstatus_interrupt_timeout = CMD_STATUS_DEFAULT_INTERRUPT_TIMEOUT;
+bool cmdstatus_analyze = false;
+bool cmdstatus_instrument_timer = false;
+bool cmdstatus_instrument_rows = false;
+bool cmdstatus_instrument_buffers = false;
 
 /* The array of slots pre-allocated on shared memory. */
 static volatile dsm_handle *CmdStatusSlots = NULL;
@@ -106,8 +111,10 @@ static volatile dsm_handle *MyCmdStatusSlot = NULL;
 static CmdInfoStack	   *current_query_stack = NULL;
 static int				query_stack_size = 0;  /* XXX not really used */
 
+static ExecutorStart_hook_type	prev_ExecutorStart = NULL;
 static ExecutorRun_hook_type	prev_ExecutorRun = NULL;
 
+static void cmdstatus_ExecutorStart(QueryDesc *queryDesc, int eflags);
 static void cmdstatus_ExecutorRun(QueryDesc *queryDesc,
 								  ScanDirection direction, long count);
 
@@ -162,12 +169,42 @@ CmdStatusInit(int css_idx)
 	on_shmem_exit(CleanupCmdStatusSlot, (Datum) 0);
 
 	/* also install executor hooks */
+	prev_ExecutorStart = ExecutorStart_hook;
+	ExecutorStart_hook = cmdstatus_ExecutorStart;
+
 	prev_ExecutorRun = ExecutorRun_hook;
 	ExecutorRun_hook = cmdstatus_ExecutorRun;
 }
 
+
+/*
+ * The ExecutorStart hook.
+ *
+ * Turns on query instrumentation options based on configuration setting.
+ */
+static void
+cmdstatus_ExecutorStart(QueryDesc *queryDesc, int eflags)
+{
+	if (cmdstatus_analyze)
+	{
+		if (cmdstatus_instrument_timer)
+			queryDesc->instrument_options |= INSTRUMENT_TIMER;
+
+		if (cmdstatus_instrument_rows)
+			queryDesc->instrument_options |= INSTRUMENT_ROWS;
+
+		if (cmdstatus_instrument_buffers)
+			queryDesc->instrument_options |= INSTRUMENT_BUFFERS;
+	}
+
+	if (prev_ExecutorStart)
+		prev_ExecutorStart(queryDesc, eflags);
+	else
+		standard_ExecutorStart(queryDesc, eflags);
+}
+
 /*
- * The executor hook.
+ * The ExecutorRun hook.
  *
  * Accumulates the query descriptors on the program stack and takes care of
  * popping the current frame when leaving the function ab-/normally.
@@ -217,9 +254,11 @@ explain_query(QueryDesc *queryDesc)
 	ExplainState   *es;
 
 	es = NewExplainState();
-	es->analyze = false;
+	es->running = true;
+	es->analyze = cmdstatus_analyze;
+	es->timing = cmdstatus_instrument_timer;
 	es->verbose = false;
-	es->buffers = false;
+	es->buffers = cmdstatus_instrument_buffers;
 	es->format = EXPLAIN_FORMAT_TEXT;
 
 	ExplainBeginOutput(es);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7cac9d3..21054de 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1619,6 +1619,46 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"cmdstatus_analyze", PGC_USERSET, STATS,
+			gettext_noop("Automatically instrument all queries for extraction of EXPLAIN ANALYZE plans with pg_cmdstatus() function."),
+			NULL,
+		},
+		&cmdstatus_analyze,
+		false,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"cmdstatus_instrument_timer", PGC_USERSET, STATS,
+			gettext_noop("Automatically instrument all queries for timing, requires cmdstatus_analyze=on."),
+			NULL,
+		},
+		&cmdstatus_instrument_timer,
+		false,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"cmdstatus_instrument_rows", PGC_USERSET, STATS,
+			gettext_noop("Automatically instrument all queries for tracking row counts, requires cmdstatus_analyze=on."),
+			NULL,
+		},
+		&cmdstatus_instrument_rows,
+		false,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"cmdstatus_instrument_buffers", PGC_USERSET, STATS,
+			gettext_noop("Automatically instrument all queries for tracking buffers usage, requires cmdstatus_analyze=on."),
+			NULL,
+		},
+		&cmdstatus_instrument_buffers,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c33e585..a275f51 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -471,6 +471,13 @@
 #log_statement_stats = off
 
 
+# - Automatic Query Instrumentation -
+
+#cmdstatus_analyze = off
+#cmdstatus_instrument_timer = off
+#cmdstatus_instrument_rows = off
+#cmdstatus_instrument_buffers = off
+
 #------------------------------------------------------------------------------
 # AUTOVACUUM PARAMETERS
 #------------------------------------------------------------------------------
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 26fcc5b..3334f41 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -35,6 +35,7 @@ typedef struct ExplainState
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
 	ExplainFormat format;		/* output format */
+	bool		running;		/* explain on a running statement */
 	/* other states */
 	PlannedStmt *pstmt;			/* top of plan */
 	List	   *rtable;			/* range table */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c9a2129..a18b089 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -68,6 +68,6 @@ extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
+extern void InstrEndLoop(Instrumentation *instr, bool partial_run);
 
 #endif   /* INSTRUMENT_H */
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
index d6379f0..a9a508a 100644
--- a/src/include/utils/cmdstatus.h
+++ b/src/include/utils/cmdstatus.h
@@ -16,6 +16,10 @@
 #define CMD_STATUS_DEFAULT_INTERRUPT_TIMEOUT		10	/* milliseconds */
 
 extern int cmdstatus_interrupt_timeout;
+extern bool cmdstatus_analyze;
+extern bool cmdstatus_instrument_timer;
+extern bool cmdstatus_instrument_rows;
+extern bool cmdstatus_instrument_buffers;
 
 extern Size CmdStatusShmemSize(void);
 extern void CmdStatusShmemInit(void);
0001-explain-pid-v6.patchtext/x-patch; charset=US-ASCII; name=0001-explain-pid-v6.patchDownload
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..2e3beaf 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -43,6 +43,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "utils/cmdstatus.h"
 
 
 shmem_startup_hook_type shmem_startup_hook = NULL;
@@ -139,6 +140,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, CmdStatusShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -243,6 +245,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
+	CmdStatusShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 0abde43..e637be1 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -26,6 +26,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/cmdstatus.h"
 
 
 /*
@@ -296,6 +297,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	if (CheckProcSignal(PROCSIG_CMD_STATUS_INFO))
+		HandleCmdStatusInfoInterrupt();
+
 	if (set_latch_on_sigusr1)
 		SetLatch(MyLatch);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d917af3..1a5b03c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/cmdstatus.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2991,6 +2992,9 @@ ProcessInterrupts(void)
 
 	if (ParallelMessagePending)
 		HandleParallelMessages();
+
+	if (CmdStatusInfoPending)
+		ProcessCmdStatusInfoRequest();
 }
 
 
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 3ed0b44..2c8687c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -18,7 +18,7 @@ endif
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-	bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+	bool.o cash.o char.o cmdstatus.o date.o datetime.o datum.o dbsize.o domains.o \
 	encode.o enum.o expandeddatum.o \
 	float.o format_type.o formatting.o genfile.o \
 	geo_ops.o geo_selfuncs.o inet_cidr_ntop.o inet_net_pton.o int.o \
diff --git a/src/backend/utils/adt/cmdstatus.c b/src/backend/utils/adt/cmdstatus.c
new file mode 100644
index 0000000..5078100
--- /dev/null
+++ b/src/backend/utils/adt/cmdstatus.c
@@ -0,0 +1,791 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.c
+ *	  Definitions for pg_cmdstatus function.
+ *
+ * Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/cmdstatus.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+
+#include "access/htup_details.h"
+#include "commands/explain.h"
+#include "lib/stringinfo.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/procsignal.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "tcop/dest.h"
+#include "tcop/pquery.h"
+#include "utils/builtins.h"
+#include "utils/cmdstatus.h"
+
+
+typedef enum {
+	CMD_STATUS_REQUEST_EXPLAIN = 1,
+	CMD_STATUS_REQUEST_QUERY_TEXT = 2,
+	CMD_STATUS_REQUEST_PROGRESS_TAG = 3,
+	CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE = 4
+} CmdStatusInfoRequestType;
+
+#define CMD_STATUS_MAX_REQUEST CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE
+
+typedef enum {
+	CMD_STATUS_RESULT_FAILURE = -1,
+	CMD_STATUS_RESULT_SUCCESS = 0,
+	CMD_STATUS_RESULT_BACKEND_IDLE,
+	CMD_STATUS_RESULT_NO_DATA
+} CmdStatusInfoResultCode;
+
+/*
+ * We allocate an array the size of NumCmdStatusSlots of DSM handles in the
+ * shared memory.  This slots array is indexed by backend ID, so each slot can
+ * corresponds to at max. one backend at any given time.
+ *
+ * In order to actually query some backend for its status, the interested
+ * process will create a DSM segment, initialize a table of contents in it,
+ * fill the target process id and request type and create a shared memory
+ * queue in the segment.
+ *
+ * The interested process will then put the such initialized DSM segment's
+ * handle in its own slot and send the SIGUSR1 with the reason
+ * PROCSIG_CMD_STATUS_INFO to the process being queried.  The interested
+ * process will wait for the response to be delivered via the shared memory
+ * queue and the result code via a field in the DSM segment.
+ *
+ * The layout of a DSM segment is given by the following TOC keys enumeration:
+ */
+typedef enum {
+	CmdStatusSlotKeyTargetProcessID,
+	CmdStatusSlotKeyIsProcessedFlag,
+	CmdStatusSlotKeyRequestType,
+	CmdStatusSlotKeyResultCode,
+	CmdStatusSlotKeyMemoryQueue,
+
+	CmdStatusTotalSlotKeys
+} CmdStatusSlotKey;
+
+#define NumCmdStatusSlots	MaxBackends
+
+#define PG_CMD_STATUS_INFO_MAGIC	0x79fb2449
+#define CMD_STATUS_BUFFER_SIZE		1024
+
+/*
+ * These structs are allocated on the program stack as local variables in the
+ * ExecutorRun hook.  The top of stack is current_query_stack, see below.
+ *
+ * Capturing the execution stack allows us to inspect the inner-most running
+ * query as well as showing the complete backtrace.
+ */
+typedef struct CmdInfoStack {
+	QueryDesc			   *query_desc;
+	struct CmdInfoStack	   *parent;
+} CmdInfoStack;
+
+
+int cmdstatus_interrupt_timeout = CMD_STATUS_DEFAULT_INTERRUPT_TIMEOUT;
+
+/* The array of slots pre-allocated on shared memory. */
+static volatile dsm_handle *CmdStatusSlots = NULL;
+
+/* Pointer into the slots array, to a slot assigned for this backend. */
+static volatile dsm_handle *MyCmdStatusSlot = NULL;
+
+static CmdInfoStack	   *current_query_stack = NULL;
+static int				query_stack_size = 0;  /* XXX not really used */
+
+static ExecutorRun_hook_type	prev_ExecutorRun = NULL;
+
+static void cmdstatus_ExecutorRun(QueryDesc *queryDesc,
+								  ScanDirection direction, long count);
+
+static void CleanupCmdStatusSlot(int status, Datum arg);
+static StringInfo ProcessCmdStatusSlot(CmdStatusInfoRequestType request_type,
+					 CmdStatusInfoResultCode *result_code);
+
+static StringInfo explain_query(QueryDesc *queryDesc);
+static void report_result_code(CmdStatusInfoResultCode result, pid_t target_pid);
+static void RestoreResourceOwner(ResourceOwner prevResourceOwner);
+
+
+Size
+CmdStatusShmemSize(void)
+{
+	return NumCmdStatusSlots * sizeof(dsm_handle);
+}
+
+void
+CmdStatusShmemInit(void)
+{
+	Size		size = CmdStatusShmemSize();
+	bool		found;
+
+	CmdStatusSlots = (dsm_handle *) ShmemInitStruct("CmdStatusSlots", size, &found);
+
+	if (!found)
+		MemSet(CmdStatusSlots, 0, size);
+}
+
+static void
+CleanupCmdStatusSlot(int status, Datum arg)
+{
+	Assert(MyCmdStatusSlot != NULL);
+
+	*MyCmdStatusSlot = 0;	/* just in case */
+
+	MyCmdStatusSlot = NULL;
+}
+
+/*
+ * CmdStatusInit
+ *		Initialize MyCmdStatusSlot
+ *
+ * The passed index should be MyBackendId.
+ */
+void
+CmdStatusInit(int css_idx)
+{
+	MyCmdStatusSlot = &CmdStatusSlots[css_idx - 1];
+
+	on_shmem_exit(CleanupCmdStatusSlot, (Datum) 0);
+
+	/* also install executor hooks */
+	prev_ExecutorRun = ExecutorRun_hook;
+	ExecutorRun_hook = cmdstatus_ExecutorRun;
+}
+
+/*
+ * The executor hook.
+ *
+ * Accumulates the query descriptors on the program stack and takes care of
+ * popping the current frame when leaving the function ab-/normally.
+ */
+static void
+cmdstatus_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
+{
+	CmdInfoStack	current;
+
+	current.query_desc = queryDesc;
+	current.parent = current_query_stack;
+
+	current_query_stack = &current;
+	query_stack_size++;
+
+	PG_TRY();
+	{
+		if (prev_ExecutorRun)
+			prev_ExecutorRun(queryDesc, direction, count);
+		else
+			standard_ExecutorRun(queryDesc, direction, count);
+
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+	}
+	PG_CATCH();
+	{
+		Assert(current_query_stack == &current);
+		Assert(query_stack_size > 0);
+
+		query_stack_size--;
+		current_query_stack = current.parent;
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+/* Produce an explain plan of a single query descriptor. */
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	StringInfo		str;
+	ExplainState   *es;
+
+	es = NewExplainState();
+	es->analyze = false;
+	es->verbose = false;
+	es->buffers = false;
+	es->format = EXPLAIN_FORMAT_TEXT;
+
+	ExplainBeginOutput(es);
+	/* XXX: appendStringInfo(es->str, "#%d ", depth); ? */
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	ExplainEndOutput(es);
+
+	str = es->str;
+
+	pfree(es);
+	return str;
+}
+
+
+/* signal handler for PROCSIG_CMD_STATUS_INFO */
+void
+HandleCmdStatusInfoInterrupt(void)
+{
+	CmdStatusInfoPending = true;
+	InterruptPending = true;
+
+	SetLatch(MyLatch);
+}
+
+/* Produce a response and set the result code for a single slot. */
+static StringInfo
+ProcessCmdStatusSlot(CmdStatusInfoRequestType request_type,
+					 CmdStatusInfoResultCode *result_code)
+{
+	StringInfo		result = NULL;
+
+	/* Show some optimism, overwrite with error code later if needed. */
+	*result_code = CMD_STATUS_RESULT_SUCCESS;
+
+	if (ActivePortal)
+	{
+		switch (request_type)
+		{
+			case CMD_STATUS_REQUEST_EXPLAIN:
+				if (ActivePortal->queryDesc != NULL)
+					result = explain_query(ActivePortal->queryDesc);
+				else if (current_query_stack != NULL)
+					result = explain_query(current_query_stack->query_desc);
+				else
+					*result_code = CMD_STATUS_RESULT_NO_DATA;
+				break;
+
+			case CMD_STATUS_REQUEST_EXPLAIN_BACKTRACE:
+				{
+					StringInfo		str;
+					CmdInfoStack   *query;
+
+					result = makeStringInfo();
+
+					if (current_query_stack != NULL)
+					{
+						for (query = current_query_stack;
+							 query != NULL;
+							 query = query->parent)
+						{
+							str = explain_query(query->query_desc);
+							appendBinaryStringInfo(result, str->data, str->len);
+
+							pfree(str->data);
+							pfree(str);
+						}
+					}
+					else
+						*result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+					break;
+				}
+
+			case CMD_STATUS_REQUEST_QUERY_TEXT:
+				result = makeStringInfo();
+				appendStringInfoString(result, ActivePortal->sourceText);
+				break;
+
+			case CMD_STATUS_REQUEST_PROGRESS_TAG:
+				if (ActivePortal->commandTag != NULL)
+				{
+					result = makeStringInfo();
+
+					if (ActivePortal->queryDesc != NULL &&
+						ActivePortal->queryDesc->estate != NULL)
+					{
+						appendStringInfo(result, "%s %u",
+										 ActivePortal->commandTag,
+										 ActivePortal->queryDesc->estate->es_processed);
+					}
+					else
+					{
+						/* no progress available, at least show the command tag */
+						appendStringInfoString(result, ActivePortal->commandTag);
+					}
+				}
+				else
+					*result_code = CMD_STATUS_RESULT_NO_DATA;
+				break;
+		}
+	}
+	else
+		*result_code = CMD_STATUS_RESULT_BACKEND_IDLE;
+
+	return result;
+}
+
+static void
+RestoreResourceOwner(ResourceOwner prevResourceOwner)
+{
+	if (prevResourceOwner != CurrentResourceOwner)
+	{
+		ResourceOwner	res_owner = CurrentResourceOwner;
+
+		CurrentResourceOwner = prevResourceOwner;
+		ResourceOwnerDelete(res_owner);
+	}
+}
+
+/*
+ * Process all command status requests from other backends by scanning the
+ * whole array of slots looking for the requests targeted at us, which were
+ * not already processed by us during earlier rounds.
+ *
+ * Most of the times it is expected that the first valid slot we find is for
+ * us, so there's no penalty for actually attaching it and looking inside.
+ */
+void
+ProcessCmdStatusInfoRequest(void)
+{
+	MemoryContext	context = CurrentMemoryContext;
+	ResourceOwner	prevResourceOwner = CurrentResourceOwner;
+	struct timeval	tmleft;
+
+	tmleft.tv_sec = 0;
+	tmleft.tv_usec = cmdstatus_interrupt_timeout * 1000L;
+
+	/* Ensure valid resource owner for access to dsm. */
+	if (CurrentResourceOwner == NULL)
+		CurrentResourceOwner = ResourceOwnerCreate(NULL, "ProcessCmdStatusInfoRequest");
+
+	/*
+	 * We might have been signaled again by another process while handling a
+	 * request, so re-check after going through all the slots.
+	 */
+	while (CmdStatusInfoPending)
+	{
+		int		i;
+
+		CmdStatusInfoPending = false;
+
+		for (i = 0; i < NumCmdStatusSlots; i++)
+		{
+			dsm_handle slot = CmdStatusSlots[i];
+
+			if (slot != 0)
+			{
+				dsm_segment	   *seg = NULL;
+				shm_toc		   *toc = NULL;
+				pid_t		   *target_pid_ptr;
+				bool		   *is_processed_flag_ptr;
+				CmdStatusInfoRequestType   *request_type_ptr;
+				CmdStatusInfoResultCode	   *result_code_ptr;
+				shm_mq		   *mq;
+				shm_mq_handle  *output;
+				struct timeval		tmnow;
+				struct timeval		tmthen;
+				struct timeval		tmtemp;
+
+				seg = dsm_attach(slot);
+				if (seg == NULL)
+				{
+					elog(LOG, "unable to map dynamic memory segment for command status");
+					continue;
+				}
+
+				toc = shm_toc_attach(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg));
+				if (toc == NULL)
+				{
+					elog(LOG, "bad magic in dynamic memory segment for command status");
+
+					dsm_detach(seg);
+					continue;
+				}
+
+				target_pid_ptr = (pid_t *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyTargetProcessID);
+				is_processed_flag_ptr = (bool *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyIsProcessedFlag);
+
+				if (*target_pid_ptr != MyProcPid || *is_processed_flag_ptr)
+				{
+					dsm_detach(seg);
+					continue;
+				}
+
+				/* Set the flag early to avoid re-processing during next round. */
+				*is_processed_flag_ptr = true;
+
+				request_type_ptr = (CmdStatusInfoRequestType *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyRequestType);
+
+				result_code_ptr = (CmdStatusInfoResultCode *)
+					shm_toc_lookup(toc, CmdStatusSlotKeyResultCode);
+
+				mq = shm_toc_lookup(toc, CmdStatusSlotKeyMemoryQueue);
+				output = shm_mq_attach(mq, seg, NULL);
+
+				PG_TRY();
+				{
+					StringInfo		payload;
+					int				res;
+
+					payload = ProcessCmdStatusSlot(*request_type_ptr, result_code_ptr);
+					if (payload != NULL)
+					{
+						for (;;)
+						{
+							/* Send the data, including null terminator. */
+							res = shm_mq_send(output, payload->len + 1, payload->data, true);
+							if (res == SHM_MQ_SUCCESS)
+							{
+								break;
+							}
+							else if (res == SHM_MQ_DETACHED)
+							{
+								elog(LOG, "receiving backend has detached from the shared memory queue prematurely");
+								break;
+							}
+							else if (res == SHM_MQ_WOULD_BLOCK)
+							{
+								int		waitres;
+
+								gettimeofday(&tmthen, NULL);
+
+								waitres = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT,
+													tmleft.tv_usec / 1000L);
+								if (waitres & WL_TIMEOUT)
+								{
+									elog(LOG, "timeout while sending command status response");
+									break;
+								}
+
+								if (waitres & WL_LATCH_SET)
+									ResetLatch(MyLatch);
+
+								gettimeofday(&tmnow, NULL);
+
+								/* left = left + then - now */
+								timeradd(&tmleft, &tmthen, &tmtemp);
+								timersub(&tmtemp, &tmnow, &tmleft);
+
+								if (tmleft.tv_sec < 0)
+								{
+									tmleft.tv_sec = 0;
+									tmleft.tv_usec = 0;
+								}
+							}
+						}
+
+						pfree(payload->data);
+						pfree(payload);
+					}
+					/*
+					 * If there was no payload to be sent, we just detach from
+					 * the queue and the receiver should wake up.
+					 */
+
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+				}
+				PG_CATCH();
+				{
+					/*
+					 * Detaching from the queue also wakes up the receiver, be
+					 * sure to update the slot status before detaching.
+					 */
+					shm_mq_detach(mq);
+					dsm_detach(seg);
+
+					MemoryContextSwitchTo(context);
+					PG_TRY();
+					{
+						ErrorData	   *edata = CopyErrorData();
+
+						/* XXX is there a way to just report the current ERROR with LOG instead, without making a copy first? */
+						elog(LOG, "an error occurred while processing command status request: %s",
+							 edata->message);
+
+						FreeErrorData(edata);
+						FlushErrorState();
+					}
+					PG_CATCH();
+					{
+						FlushErrorState();
+						MemoryContextSwitchTo(context);
+
+						elog(LOG, "out of memory while processing command status request");
+					}
+					PG_END_TRY();
+					continue;
+				}
+				PG_END_TRY();
+			}
+		}
+	}
+
+	RestoreResourceOwner(prevResourceOwner);
+}
+
+static void
+report_result_code(CmdStatusInfoResultCode result, pid_t target_pid)
+{
+	switch (result)
+	{
+		case CMD_STATUS_RESULT_SUCCESS:
+			/* nothing to report */
+			break;
+
+		case CMD_STATUS_RESULT_BACKEND_IDLE:
+			elog(INFO, "no command is currently running in backend with PID %d",
+				 target_pid);
+			break;
+
+		case CMD_STATUS_RESULT_NO_DATA:
+			elog(WARNING, "no suitable data found for the query in backend with PID %d",
+				 target_pid);
+			break;
+
+		default:
+			elog(ERROR, "general command status request failure when querying backend with PID %d",
+				 target_pid);
+			break;
+	}
+}
+
+/*
+ * Try to get status of a command running in another process.
+ *
+ * FUNCTION pg_cmdstatus(pid, request_type)
+ * RETURNS SETOF text
+ */
+Datum
+pg_cmdstatus(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	   *tupstore;
+	TupleDesc			tupdesc;
+	MemoryContext		per_query_ctx;
+	MemoryContext		oldcontext;
+	pid_t				target_pid = (pid_t) PG_GETARG_INT32(0);
+	int					request_type = PG_GETARG_INT32(1);
+	int					timeout = PG_GETARG_INT32(2);
+	PGPROC			   *target_proc;
+	shm_toc_estimator	estimator;
+	Size				segsize;
+	dsm_segment		   *seg;
+	shm_toc			   *toc;
+	pid_t			   *target_pid_ptr;
+	bool			   *is_processed_flag_ptr;
+	CmdStatusInfoRequestType   *request_type_ptr;
+	CmdStatusInfoResultCode	   *result_code_ptr;
+	shm_mq			   *mq;
+	shm_mq_handle	   *input;
+	struct timeval		tmnow;
+	struct timeval		tmthen;
+	struct timeval		tmtemp;
+	struct timeval		tmleft;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	if (request_type < 1 || request_type > CMD_STATUS_MAX_REQUEST)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("unknown command status request")));
+
+	if (target_pid == MyProcPid)
+		ereport(ERROR,
+				(errmsg("backend cannot query command status of itself")));
+
+	/* verify access to target_pid */
+	target_proc = BackendPidGetProc(target_pid);
+
+	if (target_proc == NULL)
+		ereport(ERROR,
+				(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+	if (!(superuser() || target_proc->roleId == GetUserId()))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser or have the same role to explain queries running in other server processes"))));
+
+	if (*MyCmdStatusSlot != 0)
+		ereport(ERROR,
+				(errmsg("pg_cmdstatus() call is already in progress")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	/* prepare shared dsm segment */
+	shm_toc_initialize_estimator(&estimator);
+	shm_toc_estimate_keys(&estimator, CmdStatusTotalSlotKeys);
+	shm_toc_estimate_chunk(&estimator, sizeof(pid_t));
+	shm_toc_estimate_chunk(&estimator, sizeof(bool));
+	shm_toc_estimate_chunk(&estimator, sizeof(CmdStatusInfoRequestType));
+	shm_toc_estimate_chunk(&estimator, sizeof(CmdStatusInfoResultCode));
+	shm_toc_estimate_chunk(&estimator, CMD_STATUS_BUFFER_SIZE);
+	segsize = shm_toc_estimate(&estimator);
+
+	seg = dsm_create(segsize, 0);
+
+	toc = shm_toc_create(PG_CMD_STATUS_INFO_MAGIC, dsm_segment_address(seg), segsize);
+
+	target_pid_ptr = (pid_t *) shm_toc_allocate(toc, sizeof(pid_t));
+	*target_pid_ptr = target_pid;
+	shm_toc_insert(toc, CmdStatusSlotKeyTargetProcessID, target_pid_ptr);
+
+	is_processed_flag_ptr = (bool *) shm_toc_allocate(toc, sizeof(bool));
+	*is_processed_flag_ptr = false;
+	shm_toc_insert(toc, CmdStatusSlotKeyIsProcessedFlag, is_processed_flag_ptr);
+
+	request_type_ptr = (CmdStatusInfoRequestType *)
+		shm_toc_allocate(toc, sizeof(CmdStatusInfoRequestType));
+	*request_type_ptr = request_type;
+	shm_toc_insert(toc, CmdStatusSlotKeyRequestType, request_type_ptr);
+
+	result_code_ptr = (CmdStatusInfoResultCode *)
+		shm_toc_allocate(toc, sizeof(CmdStatusInfoResultCode));
+	*result_code_ptr = CMD_STATUS_RESULT_FAILURE;
+	shm_toc_insert(toc, CmdStatusSlotKeyResultCode, result_code_ptr);
+
+	mq = shm_mq_create(shm_toc_allocate(toc, CMD_STATUS_BUFFER_SIZE),
+					   CMD_STATUS_BUFFER_SIZE);
+	shm_toc_insert(toc, CmdStatusSlotKeyMemoryQueue, mq);
+
+	shm_mq_set_receiver(mq, MyProc);
+	shm_mq_set_sender(mq, target_proc);
+
+	PG_TRY();
+	{
+		shm_mq_result	res;
+		char		   *data = NULL;
+		Size			length;
+
+		input = shm_mq_attach(mq, seg, NULL);
+
+		*MyCmdStatusSlot = dsm_segment_handle(seg);
+
+		if (SendProcSignal(target_pid, PROCSIG_CMD_STATUS_INFO, target_proc->backendId))
+			elog(ERROR, "could not signal backend with PID %d", target_pid);
+
+		tmleft.tv_sec = timeout / 1000L;
+		tmleft.tv_usec = (timeout % 1000L) * 1000L;
+
+		for (;;)
+		{
+			/*
+			 * Read without blocking, otherwise cancel request or timeout
+			 * cannot be handled promptly.
+			 */
+			res = shm_mq_receive(input, &length, (void **) &data, true);
+			if (res == SHM_MQ_SUCCESS)
+			{
+				const char	   *p = data;
+
+				/* break into tuples on newline */
+				while (*p)
+				{
+					const char *q = strchr(p, '\n');
+					Size		len = q ? q - p : strlen(p);
+					Datum		value;
+					HeapTuple	tuple;
+					bool		isnull = false;
+
+					value = PointerGetDatum(cstring_to_text_with_len(p, len));
+
+					tuple = heap_form_tuple(tupdesc, &value, &isnull);
+					tuplestore_puttuple(tupstore, tuple);
+
+					if (!q)
+						break;
+
+					p += len + 1;
+				}
+			}
+			else if (res == SHM_MQ_DETACHED)
+			{
+				report_result_code(*result_code_ptr, target_pid);
+				break;
+			}
+			else if (res == SHM_MQ_WOULD_BLOCK)
+			{
+				int		waitres;
+				long	left_ms;
+
+				if (timeout)
+					gettimeofday(&tmthen, NULL);
+
+				/* Wait up to tmleft in ms (or 1 sec if timeout is not set). */
+				left_ms = timeout ? tmleft.tv_sec * 1000L + tmleft.tv_usec / 1000L : 1000L;
+
+				waitres = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, left_ms);
+				if (timeout && (waitres & WL_TIMEOUT) != 0)
+				{
+					elog(INFO, "timeout waiting for reply from backend with PID %d",
+						 target_pid);
+					break;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+				if (waitres & WL_LATCH_SET)
+					ResetLatch(MyLatch);
+
+				if (timeout)
+				{
+					gettimeofday(&tmnow, NULL);
+
+					/* left = left + then - now */
+					timeradd(&tmleft, &tmthen, &tmtemp);
+					timersub(&tmtemp, &tmnow, &tmleft);
+
+					if (tmleft.tv_sec < 0)
+					{
+						tmleft.tv_sec = 0;
+						tmleft.tv_usec = 0;
+					}
+				}
+			}
+		}
+
+		/* clean up and return the tuplestore */
+		tuplestore_donestoring(tupstore);
+
+		*MyCmdStatusSlot = 0;
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+	}
+	PG_CATCH();
+	{
+		*MyCmdStatusSlot = 0;
+
+		shm_mq_detach(mq);
+		dsm_detach(seg);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 23e594e..2269853 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -29,6 +29,7 @@ ProtocolVersion FrontendProtocol;
 volatile bool InterruptPending = false;
 volatile bool QueryCancelPending = false;
 volatile bool ProcDiePending = false;
+volatile bool CmdStatusInfoPending = false;
 volatile bool ClientConnectionLost = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7b19714..7e7ba77 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
+#include "utils/cmdstatus.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -588,6 +589,9 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 	/* Now that we have a BackendId, we can participate in ProcSignal */
 	ProcSignalInit(MyBackendId);
 
+	/* Init CmdStatus slot */
+	CmdStatusInit(MyBackendId);
+
 	/*
 	 * Also set up timeout handlers needed for backend operation.  We need
 	 * these in every case except bootstrap.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..7cac9d3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -73,6 +73,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/cmdstatus.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
@@ -2662,6 +2663,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"cmdstatus_interrupt_timeout", PGC_USERSET, STATS,
+			gettext_noop("Timeout for processing all pending command status requests."),
+			NULL,
+		 GUC_UNIT_MS /* min max */
+		},
+		&cmdstatus_interrupt_timeout,
+		CMD_STATUS_DEFAULT_INTERRUPT_TIMEOUT, 1, 1000,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..5edbfd0 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -3130,6 +3130,8 @@ DESCR("get OID of current session's temp schema, if any");
 DATA(insert OID = 2855 (  pg_is_other_temp_schema	PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "26" _null_ _null_ _null_ _null_ _null_ pg_is_other_temp_schema _null_ _null_ _null_ ));
 DESCR("is schema another session's temp schema?");
 
+DATA(insert OID = 4099 ( pg_cmdstatus		PGNSP PGUID 12 1 100 0 0 f f f f f t s 3 0 25 "23 23 23" _null_ _null_  _null_  _null_ _null_ pg_cmdstatus _null_ _null_ _null_ ));
+DESCR("returns information about another process");
 DATA(insert OID = 2171 ( pg_cancel_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_cancel_backend _null_ _null_ _null_ ));
 DESCR("cancel a server process' current query");
 DATA(insert OID = 2096 ( pg_terminate_backend		PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 16 "23" _null_ _null_ _null_ _null_ _null_ pg_terminate_backend _null_ _null_ _null_ ));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e0cc69f..5f03f63 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -80,6 +80,7 @@
 extern PGDLLIMPORT volatile bool InterruptPending;
 extern PGDLLIMPORT volatile bool QueryCancelPending;
 extern PGDLLIMPORT volatile bool ProcDiePending;
+extern PGDLLIMPORT volatile bool CmdStatusInfoPending;
 
 extern volatile bool ClientConnectionLost;
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index af1a0cd..cd856e5 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -41,6 +41,8 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+	PROCSIG_CMD_STATUS_INFO,
+
 	NUM_PROCSIGNALS				/* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..605612e 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1243,6 +1243,9 @@ extern Datum pg_identify_object_as_address(PG_FUNCTION_ARGS);
 /* catalog/objectaddress.c */
 extern Datum pg_get_object_address(PG_FUNCTION_ARGS);
 
+/* utils/adt/cmdstatus.c */
+extern Datum pg_cmdstatus(PG_FUNCTION_ARGS);
+
 /* commands/constraint.c */
 extern Datum unique_key_recheck(PG_FUNCTION_ARGS);
 
diff --git a/src/include/utils/cmdstatus.h b/src/include/utils/cmdstatus.h
new file mode 100644
index 0000000..d6379f0
--- /dev/null
+++ b/src/include/utils/cmdstatus.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * cmdstatus.h
+ *	  Declarations for command status interrupt handling.
+ *
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ *
+ * src/include/utils/cmdstatus.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CMDSTATUS_H
+#define CMDSTATUS_H
+
+#define CMD_STATUS_DEFAULT_INTERRUPT_TIMEOUT		10	/* milliseconds */
+
+extern int cmdstatus_interrupt_timeout;
+
+extern Size CmdStatusShmemSize(void);
+extern void CmdStatusShmemInit(void);
+extern void CmdStatusInit(int css_idx);
+
+extern void HandleCmdStatusInfoInterrupt(void);
+extern void ProcessCmdStatusInfoRequest(void);
+
+#endif   /* CMDSTATUS_H */
#52Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#51)
Re: On-demand running query plans using auto_explain and signals

2015-09-16 16:31 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Tue, Sep 15, 2015 at 11:00 AM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Mon, Sep 14, 2015 at 7:27 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-14 18:46 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

... This way the receiver only writes to the slot and the sender only
reads from it.

By the way, is it safe to assume atomic read/writes of dsm_handle
(uint32)? I would be surprised if not.

I don't see any reason why it should not to work - only few processes
will wait for data - so lost attach/detach shm operations will not be too
much.

Please see attached for implementation of this approach. The most
surprising thing is that it actually works :)

One problem still remains with the process requesting information when
the target process exits before it can have a chance to handle the signal.
The requesting process then waits forever, because nobody attaches/detaches
the queue. We've discussed this before and looks like we need to introduce
a timeout parameter to pg_cmdstatus()...

I've added the timeout parameter to the pg_cmdstatus call, and more
importantly to the sending side of the queue, so that one can limit the
potential effect of handling the interrupt in case something goes really
wrong.

I don't think so introduction new user visible timer is good idea. This
timer should be invisible

My idea - send a signal and wait 1sec, then check if target process is
living still. Stop if not. Wait next 5 sec, then check target process. Stop
if this process doesn't live or raise warning "requested process doesn't
communicate, waiting.." The final cancel should be done by
statement_timeout.

what do you think about it?

I've tested a number of possible scenarios with artificial delays in reply
and sending cancel request or SIGTERM to the receiving side and this is all
seems to work nicely due to proper message queue detach event
notification. Still I think it helps to know that there is a timeout in
case the receiving side is really slow to read the message.

I've also decided we really ought to suppress any possible ERROR level
messages generated during the course of processing the status request in
order not to prevent the originally running transaction to complete. The
errors so caught are just logged using LOG level and ignored in this new
version of the patch.

I'm also submitting the instrumentation support as a separate patch on top
of this. I'm not really fond of adding bool parameter to InstrEndLoop, but
for now I didn't find any better way.

What I'm now thinking about is probably we can extend this backend
communication mechanism in order to query GUC values effective in a
different backend or even setting the values. The obvious candidate to
check when you see some query misbehaving would be work_mem, for example.
Or we could provide a report of all settings that were overridden in the
backend's session, to the effect of running something like this:

select * from pg_settings where context = 'user' and setting != reset_val;

this is a good idea. More times I interested what is current setting of
query. We cannot to use simple query - because it is not possible to push
PID to function simply, but you can to write function pg_settings_pid() so
usage can look like

select * from pg_settings_pid(xxxx, possible other params) where ...

The obvious candidates to be set externally are the
cmdstatus_analyze/instrument_*: when you decided you want to turn them on,
you'd rather do that carefully for a single backend than globally
per-cluster. One can still modify the postgresql.conf and then send SIGHUP
to just a single backend, but I think a more direct way to alter the
settings would be better.

I am 100% for possibility to read the setting. But reconfiguration from
other process is too hot coffee - it can be available via extension, but
not via usually used tools.

In this light should we rename the API to something like "backend control"
instead of "command status"? The SHOW/SET syntax could be extended to
support the remote setting retrieval/update.

prepare API, and this functionality can be part of referential
implementation in contrib.

This patch should not to have too range be finished in this release cycle.

Regards

Pavel

Show quoted text

--
Alex

#53Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#52)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 16, 2015 at 8:07 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-16 16:31 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

I've added the timeout parameter to the pg_cmdstatus call, and more
importantly to the sending side of the queue, so that one can limit the
potential effect of handling the interrupt in case something goes really
wrong.

I don't think so introduction new user visible timer is good idea. This
timer should be invisible

My idea - send a signal and wait 1sec, then check if target process is
living still. Stop if not. Wait next 5 sec, then check target process. Stop
if this process doesn't live or raise warning "requested process doesn't
communicate, waiting.." The final cancel should be done by
statement_timeout.

what do you think about it?

That won't work really well with something like I use to do when testing
this patch, namely:

postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from
pg_stat_activity where pid<>pg_backend_pid() \watch 1

while also running pgbench with -C option (to create new connection for
every transaction). When a targeted backend exits before it can handle the
signal, the receiving process keeps waiting forever.

The statement_timeout in this case will stop the whole select, not just
individual function call. Unless you wrap it to set statement_timeout and
catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole
select. The ability to set a (short) timeout for the function itself
proves to be a really useful feature to me.

We can still communicate some warnings to the client when no timeout is
specified (and make 0 the default value actually).

What I'm now thinking about is probably we can extend this backend

communication mechanism in order to query GUC values effective in a
different backend or even setting the values. The obvious candidate to
check when you see some query misbehaving would be work_mem, for example.
Or we could provide a report of all settings that were overridden in the
backend's session, to the effect of running something like this:

select * from pg_settings where context = 'user' and setting != reset_val;

this is a good idea. More times I interested what is current setting of
query. We cannot to use simple query - because it is not possible to push
PID to function simply, but you can to write function pg_settings_pid() so
usage can look like

select * from pg_settings_pid(xxxx, possible other params) where ...

I would rather have a more general way to run *readonly* queries in the
other backend than invent a new function for every occurrence.

The obvious candidates to be set externally are the

cmdstatus_analyze/instrument_*: when you decided you want to turn them on,
you'd rather do that carefully for a single backend than globally
per-cluster. One can still modify the postgresql.conf and then send SIGHUP
to just a single backend, but I think a more direct way to alter the
settings would be better.

I am 100% for possibility to read the setting. But reconfiguration from
other process is too hot coffee - it can be available via extension, but
not via usually used tools.

It can be reserved to superuser, and nobody is forcing one to use it
anyway. :-)

In this light should we rename the API to something like "backend control"

instead of "command status"? The SHOW/SET syntax could be extended to
support the remote setting retrieval/update.

prepare API, and this functionality can be part of referential
implementation in contrib.

This patch should not to have too range be finished in this release cycle.

These are just the thoughts on what could be achieved using this
cross-backend communication mechanism and ideas for generalization of the
API.

--
Alex

#54Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#53)
Re: On-demand running query plans using auto_explain and signals

2015-09-17 11:55 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Wed, Sep 16, 2015 at 8:07 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-16 16:31 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

I've added the timeout parameter to the pg_cmdstatus call, and more
importantly to the sending side of the queue, so that one can limit the
potential effect of handling the interrupt in case something goes really
wrong.

I don't think so introduction new user visible timer is good idea. This
timer should be invisible

My idea - send a signal and wait 1sec, then check if target process is
living still. Stop if not. Wait next 5 sec, then check target process. Stop
if this process doesn't live or raise warning "requested process doesn't
communicate, waiting.." The final cancel should be done by
statement_timeout.

what do you think about it?

That won't work really well with something like I use to do when testing
this patch, namely:

postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from
pg_stat_activity where pid<>pg_backend_pid() \watch 1

while also running pgbench with -C option (to create new connection for
every transaction). When a targeted backend exits before it can handle the
signal, the receiving process keeps waiting forever.

no - every timeout you have to check, if targeted backend is living still,
if not you will do cancel. It is 100% safe.

The statement_timeout in this case will stop the whole select, not just
individual function call. Unless you wrap it to set statement_timeout and
catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole
select. The ability to set a (short) timeout for the function itself
proves to be a really useful feature to me.

you cannot to handle QUERY_CANCELED in plpgsql. There is need some internal
timeout - but this timeout should not be visible - any new GUC increase
PostgreSQL complexity - and there is not a reason why do it.

We can still communicate some warnings to the client when no timeout is
specified (and make 0 the default value actually).

What I'm now thinking about is probably we can extend this backend

communication mechanism in order to query GUC values effective in a
different backend or even setting the values. The obvious candidate to
check when you see some query misbehaving would be work_mem, for example.
Or we could provide a report of all settings that were overridden in the
backend's session, to the effect of running something like this:

select * from pg_settings where context = 'user' and setting !=
reset_val;

this is a good idea. More times I interested what is current setting of
query. We cannot to use simple query - because it is not possible to push
PID to function simply, but you can to write function pg_settings_pid() so
usage can look like

select * from pg_settings_pid(xxxx, possible other params) where ...

I would rather have a more general way to run *readonly* queries in the
other backend than invent a new function for every occurrence.

The obvious candidates to be set externally are the

cmdstatus_analyze/instrument_*: when you decided you want to turn them on,
you'd rather do that carefully for a single backend than globally
per-cluster. One can still modify the postgresql.conf and then send SIGHUP
to just a single backend, but I think a more direct way to alter the
settings would be better.

I am 100% for possibility to read the setting. But reconfiguration from
other process is too hot coffee - it can be available via extension, but
not via usually used tools.

It can be reserved to superuser, and nobody is forcing one to use it
anyway. :-)

In this light should we rename the API to something like "backend control"

instead of "command status"? The SHOW/SET syntax could be extended to
support the remote setting retrieval/update.

prepare API, and this functionality can be part of referential
implementation in contrib.

This patch should not to have too range be finished in this release cycle.

These are just the thoughts on what could be achieved using this
cross-backend communication mechanism and ideas for generalization of the
API.

ok

Show quoted text

--
Alex

#55Robert Haas
robertmhaas@gmail.com
In reply to: Shulgin, Oleksandr (#50)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 15, 2015 at 5:00 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

Please see attached for implementation of this approach. The most
surprising thing is that it actually works :)

It's cool to see these interesting ideas for using some of the code
I've been working on for the last couple of years. However, it seems
to me that a better design would avoid the use of shm_mq. Instead of
having the guy who asks the question create a DSM, have the guy who
sends back the answer create it. Then, he can just make the DSM big
enough to store the entire string. I think that's going to be a lot
more robust than what you have here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56Robert Haas
robertmhaas@gmail.com
In reply to: Shulgin, Oleksandr (#51)
Re: On-demand running query plans using auto_explain and signals

On Wed, Sep 16, 2015 at 10:31 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

I've also decided we really ought to suppress any possible ERROR level
messages generated during the course of processing the status request in
order not to prevent the originally running transaction to complete. The
errors so caught are just logged using LOG level and ignored in this new
version of the patch.

This plan has almost zero chance of surviving committer-level
scrutiny, and if it somehow does, some other committer will scream
bloody murder as soon as they notice it.

It's not safe to just catch an error and continue on executing. You
have to abort the (sub)transaction once an error happens.

Of course, this gets at a pretty crucial design question for this
patch, which is whether it's OK for one process to interrogate another
process's state, and what the consequences of that might be. What
permissions are needed? Can you DOS the guy running a query by
pinging him for status at a very high rate?

The other question here is whether it's really safe to do this.
ProcessInterrupts() gets called at lots of places in the code, and we
may freely add more in the future. How are you going to prove that
ProcessCmdStatusInfoRequest() is safe to call in all of those places?
How do you know that some data structure you find by chasing down from
ActivePortal or current_query_stack doesn't have a NULL pointer
someplace that you don't expect, because the code that initializes
that pointer hasn't run yet?

I'd like to see this made to work and be safe, but I think proving
that it's truly robust in all cases is going to be hard.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#54)
Re: On-demand running query plans using auto_explain and signals

On Thu, Sep 17, 2015 at 12:06 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

That won't work really well with something like I use to do when testing
this patch, namely:

postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from
pg_stat_activity where pid<>pg_backend_pid() \watch 1

while also running pgbench with -C option (to create new connection for
every transaction). When a targeted backend exits before it can handle the
signal, the receiving process keeps waiting forever.

no - every timeout you have to check, if targeted backend is living still,
if not you will do cancel. It is 100% safe.

But then you need to make this internal timeout rather short, not 1s as
originally suggested.

The statement_timeout in this case will stop the whole select, not just

individual function call. Unless you wrap it to set statement_timeout and
catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole
select. The ability to set a (short) timeout for the function itself
proves to be a really useful feature to me.

you cannot to handle QUERY_CANCELED in plpgsql.

Well, you can but its not that useful of course:

=# create or replace function test_query_cancel() returns void language
plpgsql as $$ begin
perform pg_sleep(1);
exception when query_canceled then raise notice 'cancel';
end; $$;
CREATE FUNCTION
=# set statement_timeout to '100ms';
SET
=# select test_query_cancel();
NOTICE: cancel
test_query_cancel
-------------------

(1 row)
=# select test_query_cancel() from generate_series(1, 100) x;
NOTICE: cancel
^CCancel request sent
NOTICE: cancel
^CCancel request sent

Now you cannot cancel this query unless you do pg_terminate_backend() or
equivalent.

There is need some internal timeout - but this timeout should not be

visible - any new GUC increase PostgreSQL complexity - and there is not a
reason why do it.

But the GUC was added for the timeout on the sending side, not the
receiving one. There is no "one value fits all" for this, but you would
still want to make the effects of this as limited as possible.

--
Alex

#58Pavel Stehule
pavel.stehule@gmail.com
In reply to: Robert Haas (#55)
Re: On-demand running query plans using auto_explain and signals

2015-09-17 14:06 GMT+02:00 Robert Haas <robertmhaas@gmail.com>:

On Tue, Sep 15, 2015 at 5:00 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

Please see attached for implementation of this approach. The most
surprising thing is that it actually works :)

It's cool to see these interesting ideas for using some of the code
I've been working on for the last couple of years. However, it seems
to me that a better design would avoid the use of shm_mq. Instead of
having the guy who asks the question create a DSM, have the guy who
sends back the answer create it. Then, he can just make the DSM big
enough to store the entire string. I think that's going to be a lot
more robust than what you have here.

Please, can you explain what is wrong on using of shm_mq? It works pretty
well now.

There can be a contra argument, why don't use shm_mq, because it does data
exchange between processes and this patch introduce new pattern for same
thing.

Regards

Pavel

Show quoted text

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#59Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#57)
Re: On-demand running query plans using auto_explain and signals

2015-09-17 16:06 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Thu, Sep 17, 2015 at 12:06 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

That won't work really well with something like I use to do when testing
this patch, namely:

postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from
pg_stat_activity where pid<>pg_backend_pid() \watch 1

while also running pgbench with -C option (to create new connection for
every transaction). When a targeted backend exits before it can handle the
signal, the receiving process keeps waiting forever.

no - every timeout you have to check, if targeted backend is living
still, if not you will do cancel. It is 100% safe.

But then you need to make this internal timeout rather short, not 1s as
originally suggested.

can be - 1 sec is max, maybe 100ms is optimum.

The statement_timeout in this case will stop the whole select, not just

individual function call. Unless you wrap it to set statement_timeout and
catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole
select. The ability to set a (short) timeout for the function itself
proves to be a really useful feature to me.

you cannot to handle QUERY_CANCELED in plpgsql.

Well, you can but its not that useful of course:

hmm, some is wrong - I remember from some older plpgsql, so CANCEL message
is not catchable. Maybe I have bad memory. I have to recheck it.

=# create or replace function test_query_cancel() returns void language
plpgsql as $$ begin
perform pg_sleep(1);
exception when query_canceled then raise notice 'cancel';
end; $$;
CREATE FUNCTION
=# set statement_timeout to '100ms';
SET
=# select test_query_cancel();
NOTICE: cancel
test_query_cancel
-------------------

(1 row)
=# select test_query_cancel() from generate_series(1, 100) x;
NOTICE: cancel
^CCancel request sent
NOTICE: cancel
^CCancel request sent

Now you cannot cancel this query unless you do pg_terminate_backend() or
equivalent.

There is need some internal timeout - but this timeout should not be

visible - any new GUC increase PostgreSQL complexity - and there is not a
reason why do it.

But the GUC was added for the timeout on the sending side, not the
receiving one. There is no "one value fits all" for this, but you would
still want to make the effects of this as limited as possible.

I still believe so any new GUC is not necessary. If you don't like
statement_timeout, we can copy the behave of CREATE DATABASE - there are
few 5sec cycles (when template1 is occupated) and break.

Regards

Pavel

Show quoted text

--
Alex

#60Robert Haas
robertmhaas@gmail.com
In reply to: Pavel Stehule (#58)
Re: On-demand running query plans using auto_explain and signals

On Thu, Sep 17, 2015 at 10:28 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Please, can you explain what is wrong on using of shm_mq? It works pretty
well now.

There can be a contra argument, why don't use shm_mq, because it does data
exchange between processes and this patch introduce new pattern for same
thing.

Sure, I can explain that.

First, when you communication through a shm_mq, the writer has to wait
when the queue is full, and the reader has to wait when the queue is
empty. If the message is short enough to fit in the queue, then you
can send it and be done without blocking. But if it is long, then you
will have each process repeatedly waiting for the other one until the
whole message has been sent. That is going to make sending the
message potentially a lot slower than writing it all in one go,
especially if the system is under heavy load.

Also, if there are any bugs in the way the shm_mq is being used,
they're likely to be quite rare and hard to find, because the vast
majority of messages will probably be short enough to be sent in a
single chunk, so whatever bugs may exist when the processes play
ping-ping are unlikely to occur in practice except in unusual cases
where the message being returned is very long.

Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.

Using a shm_mq is appropriate when the amount of data that needs to be
transmitted might be very large - too large to just allocate a buffer
for the whole thing - or when the amount of data can't be predicted
before memory is allocated. But there is obviously no rule that a
shm_mq should be used any time we have "data exchange between
processes"; we have lots of shared-memory based IPC already and have
for many years, and shm_mq is newer than the vast majority of that
code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#61Pavel Stehule
pavel.stehule@gmail.com
In reply to: Robert Haas (#60)
Re: On-demand running query plans using auto_explain and signals

2015-09-17 16:46 GMT+02:00 Robert Haas <robertmhaas@gmail.com>:

On Thu, Sep 17, 2015 at 10:28 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Please, can you explain what is wrong on using of shm_mq? It works pretty
well now.

There can be a contra argument, why don't use shm_mq, because it does

data

exchange between processes and this patch introduce new pattern for same
thing.

Sure, I can explain that.

First, when you communication through a shm_mq, the writer has to wait
when the queue is full, and the reader has to wait when the queue is
empty. If the message is short enough to fit in the queue, then you
can send it and be done without blocking. But if it is long, then you
will have each process repeatedly waiting for the other one until the
whole message has been sent. That is going to make sending the
message potentially a lot slower than writing it all in one go,
especially if the system is under heavy load.

Is there some risk if we take too much large DSM segment? And what is max
size of DSM segment? When we use shm_mq, we don't need to solve limits.

Also, if there are any bugs in the way the shm_mq is being used,
they're likely to be quite rare and hard to find, because the vast
majority of messages will probably be short enough to be sent in a
single chunk, so whatever bugs may exist when the processes play
ping-ping are unlikely to occur in practice except in unusual cases
where the message being returned is very long.

This is true for any functionality based on shm_mq - parallel seq scan,

Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.

It is valid, and probably most important. But if we introduce own
mechanism, we will play with process latch too (although we can use LWlocks)

Using a shm_mq is appropriate when the amount of data that needs to be
transmitted might be very large - too large to just allocate a buffer
for the whole thing - or when the amount of data can't be predicted
before memory is allocated. But there is obviously no rule that a
shm_mq should be used any time we have "data exchange between
processes"; we have lots of shared-memory based IPC already and have
for many years, and shm_mq is newer than the vast majority of that
code.

I am little bit disappointed - I hoped so shm_mq can be used as generic
interprocess mechanism - that will ensure all corner cases for work with
shared memory. I understand to shm_mq is new, and nobody used it, but this
risk is better than invent wheels again and again.

Regards

Pavel

Show quoted text

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#62Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Robert Haas (#56)
Re: On-demand running query plans using auto_explain and signals

On Thu, Sep 17, 2015 at 2:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 16, 2015 at 10:31 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

I've also decided we really ought to suppress any possible ERROR level
messages generated during the course of processing the status request in
order not to prevent the originally running transaction to complete. The
errors so caught are just logged using LOG level and ignored in this new
version of the patch.

This plan has almost zero chance of surviving committer-level
scrutiny, and if it somehow does, some other committer will scream
bloody murder as soon as they notice it.

It's not safe to just catch an error and continue on executing. You
have to abort the (sub)transaction once an error happens.

Hm... is this really different from WHEN EXCEPTION clause in plpgsql?

Of course, this gets at a pretty crucial design question for this

patch, which is whether it's OK for one process to interrogate another
process's state, and what the consequences of that might be. What
permissions are needed?

I think superuser or same user is pretty reasonable requirement. Can be
limited to superuser only if there are some specific concerns.

Can you DOS the guy running a query by

pinging him for status at a very high rate?

Probably, but if you've got enough permissions to do that
pg_cancel/terminate_backend() would be easier.

The other question here is whether it's really safe to do this.

ProcessInterrupts() gets called at lots of places in the code, and we
may freely add more in the future. How are you going to prove that
ProcessCmdStatusInfoRequest() is safe to call in all of those places?
How do you know that some data structure you find by chasing down from
ActivePortal or current_query_stack doesn't have a NULL pointer
someplace that you don't expect, because the code that initializes
that pointer hasn't run yet?

I'd like to see this made to work and be safe, but I think proving
that it's truly robust in all cases is going to be hard.

Yes, this is a perfectly good point. For the purpose of obtaining the
explain plans, I believe anything we capture in ExecutorRun ought to be
safe to run Explain on: the plannedstmt is there and isn't going to change.

In a more general case, if we seek to extend this approach, then yes one
should be very careful in what is allowed to run during ProcessInterrupts().

What we could do instead to be extra-safe, is make the running backend
prepare the information for other backends to obtain from the shared
memory. Explain plans can be easily captured at the ExecutorRun hook. The
obvious downside is that you can no longer choose the explain format, or
obtain partially collected instrumentation data.

--
Alex

#63Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#61)
Re: On-demand running query plans using auto_explain and signals

On Thu, Sep 17, 2015 at 5:16 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-17 16:46 GMT+02:00 Robert Haas <robertmhaas@gmail.com>:

Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.

It is valid, and probably most important. But if we introduce own
mechanism, we will play with process latch too (although we can use LWlocks)

Careful manipulation of the process latch is a valid point. Probably we
were way too optimistic about the limits we can hit with this approach. :-(

But if we make the sender backend create the DSM with the response payload,
it suddenly becomes really unclear at which point and who should make the
final detach of that DSM. We're getting back to the problem of
synchronization between these processes all over again.

--
Alex

#64Robert Haas
robertmhaas@gmail.com
In reply to: Pavel Stehule (#61)
Re: On-demand running query plans using auto_explain and signals

On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Is there some risk if we take too much large DSM segment? And what is max
size of DSM segment? When we use shm_mq, we don't need to solve limits.

I can't really think you are going to have a problem. How much data
do you really intend to send back? Surely it's going to be <100kB.
If you think it's not a problem to have a running query stop and send
a gigabyte of data someplace anytime somebody asks, well, I don't
think I agree.

Also, if there are any bugs in the way the shm_mq is being used,
they're likely to be quite rare and hard to find, because the vast
majority of messages will probably be short enough to be sent in a
single chunk, so whatever bugs may exist when the processes play
ping-ping are unlikely to occur in practice except in unusual cases
where the message being returned is very long.

This is true for any functionality based on shm_mq - parallel seq scan,

Parallel sequential scan is likely to put a lot more data through a
shm_mq than you would for this.

Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.

It is valid, and probably most important. But if we introduce own mechanism,
we will play with process latch too (although we can use LWlocks)

With the design I proposed, there is zero need to touch the process
latch, which is good, because I'm pretty sure that is going to be a
problem. I don't think there is any need to use LWLocks here either.
When you get a request for data, you can just publish a DSM segment
with the data and that's it. Why do you need anything more? You
could set the requestor's latch if it's convenient; that wouldn't be a
problem. But the process supplying the data can't end up in a
different state than it was before supplying that data, or stuff WILL
break.

Using a shm_mq is appropriate when the amount of data that needs to be
transmitted might be very large - too large to just allocate a buffer
for the whole thing - or when the amount of data can't be predicted
before memory is allocated. But there is obviously no rule that a
shm_mq should be used any time we have "data exchange between
processes"; we have lots of shared-memory based IPC already and have
for many years, and shm_mq is newer than the vast majority of that
code.

I am little bit disappointed - I hoped so shm_mq can be used as generic
interprocess mechanism - that will ensure all corner cases for work with
shared memory. I understand to shm_mq is new, and nobody used it, but this
risk is better than invent wheels again and again.

shm_mq is useful, but if you insist on using a complicated tool when a
simple one is plenty sufficient, you may not get the results you're
hoping for.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#65Pavel Stehule
pavel.stehule@gmail.com
In reply to: Pavel Stehule (#59)
Re: On-demand running query plans using auto_explain and signals

2015-09-17 16:37 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:

2015-09-17 16:06 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Thu, Sep 17, 2015 at 12:06 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

That won't work really well with something like I use to do when
testing this patch, namely:

postgres=# select pid, array(select pg_cmdstatus(pid, 1, 10)) from
pg_stat_activity where pid<>pg_backend_pid() \watch 1

while also running pgbench with -C option (to create new connection for
every transaction). When a targeted backend exits before it can handle the
signal, the receiving process keeps waiting forever.

no - every timeout you have to check, if targeted backend is living
still, if not you will do cancel. It is 100% safe.

But then you need to make this internal timeout rather short, not 1s as
originally suggested.

can be - 1 sec is max, maybe 100ms is optimum.

The statement_timeout in this case will stop the whole select, not just

individual function call. Unless you wrap it to set statement_timeout and
catch QUERY_CANCELED in plpgsql, but then you won't be able to ^C the whole
select. The ability to set a (short) timeout for the function itself
proves to be a really useful feature to me.

you cannot to handle QUERY_CANCELED in plpgsql.

Well, you can but its not that useful of course:

hmm, some is wrong - I remember from some older plpgsql, so CANCEL message
is not catchable. Maybe I have bad memory. I have to recheck it.

it is ok. I didn't remeber well this behave. You cannot to catch CANCEL
(and today ASSERT) in OTHER handler. It can be handled if it is explicitly
mentioned.

Regards

Pavel

#66Pavel Stehule
pavel.stehule@gmail.com
In reply to: Robert Haas (#64)
Re: On-demand running query plans using auto_explain and signals

2015-09-17 22:13 GMT+02:00 Robert Haas <robertmhaas@gmail.com>:

On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Is there some risk if we take too much large DSM segment? And what is max
size of DSM segment? When we use shm_mq, we don't need to solve limits.

I can't really think you are going to have a problem. How much data
do you really intend to send back? Surely it's going to be <100kB.
If you think it's not a problem to have a running query stop and send
a gigabyte of data someplace anytime somebody asks, well, I don't
think I agree.

I afraid so <100kB is too optimistic. I know so GoodData environment is a
exception - it uses query generator, but I found few plans >1MB . It is
unusual probably due long identifiers too - used 63bytes long hash - and
some queries and models are strange. It does translation between analytic
multi dimensional business oriented query language and SQL. Back to topic.
With high probability we can calculate <10MB.

Also, if there are any bugs in the way the shm_mq is being used,
they're likely to be quite rare and hard to find, because the vast
majority of messages will probably be short enough to be sent in a
single chunk, so whatever bugs may exist when the processes play
ping-ping are unlikely to occur in practice except in unusual cases
where the message being returned is very long.

This is true for any functionality based on shm_mq - parallel seq scan,

Parallel sequential scan is likely to put a lot more data through a
shm_mq than you would for this.

Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.

It is valid, and probably most important. But if we introduce own

mechanism,

we will play with process latch too (although we can use LWlocks)

With the design I proposed, there is zero need to touch the process
latch, which is good, because I'm pretty sure that is going to be a
problem. I don't think there is any need to use LWLocks here either.
When you get a request for data, you can just publish a DSM segment
with the data and that's it. Why do you need anything more? You
could set the requestor's latch if it's convenient; that wouldn't be a
problem. But the process supplying the data can't end up in a
different state than it was before supplying that data, or stuff WILL
break.

sure - but the same behave have to have shm_mq - if not, then only one can
be used for communication between process - that is little bit limiting.

Show quoted text

Using a shm_mq is appropriate when the amount of data that needs to be
transmitted might be very large - too large to just allocate a buffer
for the whole thing - or when the amount of data can't be predicted
before memory is allocated. But there is obviously no rule that a
shm_mq should be used any time we have "data exchange between
processes"; we have lots of shared-memory based IPC already and have
for many years, and shm_mq is newer than the vast majority of that
code.

I am little bit disappointed - I hoped so shm_mq can be used as generic
interprocess mechanism - that will ensure all corner cases for work with
shared memory. I understand to shm_mq is new, and nobody used it, but

this

risk is better than invent wheels again and again.

shm_mq is useful, but if you insist on using a complicated tool when a
simple one is plenty sufficient, you may not get the results you're
hoping for.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#67Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Robert Haas (#64)
Re: On-demand running query plans using auto_explain and signals

On Thu, Sep 17, 2015 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.

It is valid, and probably most important. But if we introduce own

mechanism,

we will play with process latch too (although we can use LWlocks)

With the design I proposed, there is zero need to touch the process
latch, which is good, because I'm pretty sure that is going to be a
problem. I don't think there is any need to use LWLocks here either.
When you get a request for data, you can just publish a DSM segment
with the data and that's it. Why do you need anything more? You
could set the requestor's latch if it's convenient; that wouldn't be a
problem. But the process supplying the data can't end up in a
different state than it was before supplying that data, or stuff WILL
break.

There is still the whole problem of where exactly the backend being queried
for the status should publish that DSM segment and when to free it?

If it's a location shared between all backends, there should be locking
around it. Probably this is not a big problem, if you don't expect all the
backends start querying each other rapidly. That is how it was implemented
in the first versions of this patch actually.

If we take the per-backend slot approach the locking seems unnecessary and
there are principally two options:

1) The backend puts the DSM handle in its own slot and notifies the
requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and
notifies it).

If we go with the first option, the backend that has created the DSM will
not know when it's OK to free it, so this has to be responsibility of the
requester. If the latter exits before reading and freeing the DSM, we have
a leak. Even bigger is the problem that the sender backend can no longer
send responses to a number of concurrent requestors: if its slot is
occupied by a DSM handle, it can not send a reply to another backend until
the slot is freed.

With the second option we have all the same problems with not knowing when
to free the DSM and potentially leaking it, but we can handle concurrent
requests.

The current approach where the requester creates and frees the DSM doesn't
suffer from these problems, so if we pre-allocate the segment just big
enough we can avoid the use of shm_mq. That will take another GUC for the
segment size. Certainly no one expects a query plan to weigh a bloody
megabyte, but this is what happens to Pavel apparently.

--
Alex

#68Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#67)
Re: On-demand running query plans using auto_explain and signals

2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Thu, Sep 17, 2015 at 10:13 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Thu, Sep 17, 2015 at 11:16 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

Second, using a shm_mq manipulates the state of the process latch. I
don't think you can make the assumption that it's safe to reset the
process latch at any and every place where we check for interrupts.
For example, suppose the process is already using a shm_mq and the
CHECK_FOR_INTERRUPTS() call inside that code then discovers that
somebody has activated this mechanism and you now go try to send and
receive from a new shm_mq. But even if that and every other
CHECK_FOR_INTERRUPTS() in the code can tolerate a process latch reset
today, it's a new coding rule that could easily trip people up in the
future.

It is valid, and probably most important. But if we introduce own

mechanism,

we will play with process latch too (although we can use LWlocks)

With the design I proposed, there is zero need to touch the process
latch, which is good, because I'm pretty sure that is going to be a
problem. I don't think there is any need to use LWLocks here either.
When you get a request for data, you can just publish a DSM segment
with the data and that's it. Why do you need anything more? You
could set the requestor's latch if it's convenient; that wouldn't be a
problem. But the process supplying the data can't end up in a
different state than it was before supplying that data, or stuff WILL
break.

There is still the whole problem of where exactly the backend being
queried for the status should publish that DSM segment and when to free it?

If it's a location shared between all backends, there should be locking
around it. Probably this is not a big problem, if you don't expect all the
backends start querying each other rapidly. That is how it was implemented
in the first versions of this patch actually.

If we take the per-backend slot approach the locking seems unnecessary and
there are principally two options:

1) The backend puts the DSM handle in its own slot and notifies the
requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and
notifies it).

If we go with the first option, the backend that has created the DSM will
not know when it's OK to free it, so this has to be responsibility of the
requester. If the latter exits before reading and freeing the DSM, we have
a leak. Even bigger is the problem that the sender backend can no longer
send responses to a number of concurrent requestors: if its slot is
occupied by a DSM handle, it can not send a reply to another backend until
the slot is freed.

With the second option we have all the same problems with not knowing when
to free the DSM and potentially leaking it, but we can handle concurrent
requests.

It should not be true - the data sender create DSM and fills it. Then set
caller slot and send signal to caller. Caller can free DSM any time,
because data sender send newer touch it.

The current approach where the requester creates and frees the DSM doesn't
suffer from these problems, so if we pre-allocate the segment just big
enough we can avoid the use of shm_mq. That will take another GUC for the
segment size. Certainly no one expects a query plan to weigh a bloody
megabyte, but this is what happens to Pavel apparently.

It is plan C - last variant from my view. Any other GUC :(

Pavel

Show quoted text

--
Alex

#69Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#68)
Re: On-demand running query plans using auto_explain and signals

On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

If we take the per-backend slot approach the locking seems unnecessary
and there are principally two options:

1) The backend puts the DSM handle in its own slot and notifies the
requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and
notifies it).

If we go with the first option, the backend that has created the DSM will
not know when it's OK to free it, so this has to be responsibility of the
requester. If the latter exits before reading and freeing the DSM, we have
a leak. Even bigger is the problem that the sender backend can no longer
send responses to a number of concurrent requestors: if its slot is
occupied by a DSM handle, it can not send a reply to another backend until
the slot is freed.

With the second option we have all the same problems with not knowing
when to free the DSM and potentially leaking it, but we can handle
concurrent requests.

It should not be true - the data sender create DSM and fills it. Then set
caller slot and send signal to caller. Caller can free DSM any time,
because data sender send newer touch it.

But the requester can timeout on waiting for reply and exit before it sees
the reply DSM. Actually, I now don't even think a backend can free the DSM
it has not created. First it will need to attach it, effectively
increasing the refcount, and upon detach it will only decrease the
refcount, but not actually release the segment...

So this has to be the responsibility of the reply sending backend in the
end: to create and release the DSM *at some point*.

--
Alex

#70Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#69)
Re: On-demand running query plans using auto_explain and signals

2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-18 10:59 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

If we take the per-backend slot approach the locking seems unnecessary
and there are principally two options:

1) The backend puts the DSM handle in its own slot and notifies the
requester to read it.
2) The backend puts the DSM handle in the slot of the requester (and
notifies it).

If we go with the first option, the backend that has created the DSM
will not know when it's OK to free it, so this has to be responsibility of
the requester. If the latter exits before reading and freeing the DSM, we
have a leak. Even bigger is the problem that the sender backend can no
longer send responses to a number of concurrent requestors: if its slot is
occupied by a DSM handle, it can not send a reply to another backend until
the slot is freed.

With the second option we have all the same problems with not knowing
when to free the DSM and potentially leaking it, but we can handle
concurrent requests.

It should not be true - the data sender create DSM and fills it. Then set
caller slot and send signal to caller. Caller can free DSM any time,
because data sender send newer touch it.

But the requester can timeout on waiting for reply and exit before it sees
the reply DSM. Actually, I now don't even think a backend can free the DSM
it has not created. First it will need to attach it, effectively
increasing the refcount, and upon detach it will only decrease the
refcount, but not actually release the segment...

I am afraid so it has not simple and nice solution - when data sender will
wait for to moment when data are received, then we have same complexity
like we use shm_mq.

Isn't better to introduce new background worker with responsibility to
clean orphaned DSM?

Regards

Pavel

Show quoted text

So this has to be the responsibility of the reply sending backend in the
end: to create and release the DSM *at some point*.

--
Alex

#71Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#70)
Re: On-demand running query plans using auto_explain and signals

On Fri, Sep 18, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

It should not be true - the data sender create DSM and fills it. Then
set caller slot and send signal to caller. Caller can free DSM any time,
because data sender send newer touch it.

But the requester can timeout on waiting for reply and exit before it
sees the reply DSM. Actually, I now don't even think a backend can free
the DSM it has not created. First it will need to attach it, effectively
increasing the refcount, and upon detach it will only decrease the
refcount, but not actually release the segment...

I am afraid so it has not simple and nice solution - when data sender will
wait for to moment when data are received, then we have same complexity
like we use shm_mq.

Isn't better to introduce new background worker with responsibility to
clean orphaned DSM?

I'm not thrilled by this idea.

We don't even need a worker to do that, as leaked segments are reported by
the backend itself upon transaction commit (see
ResourceOwnerReleaseInternal), e.g:

WARNING: dynamic shared memory leak: segment 808539725 still referenced

Still I believe relying on some magic cleanup mechanism and not caring
about managing the shared memory properly is not the way to go.

--
Alex

#72Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#71)
Re: On-demand running query plans using auto_explain and signals

2015-09-18 13:22 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Fri, Sep 18, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-18 12:05 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Fri, Sep 18, 2015 at 11:25 AM, Pavel Stehule <pavel.stehule@gmail.com

wrote:

It should not be true - the data sender create DSM and fills it. Then
set caller slot and send signal to caller. Caller can free DSM any time,
because data sender send newer touch it.

But the requester can timeout on waiting for reply and exit before it
sees the reply DSM. Actually, I now don't even think a backend can free
the DSM it has not created. First it will need to attach it, effectively
increasing the refcount, and upon detach it will only decrease the
refcount, but not actually release the segment...

I am afraid so it has not simple and nice solution - when data sender
will wait for to moment when data are received, then we have same
complexity like we use shm_mq.

Isn't better to introduce new background worker with responsibility to
clean orphaned DSM?

I'm not thrilled by this idea.

We don't even need a worker to do that, as leaked segments are reported by
the backend itself upon transaction commit (see
ResourceOwnerReleaseInternal), e.g:

WARNING: dynamic shared memory leak: segment 808539725 still referenced

Still I believe relying on some magic cleanup mechanism and not caring
about managing the shared memory properly is not the way to go.

It was one my idea too, to check shared memory on exit. The disadvantage
is clean - some times you can wait too long - although the very practical
limit for session is about 2h.

Show quoted text

--
Alex

#73Robert Haas
robertmhaas@gmail.com
In reply to: Shulgin, Oleksandr (#63)
Re: On-demand running query plans using auto_explain and signals

On Thu, Sep 17, 2015 at 12:47 PM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

But if we make the sender backend create the DSM with the response payload,
it suddenly becomes really unclear at which point and who should make the
final detach of that DSM. We're getting back to the problem of
synchronization between these processes all over again.

Yes, some thought is needed here. There's not really a problem if
somebody asks for information just once: you could just have the
publishing process keep it attached until process exit, and nothing
bad would happen. The real problem comes when multiple people ask for
information in quick succession: if you detach the old segment too
quickly after publishing it, the guy who requested that data might not
have attached it yet, and then when he gets around to looking at it,
it won't be there.

This seems like a pretty solvable problem, though. For example, when
somebody asks for information, they store in the main shared segment a
pointer to their PGPROC and then they signal the target process, which
responds by publishing a dsm_segment_id. When the requesting process
has accessed it, or when it errors out or exits, it clears the PGPROC
and the dsm_segment_id. The publisher avoids unmapping it until that
happens.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#74Robert Haas
robertmhaas@gmail.com
In reply to: Pavel Stehule (#70)
Re: On-demand running query plans using auto_explain and signals

On Fri, Sep 18, 2015 at 6:59 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

I am afraid so it has not simple and nice solution - when data sender will
wait for to moment when data are received, then we have same complexity like
we use shm_mq.

Isn't better to introduce new background worker with responsibility to clean
orphaned DSM?

That won't work, or at least not easily. On Windows, the DSM is
cleaned up by the operating system as soon as nobody has it mapped.

Frankly, I think you guys are making this out to be way more
complicated than it really is. Basically, I think the process being
queried should publish a DSM via a slot it owns. The recipient is
responsible for reading it and then notifying the sender. If a second
process requests data before the first process reads its data, the
second process can either (1) become an additional reader of the
already-published data or (2) wait for the first process to finish,
and then send its own inquiry.

There are some problems to solve here, but they hardly seem impossible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Robert Haas (#74)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Fri, Sep 18, 2015 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Frankly, I think you guys are making this out to be way more
complicated than it really is. Basically, I think the process being
queried should publish a DSM via a slot it owns. The recipient is
responsible for reading it and then notifying the sender. If a second
process requests data before the first process reads its data, the
second process can either (1) become an additional reader of the
already-published data or (2) wait for the first process to finish,
and then send its own inquiry.

There are some problems to solve here, but they hardly seem impossible.

Thank you Robert, for your invaluable input on this patch!

I now believe that use of ProcessInterrupts() in the recently proposed
design as well as manipulation of proc latch due to use of shared memory
queue are major blockers.

In order to simplify things up, I've taken a step back and looked again at
the auto_explain contrib module. I now propose the most simple thing
possible in my opinion: a new GUC option for auto_explain. It will make
every backend, in which the module is loaded via *_preload_libraries
mechanism or using CREATE EXTENSION/LOAD commands, to actively publish the
plans of queries in dynamic shared memory as long as
auto_explain.publish_plans is turned on.

The greatest part for me, is that this approach doesn't require handling of
signals and is isolated in an external module, so it can be readily used
with the current server versions, no need to wait for >= 9.6.

Some implementation details:

For every backend that might be running (MaxBackends) we reserve a
dsm_handle slot in the addins shared memory. When the new option is turned
on, the ExecutorRun hook will produce a plan in whatever format is
specified by the auto_explain.log_format, allocate a DSM segment, copy the
plan into the segment and finally store the DSM handle into its own slot.
No locking is required around this because every backend writes to its slot
exclusively, no other backend can be writing into it concurrently. In the
ExecutorFinish hook we invalidate the backend's slot by setting the handle
to 0 and deallocate the DSM segment.

Reading of the plan is performed by a newly added function
pg_explain_backend(PID). Since it can determine the target process'
backendId, it simply reads the DSM handle from that backend's slot and
tries to attach it (there's not much point in checking the handle for being
non-0, because the other backend could release the segment the moment after
we've checked it, so we rely on dsm_attach returning non-NULL). If
attached successfully, we parse the contents and detach. At this point the
backend to detach the last is actually releasing the segment, due to
reference counting.

Handling of the nested statements plans is an open question. It can be
really useful when the top-level plan is simply displaying a "SELECT
my_stored_procedure()" and all the interesting stuff is happening behind
the scenes, but I didn't start to think about how this could be implemented
yet.

Pavel was really interested in retrieving the complete query text/plans
which could be over a megabyte in his case (and pg_stat_activity.query is
capped by 10240 bytes I believe). This is now possible with the patch, but
some others might still want to put a threshold on the allocation,
especially given this is shared memory. I can envision another GUC, but in
our experience the extremely long queries (and plans) are most of the time
due to use of VALUES() or IN() clauses with a huge list of literals.

I think we could fold the VALUES/IN into "?" if the query/plan text exceeds
the specified threshold, or unconditionally (yes, pg_stat_statements, I'm
looking at you). This should help in the cases when the most interesting
part is in the plan nodes near the end, but there's such a huge list of
literals before it.

Future plans:

I believe this approach can be extended to enable instrumentation once
again. The backend executing the query could update the published plan
every once in a while (for example, every N ms or 1% of rows processed in a
node), and any other process interested in this data, can simply read it
without the need for signals and complex and fragile communication. This
obviously requires a core patch.

Some problems:

There is a potential problem with the limited total number of DSM segments:
it is reserved in a way to only allow 2 per backend (on average) and 64
additional per server, so if you run with the new option enabled at all
times, you're down to only 1 additional DSM per backend (again, on
average). Not sure how critical this can be, but no one is forced to run
with this option enabled for all backends.

If you don't want to run it enabled at all times, then enabling the GUC
per-backend can be problematic. It's still possible to update the conf
file and send SIGHUP to a single backend, but it's harder to accomplish
over psql, for example. I think here we might still have some luck with
the signals: use another array of per-backend slots with flags, set the
target backend's flag and send it SIGUSR1. The backend wakes on the signal
and examines its slot, then toggles the GUC if needed. Sounds pretty safe,
eh?

No documentation changes yet, waiting for your comments. :-)

Happy hacking!
--
Alex

Attachments:

0001-Add-auto_explain.publish_plans.patchtext/x-patch; charset=US-ASCII; name=0001-Add-auto_explain.publish_plans.patchDownload
From 9f192d17e4674b2a042ca435ad5219f59722cd7a Mon Sep 17 00:00:00 2001
From: Oleksandr Shulgin <oleksandr.shulgin@zalando.de>
Date: Fri, 25 Sep 2015 17:29:34 +0200
Subject: [PATCH] Add auto_explain.publish_plans

This adds an optional feature to publish the plans of every query being
executed into per-backend dynamic shared memory segment.  The plans can then be
examined by other backends by use of newly added pg_explain_backend() function.
The process ID that is passed to the function must be this server's backend PID
and the caller must be either superuser or the same role that the other backend
is running under.

In order to support addition of the SQL-callable function, EXTENSION packaging
is provided for auto_explain.
---
 contrib/auto_explain/Makefile              |   2 +
 contrib/auto_explain/auto_explain--1.0.sql |  10 +
 contrib/auto_explain/auto_explain.c        | 326 ++++++++++++++++++++++++++---
 contrib/auto_explain/auto_explain.control  |   5 +
 4 files changed, 315 insertions(+), 28 deletions(-)
 create mode 100644 contrib/auto_explain/auto_explain--1.0.sql
 create mode 100644 contrib/auto_explain/auto_explain.control

diff --git a/contrib/auto_explain/Makefile b/contrib/auto_explain/Makefile
index fcf36c5..b6057a3 100644
--- a/contrib/auto_explain/Makefile
+++ b/contrib/auto_explain/Makefile
@@ -3,6 +3,8 @@
 MODULE_big = auto_explain
 OBJS = auto_explain.o $(WIN32RES)
 PGFILEDESC = "auto_explain - logging facility for execution plans"
+EXTENSION = auto_explain
+DATA = auto_explain--1.0.sql
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/auto_explain/auto_explain--1.0.sql b/contrib/auto_explain/auto_explain--1.0.sql
new file mode 100644
index 0000000..6059949
--- /dev/null
+++ b/contrib/auto_explain/auto_explain--1.0.sql
@@ -0,0 +1,10 @@
+/* contrib/auto_explain/auto_explain--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION auto_explain" to load this file. \quit
+
+-- Register functions.
+CREATE FUNCTION pg_explain_backend(INT)
+RETURNS SETOF TEXT
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 2a184ed..0fa123f 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -14,9 +14,18 @@
 
 #include <limits.h>
 
+#include "access/htup_details.h"
 #include "commands/explain.h"
 #include "executor/instrument.h"
+#include "miscadmin.h"
+#include "storage/dsm.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -29,6 +38,7 @@ static bool auto_explain_log_triggers = false;
 static bool auto_explain_log_timing = true;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_publish_plans = false;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -38,10 +48,23 @@ static const struct config_enum_entry format_options[] = {
 	{NULL, 0, false}
 };
 
+
+/* Shared memory structs */
+static dsm_handle *published_plan_handles = NULL;
+
+#define MaxPublishedPlans		MaxBackends
+#define explain_shmem_size()	(sizeof(dsm_handle) * MaxPublishedPlans)
+
+static dsm_segment	   *MyPublishedPlanSegment = NULL;
+static dsm_handle	   *MyPublishedPlanHandle = NULL;
+
+
 /* Current nesting depth of ExecutorRun calls */
 static int	nesting_level = 0;
 
 /* Saved hook values in case of unload */
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+
 static ExecutorStart_hook_type prev_ExecutorStart = NULL;
 static ExecutorRun_hook_type prev_ExecutorRun = NULL;
 static ExecutorFinish_hook_type prev_ExecutorFinish = NULL;
@@ -51,15 +74,28 @@ static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;
 	(auto_explain_log_min_duration >= 0 && \
 	 (nesting_level == 0 || auto_explain_log_nested_statements))
 
+#define publish_explain_enabled() \
+	(auto_explain_publish_plans && \
+	 nesting_level == 0)	/* XXX what do we do with the nested statements? */
+
 void		_PG_init(void);
 void		_PG_fini(void);
 
+PG_FUNCTION_INFO_V1(pg_explain_backend);
+
+static void explain_shmem_startup(void);
+static void attach_published_plan_handles(void);
+
 static void explain_ExecutorStart(QueryDesc *queryDesc, int eflags);
 static void explain_ExecutorRun(QueryDesc *queryDesc,
 					ScanDirection direction,
 					long count);
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
+static StringInfo explain_query(QueryDesc *queryDesc);
+
+static void explain_publish_plan(QueryDesc *queryDesc);
+static void explain_unpublish_my_plan(void);
 
 
 /*
@@ -159,9 +195,26 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomBoolVariable("auto_explain.publish_plans",
+							 "Publish plan of every executed statement in shared memory.",
+							 NULL,
+							 &auto_explain_publish_plans,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
+	if (auto_explain_publish_plans)
+		RequestAddinShmemSpace(explain_shmem_size());
+
 	/* Install hooks. */
+	prev_shmem_startup_hook = shmem_startup_hook;
+	shmem_startup_hook = explain_shmem_startup;
+
 	prev_ExecutorStart = ExecutorStart_hook;
 	ExecutorStart_hook = explain_ExecutorStart;
 	prev_ExecutorRun = ExecutorRun_hook;
@@ -183,8 +236,42 @@ _PG_fini(void)
 	ExecutorRun_hook = prev_ExecutorRun;
 	ExecutorFinish_hook = prev_ExecutorFinish;
 	ExecutorEnd_hook = prev_ExecutorEnd;
+
+	shmem_startup_hook = prev_shmem_startup_hook;
 }
 
+
+static void
+explain_shmem_startup(void)
+{
+	if (prev_shmem_startup_hook)
+		prev_shmem_startup_hook();
+
+	attach_published_plan_handles();
+}
+
+static void
+attach_published_plan_handles(void)
+{
+	bool		found;
+
+	if (auto_explain_publish_plans && published_plan_handles == NULL)
+	{
+		LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+
+		published_plan_handles =
+			(dsm_handle *) ShmemInitStruct("auto_explain.published_plan_handles",
+										   explain_shmem_size(), &found);
+		if (!found)
+			MemSet(published_plan_handles, 0, explain_shmem_size());
+
+		MyPublishedPlanHandle = &published_plan_handles[MyBackendId];
+
+		LWLockRelease(AddinShmemInitLock);
+	}
+}
+
+
 /*
  * ExecutorStart hook: start up logging if needed
  */
@@ -234,6 +321,11 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 static void
 explain_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
 {
+	if (publish_explain_enabled())
+	{
+		explain_publish_plan(queryDesc);
+	}
+
 	nesting_level++;
 	PG_TRY();
 	{
@@ -257,6 +349,11 @@ explain_ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
 static void
 explain_ExecutorFinish(QueryDesc *queryDesc)
 {
+	if (MyPublishedPlanSegment)
+	{
+		explain_unpublish_my_plan();
+	}
+
 	nesting_level++;
 	PG_TRY();
 	{
@@ -294,32 +391,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		msec = queryDesc->totaltime->total * 1000.0;
 		if (msec >= auto_explain_log_min_duration)
 		{
-			ExplainState *es = NewExplainState();
-
-			es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
-			es->verbose = auto_explain_log_verbose;
-			es->buffers = (es->analyze && auto_explain_log_buffers);
-			es->timing = (es->analyze && auto_explain_log_timing);
-			es->summary = es->analyze;
-			es->format = auto_explain_log_format;
-
-			ExplainBeginOutput(es);
-			ExplainQueryText(es, queryDesc);
-			ExplainPrintPlan(es, queryDesc);
-			if (es->analyze && auto_explain_log_triggers)
-				ExplainPrintTriggers(es, queryDesc);
-			ExplainEndOutput(es);
-
-			/* Remove last line break */
-			if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
-				es->str->data[--es->str->len] = '\0';
-
-			/* Fix JSON to output an object */
-			if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
-			{
-				es->str->data[0] = '{';
-				es->str->data[es->str->len - 1] = '}';
-			}
+			StringInfo		str = explain_query(queryDesc);
 
 			/*
 			 * Note: we rely on the existing logging of context or
@@ -329,10 +401,11 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			 */
 			ereport(LOG,
 					(errmsg("duration: %.3f ms  plan:\n%s",
-							msec, es->str->data),
+							msec, str->data),
 					 errhidestmt(true)));
 
-			pfree(es->str->data);
+			pfree(str->data);
+			pfree(str);
 		}
 	}
 
@@ -341,3 +414,200 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 	else
 		standard_ExecutorEnd(queryDesc);
 }
+
+static StringInfo
+explain_query(QueryDesc *queryDesc)
+{
+	StringInfo		result;
+	ExplainState   *es = NewExplainState();
+
+	es->analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
+	es->verbose = auto_explain_log_verbose;
+	es->buffers = (es->analyze && auto_explain_log_buffers);
+	es->timing = (es->analyze && auto_explain_log_timing);
+	es->summary = es->analyze;
+	es->format = auto_explain_log_format;
+
+	ExplainBeginOutput(es);
+	ExplainQueryText(es, queryDesc);
+	ExplainPrintPlan(es, queryDesc);
+	if (es->analyze && auto_explain_log_triggers)
+		ExplainPrintTriggers(es, queryDesc);
+	ExplainEndOutput(es);
+
+	/* Remove last line break */
+	if (es->str->len > 0 && es->str->data[es->str->len - 1] == '\n')
+		es->str->data[--es->str->len] = '\0';
+
+	/* Fix JSON to output an object */
+	if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
+	{
+		es->str->data[0] = '{';
+		es->str->data[es->str->len - 1] = '}';
+	}
+
+	result = es->str;
+	pfree(es);
+
+	return result;
+}
+
+static void
+explain_publish_plan(QueryDesc *queryDesc)
+{
+	StringInfo		str = explain_query(queryDesc);
+	dsm_segment	   *dsm;
+
+	attach_published_plan_handles();
+
+	dsm = dsm_create(str->len + 1, DSM_CREATE_NULL_IF_MAXSEGMENTS);
+	if (dsm)
+	{
+		char	   *buffer = (char *) dsm_segment_address(dsm);
+		dsm_handle	handle;
+
+		memcpy(buffer, str->data, str->len + 1);
+
+		MyPublishedPlanSegment = dsm;
+
+		handle = dsm_segment_handle(dsm);
+		*MyPublishedPlanHandle = handle;
+
+		elog(DEBUG5, "published explain plan in a DSM segment: %d", handle);
+	}
+	else
+		elog(LOG, "explain_publish_plan: out of dynamic shared memory, needed: %d",
+			 str->len + 1);
+
+	pfree(str->data);
+	pfree(str);
+}
+
+static void
+explain_unpublish_my_plan(void)
+{
+	Assert(MyPublishedPlanSegment);
+	Assert(MyPublishedPlanHandle);
+
+	elog(DEBUG5, "unpublishing explain plan in a DSM segment: %d", *MyPublishedPlanHandle);
+
+	*MyPublishedPlanHandle = 0;
+
+	dsm_detach(MyPublishedPlanSegment);
+	MyPublishedPlanSegment = NULL;
+}
+
+
+/*
+ * Retrieve the plan of a statement currently running in a server backend
+ * specified by the process ID.
+ *
+ * FUNCTION pg_explain_backend(pid)
+ * RETURNS SETOF text
+ */
+Datum
+pg_explain_backend(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo	   *rsinfo	= (ReturnSetInfo *) fcinfo->resultinfo;
+	Tuplestorestate	   *tupstore;
+	TupleDesc			tupdesc;
+	MemoryContext		per_query_ctx;
+	MemoryContext		oldcontext;
+	pid_t				target_pid = (pid_t) PG_GETARG_INT32(0);
+	PGPROC			   *target_proc;
+	dsm_handle			handle = 0;
+	dsm_segment		   *seg = NULL;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not allowed in this context")));
+
+	/* verify access to target_pid */
+	target_proc = BackendPidGetProc(target_pid);
+
+	if (target_proc == NULL)
+		ereport(ERROR,
+				(errmsg("PID %d is not a PostgreSQL server process", target_pid)));
+
+	if (!(superuser() || target_proc->roleId == GetUserId()))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser or have the same role to explain queries running in other server processes"))));
+
+	attach_published_plan_handles();
+	if (published_plan_handles == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot attach to published plans shared memory"),
+				 errhint("is auto_explain.publish_plans turned on?")));
+
+	/* need to build tuplestore in query context */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupdesc = CreateTupleDescCopy(rsinfo->expectedDesc);
+	tupstore = tuplestore_begin_heap(false, false, work_mem);
+	MemoryContextSwitchTo(oldcontext);
+
+	if (target_pid == MyProcPid)
+	{
+		/* cannot attach to our own segment, but we know where to look */
+		seg = MyPublishedPlanSegment;
+	}
+	else
+	{
+		handle = published_plan_handles[target_proc->backendId];
+		seg =  dsm_attach(handle);
+	}
+	if (seg != NULL)
+	{
+		PG_TRY();
+		{
+			const char	   *p = (const char *) dsm_segment_address(seg);
+
+			/* break into tuples on newline */
+			while (*p)
+			{
+				const char *q = strchr(p, '\n');
+				Size		len = q ? q - p : strlen(p);
+				Datum		value;
+				HeapTuple	tuple;
+				bool		isnull = false;
+
+				value = PointerGetDatum(cstring_to_text_with_len(p, len));
+
+				tuple = heap_form_tuple(tupdesc, &value, &isnull);
+				tuplestore_puttuple(tupstore, tuple);
+
+				if (!q)
+					break;
+
+				p += len + 1;
+			}
+
+			/* detach DSM, but only if it's not our own one */
+			if (handle)
+				dsm_detach(seg);
+		}
+		PG_CATCH();
+		{
+			if (handle)
+				dsm_detach(seg);
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+	}
+
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	return (Datum) 0;
+}
diff --git a/contrib/auto_explain/auto_explain.control b/contrib/auto_explain/auto_explain.control
new file mode 100644
index 0000000..bd92c71
--- /dev/null
+++ b/contrib/auto_explain/auto_explain.control
@@ -0,0 +1,5 @@
+# auto_explain extension
+comment = 'logging facility for execution plans'
+default_version = '1.0'
+module_pathname = '$libdir/auto_explain'
+relocatable = true
-- 
2.1.4

#76Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#75)
Re: On-demand running query plans using auto_explain and signals

2015-09-25 19:13 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

Some problems:

There is a potential problem with the limited total number of DSM
segments: it is reserved in a way to only allow 2 per backend (on average)
and 64 additional per server, so if you run with the new option enabled at
all times, you're down to only 1 additional DSM per backend (again, on
average). Not sure how critical this can be, but no one is forced to run
with this option enabled for all backends.

If you don't want to run it enabled at all times, then enabling the GUC
per-backend can be problematic. It's still possible to update the conf
file and send SIGHUP to a single backend, but it's harder to accomplish
over psql, for example. I think here we might still have some luck with
the signals: use another array of per-backend slots with flags, set the
target backend's flag and send it SIGUSR1. The backend wakes on the signal
and examines its slot, then toggles the GUC if needed. Sounds pretty safe,
eh?

No documentation changes yet, waiting for your comments. :-)

the preparing of content before execution is interesting idea, that can be
used more. The almost queries and plans are not too big, so when the size
of content is not too big - less than 1MB, then can be used one DSM for all
backends. When size of content is bigger than limit, then DSM will be
allocated specially for this content. The pointer to DSM and offset can be
stored in requested process slot. The reading and writing to requested slot
should be protected by spinlock, but it should block only two related
processes for short time (copy memory). Other possibility is showing the
size of content in requested process slot. Then the requester can
preallocate enough size of shared memory. But this doesn't solve a issues
related to waiting requester for content. So first variant is pretty
simple, and should be preferred. The disadvantage is clear - it can enforce
maybe significant slowdown of fast queries.

Regards

Pavel

Show quoted text

Happy hacking!
--
Alex

#77Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#76)
Re: On-demand running query plans using auto_explain and signals

On Sun, Sep 27, 2015 at 8:05 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

the preparing of content before execution is interesting idea, that can be

used more. The almost queries and plans are not too big, so when the size
of content is not too big - less than 1MB, then can be used one DSM for all
backends.

When size of content is bigger than limit, then DSM will be allocated
specially for this content. The pointer to DSM and offset can be stored in
requested process slot. The reading and writing to requested slot should be
protected by spinlock, but it should block only two related processes for
short time (copy memory).

Sorry, I don't think this will fly.

The whole idea is that a backend publishes the plan of a query just before
running it and it doesn't care which other backend(s) might be reading it,
how many times and in which order. The only required locking (implicit) is
contained in the code for dsm_attach/detach().

Other possibility is showing the size of content in requested process
slot. Then the requester can preallocate enough size of shared memory. But
this doesn't solve a issues related to waiting requester for content. So
first variant is pretty simple, and should be preferred. The disadvantage
is clear - it can enforce maybe significant slowdown of fast queries.

Both of these approaches have just too many synchronization problems, IMO.

--
Alex

#78Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#77)
Re: On-demand running query plans using auto_explain and signals

2015-09-28 12:01 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Sun, Sep 27, 2015 at 8:05 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

the preparing of content before execution is interesting idea, that can be

used more. The almost queries and plans are not too big, so when the size
of content is not too big - less than 1MB, then can be used one DSM for all
backends.

When size of content is bigger than limit, then DSM will be allocated
specially for this content. The pointer to DSM and offset can be stored in
requested process slot. The reading and writing to requested slot should be
protected by spinlock, but it should block only two related processes for
short time (copy memory).

Sorry, I don't think this will fly.

The whole idea is that a backend publishes the plan of a query just before
running it and it doesn't care which other backend(s) might be reading it,
how many times and in which order. The only required locking (implicit) is
contained in the code for dsm_attach/detach().

I didn't propose too different solution. There is only one difference -
sharing DSM for smaller data. It is similar to using usual shared memory.

Regards

Pavel

Show quoted text

Other possibility is showing the size of content in requested process
slot. Then the requester can preallocate enough size of shared memory. But
this doesn't solve a issues related to waiting requester for content. So
first variant is pretty simple, and should be preferred. The disadvantage
is clear - it can enforce maybe significant slowdown of fast queries.

Both of these approaches have just too many synchronization problems, IMO.

--
Alex

#79Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#78)
Re: On-demand running query plans using auto_explain and signals

On Mon, Sep 28, 2015 at 12:05 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-28 12:01 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Sun, Sep 27, 2015 at 8:05 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

the preparing of content before execution is interesting idea, that can

be used more. The almost queries and plans are not too big, so when the
size of content is not too big - less than 1MB, then can be used one DSM
for all backends.

When size of content is bigger than limit, then DSM will be allocated
specially for this content. The pointer to DSM and offset can be stored in
requested process slot. The reading and writing to requested slot should be
protected by spinlock, but it should block only two related processes for
short time (copy memory).

Sorry, I don't think this will fly.

The whole idea is that a backend publishes the plan of a query just
before running it and it doesn't care which other backend(s) might be
reading it, how many times and in which order. The only required locking
(implicit) is contained in the code for dsm_attach/detach().

I didn't propose too different solution. There is only one difference -
sharing DSM for smaller data. It is similar to using usual shared memory.

Does this mean implementing some sort of allocator on top of the shared
memory segment? If so, how are you going to prevent fragmentation?

--
Alex

#80Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#75)
1 attachment(s)
Re: On-demand running query plans using auto_explain and signals

On Fri, Sep 25, 2015 at 7:13 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

Some implementation details:

For every backend that might be running (MaxBackends) we reserve a
dsm_handle slot in the addins shared memory. When the new option is turned
on, the ExecutorRun hook will produce a plan in whatever format is
specified by the auto_explain.log_format, allocate a DSM segment, copy the
plan into the segment and finally store the DSM handle into its own slot.
No locking is required around this because every backend writes to its slot
exclusively, no other backend can be writing into it concurrently. In the
ExecutorFinish hook we invalidate the backend's slot by setting the handle
to 0 and deallocate the DSM segment.

And this patch adds back the use of table of contents on the DSM.
Benefits: magic number validation; communication of the actual plan size.

--
Alex

Attachments:

0002-Add-SHM-table-of-contents-to-the-explain-DSM.patchtext/x-patch; charset=US-ASCII; name=0002-Add-SHM-table-of-contents-to-the-explain-DSM.patchDownload
From 9294995e200fca2abdccf0825ef852f26df1f4a2 Mon Sep 17 00:00:00 2001
From: Oleksandr Shulgin <oleksandr.shulgin@zalando.de>
Date: Mon, 28 Sep 2015 16:20:42 +0200
Subject: [PATCH 2/2] Add SHM table of contents to the explain DSM

By doing so we achieve two things:

1) verify the actual contents of the segment we've attached to, by
enforcing the magic number check;

2) transmit the actual size of the payload, so the receiving side can
allocate and parse in process-local memory, detaching from the DSM
early.
---
 contrib/auto_explain/auto_explain.c | 59 ++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 8 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 0fa123f..ac6b750 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -23,6 +23,7 @@
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/shmem.h"
+#include "storage/shm_toc.h"
 #include "utils/guc.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -48,6 +49,7 @@ static const struct config_enum_entry format_options[] = {
 	{NULL, 0, false}
 };
 
+#define PG_EXPLAIN_BACKEND_MAGIC	0x79fb2449
 
 /* Shared memory structs */
 static dsm_handle *published_plan_handles = NULL;
@@ -455,18 +457,40 @@ explain_query(QueryDesc *queryDesc)
 static void
 explain_publish_plan(QueryDesc *queryDesc)
 {
-	StringInfo		str = explain_query(queryDesc);
+	StringInfo		str;
+	Size			plan_size;
+	shm_toc_estimator	toc_estimator;
+	Size				seg_size;
 	dsm_segment	   *dsm;
+	shm_toc		   *toc = NULL;
 
 	attach_published_plan_handles();
 
-	dsm = dsm_create(str->len + 1, DSM_CREATE_NULL_IF_MAXSEGMENTS);
+	str = explain_query(queryDesc);
+	plan_size = str->len + 1;
+
+	shm_toc_initialize_estimator(&toc_estimator);
+	shm_toc_estimate_keys(&toc_estimator, 2);	/* size, plan */
+	shm_toc_estimate_chunk(&toc_estimator, sizeof(Size));
+	shm_toc_estimate_chunk(&toc_estimator, plan_size);
+	seg_size = shm_toc_estimate(&toc_estimator);
+
+	dsm = dsm_create(seg_size, DSM_CREATE_NULL_IF_MAXSEGMENTS);
 	if (dsm)
 	{
-		char	   *buffer = (char *) dsm_segment_address(dsm);
+		Size	   *size_ptr;
+		char	   *buffer;
 		dsm_handle	handle;
 
-		memcpy(buffer, str->data, str->len + 1);
+		toc = shm_toc_create(PG_EXPLAIN_BACKEND_MAGIC, dsm_segment_address(dsm), seg_size);
+
+		size_ptr = (Size *) shm_toc_allocate(toc, sizeof(Size));
+		*size_ptr = plan_size;
+		shm_toc_insert(toc, 0, size_ptr);
+
+		buffer = (char *) shm_toc_allocate(toc, plan_size);
+		memcpy(buffer, str->data, plan_size);
+		shm_toc_insert(toc, 1, buffer);
 
 		MyPublishedPlanSegment = dsm;
 
@@ -570,9 +594,30 @@ pg_explain_backend(PG_FUNCTION_ARGS)
 	{
 		PG_TRY();
 		{
-			const char	   *p = (const char *) dsm_segment_address(seg);
+			shm_toc		   *toc;
+			Size			plan_size;
+			char		   *buffer;
+			const char	   *p;
+
+			toc = shm_toc_attach(PG_EXPLAIN_BACKEND_MAGIC, dsm_segment_address(seg));
+			if (toc == NULL)
+				ereport(ERROR,
+						(errmsg("bad magic in dynamic memory segment")));
+
+			plan_size = *((Size *) shm_toc_lookup(toc, 0));
+
+			buffer = palloc(plan_size);
+			memcpy(buffer, shm_toc_lookup(toc, 1), plan_size);
+
+			/* we can detach DSM already, but only if it's not our own one */
+			if (handle)
+			{
+				handle = 0;		/* avoid possible extra detach in PG_CATCH */
+				dsm_detach(seg);
+			}
 
 			/* break into tuples on newline */
+			p = buffer;
 			while (*p)
 			{
 				const char *q = strchr(p, '\n');
@@ -592,9 +637,7 @@ pg_explain_backend(PG_FUNCTION_ARGS)
 				p += len + 1;
 			}
 
-			/* detach DSM, but only if it's not our own one */
-			if (handle)
-				dsm_detach(seg);
+			pfree(buffer);
 		}
 		PG_CATCH();
 		{
-- 
2.1.4

#81Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#79)
Re: On-demand running query plans using auto_explain and signals

2015-09-28 12:37 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Mon, Sep 28, 2015 at 12:05 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-28 12:01 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

On Sun, Sep 27, 2015 at 8:05 AM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

the preparing of content before execution is interesting idea, that can

be used more. The almost queries and plans are not too big, so when the
size of content is not too big - less than 1MB, then can be used one DSM
for all backends.

When size of content is bigger than limit, then DSM will be allocated
specially for this content. The pointer to DSM and offset can be stored in
requested process slot. The reading and writing to requested slot should be
protected by spinlock, but it should block only two related processes for
short time (copy memory).

Sorry, I don't think this will fly.

The whole idea is that a backend publishes the plan of a query just
before running it and it doesn't care which other backend(s) might be
reading it, how many times and in which order. The only required locking
(implicit) is contained in the code for dsm_attach/detach().

I didn't propose too different solution. There is only one difference -
sharing DSM for smaller data. It is similar to using usual shared memory.

Does this mean implementing some sort of allocator on top of the shared
memory segment? If so, how are you going to prevent fragmentation?

yes, simple memory allocator is necessary in this case. But it should be
really simple - you can allocate only fixed size blocks - 10KB, 100KB and
1MB from separate buffers. So the fragmentation is not possible.

Regards

Pavel

Show quoted text

--
Alex

#82Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Pavel Stehule (#81)
Re: On-demand running query plans using auto_explain and signals

On Mon, Sep 28, 2015 at 7:09 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-28 12:37 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

I didn't propose too different solution. There is only one difference -
sharing DSM for smaller data. It is similar to using usual shared memory.

Does this mean implementing some sort of allocator on top of the shared
memory segment? If so, how are you going to prevent fragmentation?

yes, simple memory allocator is necessary in this case. But it should be
really simple - you can allocate only fixed size blocks - 10KB, 100KB and
1MB from separate buffers. So the fragmentation is not possible.

Maybe we're talking about completely different ideas, but I don't see how
fixing the block helps to fight fragmentation, in particular. Can you
sketch a simple example? E.g. 400 backends share the common segment and
all of them want to publish a plan of ~1kb, for example. Now what?

--
Alex

#83Pavel Stehule
pavel.stehule@gmail.com
In reply to: Shulgin, Oleksandr (#82)
Re: On-demand running query plans using auto_explain and signals

2015-09-28 19:13 GMT+02:00 Shulgin, Oleksandr <oleksandr.shulgin@zalando.de>
:

On Mon, Sep 28, 2015 at 7:09 PM, Pavel Stehule <pavel.stehule@gmail.com>
wrote:

2015-09-28 12:37 GMT+02:00 Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de>:

I didn't propose too different solution. There is only one difference -
sharing DSM for smaller data. It is similar to using usual shared memory.

Does this mean implementing some sort of allocator on top of the shared
memory segment? If so, how are you going to prevent fragmentation?

yes, simple memory allocator is necessary in this case. But it should be
really simple - you can allocate only fixed size blocks - 10KB, 100KB and
1MB from separate buffers. So the fragmentation is not possible.

Maybe we're talking about completely different ideas, but I don't see how
fixing the block helps to fight fragmentation, in particular. Can you
sketch a simple example? E.g. 400 backends share the common segment and
all of them want to publish a plan of ~1kb, for example. Now what?

Probably only few backends will be data requesters and few backends will be
requested for data. The size of shared data will be typically less than
10KB, with few exceptions 100KB, and few exceptions 10MB. So for 400
backends you need 400*10KB+100*100KB+20*1MB = 34MB. You can have three
independent buffers for this sizes. If some process require 5KB then you
returns 10KB (it is good enough) from first buffer. This simple mechanism
can coverage 90% of usage. for other 10% you can create new DSM. Because
you are working with fixed size blocks and you don't divide the size of
block, then the fragmentation isn't possible. This model allow very fast
allocation, very fast deallocation.

It is not terribly effective, but 34MB is better than 400 DSM or than 400MB
(max size for any backend).

Our super long queries are terribly long > 1MB, but we use a few clients
(less than 2x CPU cores)

Regards

Pavel

Show quoted text

--
Alex

#84Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Shulgin, Oleksandr (#69)
Re: On-demand running query plans using auto_explain and signals

On 9/18/15 5:05 AM, Shulgin, Oleksandr wrote:

It should not be true - the data sender create DSM and fills it.
Then set caller slot and send signal to caller. Caller can free DSM
any time, because data sender send newer touch it.

But the requester can timeout on waiting for reply and exit before it
sees the reply DSM. Actually, I now don't even think a backend can free
the DSM it has not created. First it will need to attach it,
effectively increasing the refcount, and upon detach it will only
decrease the refcount, but not actually release the segment...

So this has to be the responsibility of the reply sending backend in the
end: to create and release the DSM *at some point*.

What's wrong with just releasing it at the end of the statement? When
the statement is done there's no point to reading it asynchronously anymore.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#85Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Jim Nasby (#84)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 29, 2015 at 12:28 AM, Jim Nasby <Jim.Nasby@bluetreble.com>
wrote:

On 9/18/15 5:05 AM, Shulgin, Oleksandr wrote:

So this has to be the responsibility of the reply sending backend in the
end: to create and release the DSM *at some point*.

What's wrong with just releasing it at the end of the statement? When the
statement is done there's no point to reading it asynchronously anymore.

That was only one of the problems in signals-based design, and it has other
more significant problems. The current approach does exactly that:
allocate the segment before running the plan, release at the end of the
statement.

--
Alex

#86Andres Freund
andres@anarazel.de
In reply to: Shulgin, Oleksandr (#75)
Re: On-demand running query plans using auto_explain and signals

On 2015-09-25 19:13:13 +0200, Shulgin, Oleksandr wrote:

the auto_explain contrib module. I now propose the most simple thing
possible in my opinion: a new GUC option for auto_explain. It will make
every backend, in which the module is loaded via *_preload_libraries
mechanism or using CREATE EXTENSION/LOAD commands, to actively publish the
plans of queries in dynamic shared memory as long as
auto_explain.publish_plans is turned on.

Wait. You're proposing that every query does this unconditionally? For
every single query? That sounds prohibitively expensive to me.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Andres Freund (#86)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 29, 2015 at 12:57 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-09-25 19:13:13 +0200, Shulgin, Oleksandr wrote:

the auto_explain contrib module. I now propose the most simple thing
possible in my opinion: a new GUC option for auto_explain. It will make
every backend, in which the module is loaded via *_preload_libraries
mechanism or using CREATE EXTENSION/LOAD commands, to actively publish

the

plans of queries in dynamic shared memory as long as
auto_explain.publish_plans is turned on.

Wait. You're proposing that every query does this unconditionally? For
every single query? That sounds prohibitively expensive to me.

Only if the extension is loaded AND the option is turned on. Likely you
don't want to do this for an OLTP database, but for other types of load it
might make sense. Also, I think it should be possible to toggle this on a
per-process basis.

Finally, we can put in a cost threshold, so the plans only get published if
they have total_cost >= publish_plan_cost_threshod.

Also, do you mean expensive in terms of CPU or the dynamic shared memory?

--
Alex

#88Simon Riggs
simon@2ndQuadrant.com
In reply to: Shulgin, Oleksandr (#75)
Re: On-demand running query plans using auto_explain and signals

On 25 September 2015 at 12:13, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

I now believe that use of ProcessInterrupts() in the recently proposed
design as well as manipulation of proc latch due to use of shared memory
queue are major blockers.

In order to simplify things up, I've taken a step back and looked again at
the auto_explain contrib module. I now propose the most simple thing
possible in my opinion: a new GUC option for auto_explain. It will make
every backend, in which the module is loaded via *_preload_libraries
mechanism or using CREATE EXTENSION/LOAD commands, to actively publish the
plans of queries in dynamic shared memory as long as
auto_explain.publish_plans is turned on.

The greatest part for me, is that this approach doesn't require handling
of signals and is isolated in an external module, so it can be readily used
with the current server versions, no need to wait for >= 9.6.

This is a major change of direction, so the thread title no longer applies
at all.

The requirement is to be able to see the output of EXPLAIN ANALYZE for a
running process. Ideally, we would dump the diagnostic output for any
process that runs longer than a specific timeout, so we can decide whether
to kill it, or just save that for later diagnostics.

I'm interested in helping the original goal happen. Dumping an EXPLAIN,
without ANALYZE info, is not at all interesting since it has no value for
immediate action or post-facto diagnosis, sorry to say - making it happen
for every backend just makes it even less useful.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#89Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Simon Riggs (#88)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 29, 2015 at 7:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 25 September 2015 at 12:13, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

I now believe that use of ProcessInterrupts() in the recently proposed
design as well as manipulation of proc latch due to use of shared memory
queue are major blockers.

In order to simplify things up, I've taken a step back and looked again
at the auto_explain contrib module. I now propose the most simple thing
possible in my opinion: a new GUC option for auto_explain. It will make
every backend, in which the module is loaded via *_preload_libraries
mechanism or using CREATE EXTENSION/LOAD commands, to actively publish the
plans of queries in dynamic shared memory as long as
auto_explain.publish_plans is turned on.

The greatest part for me, is that this approach doesn't require handling
of signals and is isolated in an external module, so it can be readily used
with the current server versions, no need to wait for >= 9.6.

This is a major change of direction, so the thread title no longer applies
at all.

The requirement is to be able to see the output of EXPLAIN ANALYZE for a
running process. Ideally, we would dump the diagnostic output for any
process that runs longer than a specific timeout, so we can decide whether
to kill it, or just save that for later diagnostics.

I'm interested in helping the original goal happen. Dumping an EXPLAIN,
without ANALYZE info, is not at all interesting since it has no value for
immediate action or post-facto diagnosis, sorry to say - making it happen
for every backend just makes it even less useful.

This is not a change of the direction, but rather of the approach. Hitting
a process with a signal and hoping it will produce a meaningful response in
all circumstances without disrupting its current task was way too naive.
I'd really want to make this work with ANALYZE, just not as the first
step. I believe this approach can be extended to include instrumentation
support (obviously we will not be able to contain this in the auto_explain
module).

--
Alex

#90Simon Riggs
simon@2ndQuadrant.com
In reply to: Shulgin, Oleksandr (#89)
Re: On-demand running query plans using auto_explain and signals

On 29 September 2015 at 12:52, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

Hitting a process with a signal and hoping it will produce a meaningful
response in all circumstances without disrupting its current task was way
too naive.

Hmm, I would have to disagree, sorry. For me the problem was dynamically
allocating everything at the time the signal is received and getting into
problems when that caused errors.

* INIT - Allocate N areas of memory for use by queries, which can be
expanded/contracted as needed. Keep a freelist of structures.
* OBSERVER - When requested, gain exclusive access to a diagnostic area,
then allocate the designated process to that area, then send a signal
* QUERY - When signal received dump an EXPLAIN ANALYZE to the allocated
diagnostic area, (set flag to show complete, set latch on observer)
* OBSERVER - process data in diagnostic area and then release area for use
by next observation

If the EXPLAIN ANALYZE doesn't fit into the diagnostic chunk, LOG it as a
problem and copy data only up to the size defined. Any other ERRORs that
are caused by this process cause it to fail normally.

That allows the observer to be another backend, or it allows the query
process to perform self-observation based upon a timeout (e.g. >1 hour) or
a row limit (e.g. when an optimizer estimate is seen to be badly wrong).

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#91Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Simon Riggs (#90)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 29, 2015 at 8:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 29 September 2015 at 12:52, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

Hitting a process with a signal and hoping it will produce a meaningful
response in all circumstances without disrupting its current task was way
too naive.

Hmm, I would have to disagree, sorry. For me the problem was dynamically
allocating everything at the time the signal is received and getting into
problems when that caused errors.

What I mean is that we need to move the actual EXPLAIN run out of
ProcessInterrupts(). It can be still fine to trigger the communication
with a signal.

* INIT - Allocate N areas of memory for use by queries, which can be

expanded/contracted as needed. Keep a freelist of structures.
* OBSERVER - When requested, gain exclusive access to a diagnostic area,
then allocate the designated process to that area, then send a signal
* QUERY - When signal received dump an EXPLAIN ANALYZE to the allocated
diagnostic area, (set flag to show complete, set latch on observer)
* OBSERVER - process data in diagnostic area and then release area for use
by next observation

If the EXPLAIN ANALYZE doesn't fit into the diagnostic chunk, LOG it as a
problem and copy data only up to the size defined. Any other ERRORs that
are caused by this process cause it to fail normally.

Do you envision problems if we do this with a newly allocated DSM every
time instead of pre-allocated area? This will have to revert the workflow,
because only the QUERY knows the required segment size:

OBSERVER - sends a signal and waits for its proc latch to be set
QUERY - when signal is received allocates a DSM just big enough to fit the
EXPLAIN plan, then locates the OBSERVER(s) and sets its latch (or their
latches)

The EXPLAIN plan should already be produced somewhere in the executor, to
avoid calling into explain.c from ProcessInterrupts().

That allows the observer to be another backend, or it allows the query

process to perform self-observation based upon a timeout (e.g. >1 hour) or
a row limit (e.g. when an optimizer estimate is seen to be badly wrong).

Do you think there is one single best place in the executor code where such
a check could be added? I have very little idea about that.

--
Alex

#92Simon Riggs
simon@2ndQuadrant.com
In reply to: Shulgin, Oleksandr (#91)
Re: On-demand running query plans using auto_explain and signals

On 29 September 2015 at 20:52, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

On Tue, Sep 29, 2015 at 8:34 PM, Simon Riggs <simon@2ndquadrant.com>
wrote:

On 29 September 2015 at 12:52, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

Hitting a process with a signal and hoping it will produce a meaningful
response in all circumstances without disrupting its current task was way
too naive.

Hmm, I would have to disagree, sorry. For me the problem was dynamically
allocating everything at the time the signal is received and getting into
problems when that caused errors.

What I mean is that we need to move the actual EXPLAIN run out of
ProcessInterrupts(). It can be still fine to trigger the communication
with a signal.

Good

* INIT - Allocate N areas of memory for use by queries, which can be

expanded/contracted as needed. Keep a freelist of structures.
* OBSERVER - When requested, gain exclusive access to a diagnostic area,
then allocate the designated process to that area, then send a signal
* QUERY - When signal received dump an EXPLAIN ANALYZE to the allocated
diagnostic area, (set flag to show complete, set latch on observer)
* OBSERVER - process data in diagnostic area and then release area for
use by next observation

If the EXPLAIN ANALYZE doesn't fit into the diagnostic chunk, LOG it as a
problem and copy data only up to the size defined. Any other ERRORs that
are caused by this process cause it to fail normally.

Do you envision problems if we do this with a newly allocated DSM every
time instead of pre-allocated area? This will have to revert the workflow,
because only the QUERY knows the required segment size:

That's too fiddly; we need to keep it simple by using just fixed sizes.

OBSERVER - sends a signal and waits for its proc latch to be set
QUERY - when signal is received allocates a DSM just big enough to fit the
EXPLAIN plan, then locates the OBSERVER(s) and sets its latch (or their
latches)

The EXPLAIN plan should already be produced somewhere in the executor, to
avoid calling into explain.c from ProcessInterrupts().

That allows the observer to be another backend, or it allows the query

process to perform self-observation based upon a timeout (e.g. >1 hour) or
a row limit (e.g. when an optimizer estimate is seen to be badly wrong).

Do you think there is one single best place in the executor code where
such a check could be added? I have very little idea about that.

Fairly simple.

Main problem is knowing how to handle nested calls to the executor.

I'll look at the patch.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#93Robert Haas
robertmhaas@gmail.com
In reply to: Shulgin, Oleksandr (#89)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 29, 2015 at 1:52 PM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

This is not a change of the direction, but rather of the approach. Hitting
a process with a signal and hoping it will produce a meaningful response in
all circumstances without disrupting its current task was way too naive.

I think that's exactly right. It's not safe to do much anything from
a signal handler, and while ProcessInterrupts() is a very
substantially safer, the set of things that can be safely done there
is still awfully restricted. You have to cope with the fact that the
function you just interrupted may be doing anything at all, and if you
change anything before returning, you may knock over the apple cart.
Just as bad, the state you inherit may not be very sane: we call
ProcessInterrupts() from LOTS of places and there is absolutely no
guarantee that every one of those places has the data structures that
you need to run EXPLAIN ANALYZE in a sane state.

I haven't looked at the new patch specifically so I don't have an
opinion on that at this time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#94Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Shulgin, Oleksandr (#89)
Re: On-demand running query plans using auto_explain and signals

On Tue, Sep 29, 2015 at 7:52 PM, Shulgin, Oleksandr <
oleksandr.shulgin@zalando.de> wrote:

This is not a change of the direction, but rather of the approach.
Hitting a process with a signal and hoping it will produce a meaningful
response in all circumstances without disrupting its current task was way
too naive. I'd really want to make this work with ANALYZE, just not as the
first step. I believe this approach can be extended to include
instrumentation support (obviously we will not be able to contain this in
the auto_explain module).

I was thinking about this and what seems to be the biggest problem is when
to actually turn the feature on. It seems unlikely that someone will want
to enable it unconditionally. Enabling per-backend also doesn't seem to be
a good approach because you don't know if the next query you'd like to look
at is going to run in this exact backend.

What might be actually usable is poking pg_stat_statements for queryid to
decide if we need to do explain (and possibly analyze). We could make it
so that the explain is produced for any query that is known to run longer
than certain configurable threshold on average (or we can provide for
enabling this explicitly per entry in pg_stat_statements). Then the
interested client can go and do pg_explain_backend() on the pid.

If we would also track the plan total costs, we could do some predictions
so that if the same query comes along and is planned with total cost
exceeding the recorded average the configurable amount of percent, then we
enable explain.

Does this make sense to you? Does this make a good argument for merging
pg_stat_statements and auto_explain into core?

--
Alex

#95Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shulgin, Oleksandr (#94)
Re: On-demand running query plans using auto_explain and signals

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

I was thinking about this and what seems to be the biggest problem is when
to actually turn the feature on. It seems unlikely that someone will want
to enable it unconditionally. Enabling per-backend also doesn't seem to be
a good approach because you don't know if the next query you'd like to look
at is going to run in this exact backend.

Check.

What might be actually usable is poking pg_stat_statements for queryid to
decide if we need to do explain (and possibly analyze).

Hm, interesting thought.

Does this make sense to you? Does this make a good argument for merging
pg_stat_statements and auto_explain into core?

I'd say more that it's a good argument for moving this feature out to
one of those extensions, or perhaps building a third extension that
depends on both of those. TBH, none of this stuff smells to me like
something that ought to be in core.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#96Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tom Lane (#95)
Re: On-demand running query plans using auto_explain and signals

2015-10-15 15:42 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de> writes:

I was thinking about this and what seems to be the biggest problem is

when

to actually turn the feature on. It seems unlikely that someone will

want

to enable it unconditionally. Enabling per-backend also doesn't seem to

be

a good approach because you don't know if the next query you'd like to

look

at is going to run in this exact backend.

Check.

What might be actually usable is poking pg_stat_statements for queryid to
decide if we need to do explain (and possibly analyze).

Hm, interesting thought.

Does this make sense to you? Does this make a good argument for merging
pg_stat_statements and auto_explain into core?

I'd say more that it's a good argument for moving this feature out to
one of those extensions, or perhaps building a third extension that
depends on both of those. TBH, none of this stuff smells to me like
something that ought to be in core.

There are few features, that I would to see in core:

1. possibility to get full SQL string
2. possibility to get state string

We can speak how to do it well.

Regards

Pavel

Show quoted text

regards, tom lane

#97Julien Rouhaud
julien.rouhaud@dalibo.com
In reply to: Pavel Stehule (#96)
Re: On-demand running query plans using auto_explain and signals

Hello,

On 15/10/2015 16:04, Pavel Stehule wrote:

2015-10-15 15:42 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us
<mailto:tgl@sss.pgh.pa.us>>:

"Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de
<mailto:oleksandr.shulgin@zalando.de>> writes:

I was thinking about this and what seems to be the biggest problem is when
to actually turn the feature on. It seems unlikely that someone will want
to enable it unconditionally. Enabling per-backend also doesn't seem to be
a good approach because you don't know if the next query you'd like to look
at is going to run in this exact backend.

Check.

What might be actually usable is poking pg_stat_statements for queryid to
decide if we need to do explain (and possibly analyze).

Hm, interesting thought.

Does this make sense to you? Does this make a good argument for merging
pg_stat_statements and auto_explain into core?

I'd say more that it's a good argument for moving this feature out to
one of those extensions, or perhaps building a third extension that
depends on both of those. TBH, none of this stuff smells to me like
something that ought to be in core.

There are few features, that I would to see in core:

1. possibility to get full SQL string
2. possibility to get state string

We can speak how to do it well.

I registered as reviewer on this, but after reading the whole thread for
the second time, it's still not clear to me if the last two submitted
patches (0001-Add-auto_explain.publish_plans.patch and
0002-Add-SHM-table-of-contents-to-the-explain-DSM.patch) are still
needed reviews, since multiple refactoring ideas and objections have
been raised since.

Regards.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#98Simon Riggs
simon@2ndQuadrant.com
In reply to: Julien Rouhaud (#97)
Re: On-demand running query plans using auto_explain and signals

On 30 November 2015 at 22:27, Julien Rouhaud <julien.rouhaud@dalibo.com>
wrote:

I registered as reviewer on this, but after reading the whole thread for
the second time, it's still not clear to me if the last two submitted
patches (0001-Add-auto_explain.publish_plans.patch and
0002-Add-SHM-table-of-contents-to-the-explain-DSM.patch) are still
needed reviews, since multiple refactoring ideas and objections have
been raised since.

I looked into it more deeply and decided partial EXPLAINs aren't very
useful yet.

I've got some thoughts around that which I'll publish when I have more
time. I suggest this is a good idea, just needs some serious
committer-love, so we should bounce this for now.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#99Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Simon Riggs (#98)
Re: On-demand running query plans using auto_explain and signals

On Tue, Dec 1, 2015 at 12:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 30 November 2015 at 22:27, Julien Rouhaud <julien.rouhaud@dalibo.com>
wrote:

I registered as reviewer on this, but after reading the whole thread for
the second time, it's still not clear to me if the last two submitted
patches (0001-Add-auto_explain.publish_plans.patch and
0002-Add-SHM-table-of-contents-to-the-explain-DSM.patch) are still
needed reviews, since multiple refactoring ideas and objections have
been raised since.

I looked into it more deeply and decided partial EXPLAINs aren't very
useful yet.

I've got some thoughts around that which I'll publish when I have more
time. I suggest this is a good idea, just needs some serious
committer-love, so we should bounce this for now.

I have the plans to make something from this on top of pg_stat_statements
and auto_explain, as I've mentioned last time. The next iteration will be
based on the two latest patches above, so it still makes sense to review
them.

As for EXPLAIN ANALYZE support, it will require changes to core, but this
can be done separately.

--
Alex

#100Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Shulgin, Oleksandr (#99)
Re: On-demand running query plans using auto_explain and signals

Hi,

On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:

I have the plans to make something from this on top of
pg_stat_statements and auto_explain, as I've mentioned last time. The
next iteration will be based on the two latest patches above, so it
still makes sense to review them.

As for EXPLAIN ANALYZE support, it will require changes to core, but
this can be done separately.

I'm re-reading the thread, and I have to admit I'm utterly confused what
is the current plan, what features it's supposed to provide and whether
it will solve the use case I'm most interested in. Oleksandr, could you
please post a summary explaining that?

My use case for this functionality is debugging of long-running queries,
particularly getting EXPLAIN ANALYZE for them. In such cases I either
can't run the EXPLAIN ANALYZE manually because it will either run
forever (just like the query) and may not be the same (e.g. due to
recent ANALYZE). So we need to extract the data from the process
executing the query.

I'm not essentially opposed to doing this in an extension (better an
extension than nothing), but I don't quite see how you could to do that
from pg_stat_statements or auto_explain. AFAIK both extensions merely
use hooks before/after the executor, and therefore can't do anything in
between (while the query is actually running).

Perhaps you don't intend to solve this particular use case? Which use
cases are you aiming to solve, then? Could you explain?

Maybe all we need to do is add another hook somewhere in the executor,
and push the explain analyze into pg_stat_statements once in a while,
entirely eliminating the need for inter-process communication (signals,
queues, ...).

I'm pretty sure we don't need this for "short" queries, because in those
cases we have other means to get the explain analyze (e.g. running the
query again or auto_explain). So I can imagine having a rather high
threshold (say, 60 seconds or even more), and we'd only push the explain
analyze after crossing it. And then we'd only update it once in a while
- say, again every 60 seconds.

Of course, this might be configurable by two GUCs:

pg_stat_statements.explain_analyze_threshold = 60 # -1 is "off"
pg_stat_statements.explain_analyze_refresh = 60

FWIW I'd still prefer having "EXPLAIN ANALYZE" in core, but better this
than nothing.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#101Shulgin, Oleksandr
oleksandr.shulgin@zalando.de
In reply to: Tomas Vondra (#100)
Re: On-demand running query plans using auto_explain and signals

On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

Hi,

On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:

I have the plans to make something from this on top of
pg_stat_statements and auto_explain, as I've mentioned last time. The
next iteration will be based on the two latest patches above, so it
still makes sense to review them.

As for EXPLAIN ANALYZE support, it will require changes to core, but
this can be done separately.

I'm re-reading the thread, and I have to admit I'm utterly confused what
is the current plan, what features it's supposed to provide and whether it
will solve the use case I'm most interested in. Oleksandr, could you please
post a summary explaining that?

My use case for this functionality is debugging of long-running queries,
particularly getting EXPLAIN ANALYZE for them. In such cases I either can't
run the EXPLAIN ANALYZE manually because it will either run forever (just
like the query) and may not be the same (e.g. due to recent ANALYZE). So we
need to extract the data from the process executing the query.

I'm not essentially opposed to doing this in an extension (better an
extension than nothing), but I don't quite see how you could to do that
from pg_stat_statements or auto_explain. AFAIK both extensions merely use
hooks before/after the executor, and therefore can't do anything in between
(while the query is actually running).

Perhaps you don't intend to solve this particular use case? Which use
cases are you aiming to solve, then? Could you explain?

Hi Tomas.

Thanks for your interest in this patch!

My motivation is the same as your use case: having a long-running query, be
able to look inside this exact query run by this exact backend.

I admit the evolution of ideas in this patch can be very confusing: we were
trying a number of different approaches, w/o thinking deeply on the
implications, just to have a proof of concept.

Maybe all we need to do is add another hook somewhere in the executor, and

push the explain analyze into pg_stat_statements once in a while, entirely
eliminating the need for inter-process communication (signals, queues, ...).

I'm pretty sure we don't need this for "short" queries, because in those
cases we have other means to get the explain analyze (e.g. running the
query again or auto_explain). So I can imagine having a rather high
threshold (say, 60 seconds or even more), and we'd only push the explain
analyze after crossing it. And then we'd only update it once in a while -
say, again every 60 seconds.

Of course, this might be configurable by two GUCs:

pg_stat_statements.explain_analyze_threshold = 60 # -1 is "off"
pg_stat_statements.explain_analyze_refresh = 60

FWIW I'd still prefer having "EXPLAIN ANALYZE" in core, but better this
than nothing.

Yes, this is how pg_stat_statements idea comes into play: even if we
implement support for online EXPLAIN ANALYZE, enabling it for every query
is an overkill, but if you don't enable it from the query start, there is
no way to enable it later on as the query has already progressed. So in
order to know for which queries it makes sense to enable this feature, we
need to know what is the query's expected run time, thus pg_stat_statements
seems like a natural place to obtain this information and/or trigger the
behavior.

I'm also all for simplification of the underlying communication mechanism:
shared memory queues are neat, but not necessarily the best way to handle
it. As for the use of signals: I believe this was a red herring, it will
make the code much less fragile if the progressing backend itself will
publish intermediary EXPLAIN ANALYZE reports for other backends to read.

The EXPLAIN (w/o ANALYZE) we can do completely as an extension: no core
support required. To enable ANALYZE it will require a little hacking
around Instrumentation methods: otherwise the Explain functions just crash
when run in the middle of the query.

Hope that makes it clear.

--
Alex

#102Michael Paquier
michael.paquier@gmail.com
In reply to: Shulgin, Oleksandr (#101)
Re: On-demand running query plans using auto_explain and signals

On Thu, Dec 17, 2015 at 6:45 PM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:

I have the plans to make something from this on top of
pg_stat_statements and auto_explain, as I've mentioned last time. The
next iteration will be based on the two latest patches above, so it
still makes sense to review them.

As for EXPLAIN ANALYZE support, it will require changes to core, but
this can be done separately.

I'm re-reading the thread, and I have to admit I'm utterly confused what
is the current plan, what features it's supposed to provide and whether it
will solve the use case I'm most interested in. Oleksandr, could you please
post a summary explaining that?

My use case for this functionality is debugging of long-running queries,
particularly getting EXPLAIN ANALYZE for them. In such cases I either can't
run the EXPLAIN ANALYZE manually because it will either run forever (just
like the query) and may not be the same (e.g. due to recent ANALYZE). So we
need to extract the data from the process executing the query.

I'm not essentially opposed to doing this in an extension (better an
extension than nothing), but I don't quite see how you could to do that from
pg_stat_statements or auto_explain. AFAIK both extensions merely use hooks
before/after the executor, and therefore can't do anything in between (while
the query is actually running).

Perhaps you don't intend to solve this particular use case? Which use
cases are you aiming to solve, then? Could you explain?

Thanks for your interest in this patch!

My motivation is the same as your use case: having a long-running query, be
able to look inside this exact query run by this exact backend.

I admit the evolution of ideas in this patch can be very confusing: we were
trying a number of different approaches, w/o thinking deeply on the
implications, just to have a proof of concept.

It's great to see ideas of concepts and things to help address those
issues, at least we are agreeing that there is a hole in the
instrumentation and that it would be nice to fill it with something.
Still, it is not completely clear which approach would be taken. Is it
fair to mark the current patch as returned with feedback then?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103Pavel Stehule
pavel.stehule@gmail.com
In reply to: Michael Paquier (#102)
Re: On-demand running query plans using auto_explain and signals

2015-12-24 3:23 GMT+01:00 Michael Paquier <michael.paquier@gmail.com>:

On Thu, Dec 17, 2015 at 6:45 PM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra <

tomas.vondra@2ndquadrant.com>

wrote:

On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:

I have the plans to make something from this on top of
pg_stat_statements and auto_explain, as I've mentioned last time. The
next iteration will be based on the two latest patches above, so it
still makes sense to review them.

As for EXPLAIN ANALYZE support, it will require changes to core, but
this can be done separately.

I'm re-reading the thread, and I have to admit I'm utterly confused what
is the current plan, what features it's supposed to provide and whether

it

will solve the use case I'm most interested in. Oleksandr, could you

please

post a summary explaining that?

My use case for this functionality is debugging of long-running queries,
particularly getting EXPLAIN ANALYZE for them. In such cases I either

can't

run the EXPLAIN ANALYZE manually because it will either run forever

(just

like the query) and may not be the same (e.g. due to recent ANALYZE).

So we

need to extract the data from the process executing the query.

I'm not essentially opposed to doing this in an extension (better an
extension than nothing), but I don't quite see how you could to do that

from

pg_stat_statements or auto_explain. AFAIK both extensions merely use

hooks

before/after the executor, and therefore can't do anything in between

(while

the query is actually running).

Perhaps you don't intend to solve this particular use case? Which use
cases are you aiming to solve, then? Could you explain?

Thanks for your interest in this patch!

My motivation is the same as your use case: having a long-running query,

be

able to look inside this exact query run by this exact backend.

I admit the evolution of ideas in this patch can be very confusing: we

were

trying a number of different approaches, w/o thinking deeply on the
implications, just to have a proof of concept.

It's great to see ideas of concepts and things to help address those
issues, at least we are agreeing that there is a hole in the
instrumentation and that it would be nice to fill it with something.
Still, it is not completely clear which approach would be taken. Is it
fair to mark the current patch as returned with feedback then?

+1

Pavel

Show quoted text

--
Michael

#104Michael Paquier
michael.paquier@gmail.com
In reply to: Pavel Stehule (#103)
Re: On-demand running query plans using auto_explain and signals

On Thu, Dec 24, 2015 at 1:57 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

2015-12-24 3:23 GMT+01:00 Michael Paquier <michael.paquier@gmail.com>:

On Thu, Dec 17, 2015 at 6:45 PM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:

On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com>
wrote:

On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:

I have the plans to make something from this on top of
pg_stat_statements and auto_explain, as I've mentioned last time. The
next iteration will be based on the two latest patches above, so it
still makes sense to review them.

As for EXPLAIN ANALYZE support, it will require changes to core, but
this can be done separately.

I'm re-reading the thread, and I have to admit I'm utterly confused
what
is the current plan, what features it's supposed to provide and whether
it
will solve the use case I'm most interested in. Oleksandr, could you
please
post a summary explaining that?

My use case for this functionality is debugging of long-running
queries,
particularly getting EXPLAIN ANALYZE for them. In such cases I either
can't
run the EXPLAIN ANALYZE manually because it will either run forever
(just
like the query) and may not be the same (e.g. due to recent ANALYZE).
So we
need to extract the data from the process executing the query.

I'm not essentially opposed to doing this in an extension (better an
extension than nothing), but I don't quite see how you could to do that
from
pg_stat_statements or auto_explain. AFAIK both extensions merely use
hooks
before/after the executor, and therefore can't do anything in between
(while
the query is actually running).

Perhaps you don't intend to solve this particular use case? Which use
cases are you aiming to solve, then? Could you explain?

Thanks for your interest in this patch!

My motivation is the same as your use case: having a long-running query,
be
able to look inside this exact query run by this exact backend.

I admit the evolution of ideas in this patch can be very confusing: we
were
trying a number of different approaches, w/o thinking deeply on the
implications, just to have a proof of concept.

It's great to see ideas of concepts and things to help address those
issues, at least we are agreeing that there is a hole in the
instrumentation and that it would be nice to fill it with something.
Still, it is not completely clear which approach would be taken. Is it
fair to mark the current patch as returned with feedback then?

+1

So, done this way. We could always move it to next CF if someone
wishes to move on. But at this point that would be a different
approach than what was proposed first..
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers