Sampling-based timing for EXPLAIN ANALYZE

Started by Lukas Fittl over 3 years ago · 12 messages · pgsql-hackers
#1 Lukas Fittl
lukas@fittl.com

Hi,

Since EXPLAIN ANALYZE with TIMING ON still carries noticeable overhead on
modern hardware (despite time sources being faster), I'd like to propose a
new setting for EXPLAIN ANALYZE, called "TIMING SAMPLING", as an
alternative to TIMING ON.

This new timing mode uses a timer on a fixed recurring frequency (e.g. 100
or 1000 Hz) to gather a sampled timestamp on a predefined schedule, instead
of getting the time on-demand when InstrStartNode/InstrStopNode is called.
To implement the timer, we can use the existing timeout infrastructure,
which is backed by a wall clock timer (ITIMER_REAL).

Conceptually this is inspired by how sampling profilers work (e.g. "perf"),
but it ties into the existing per-plan node instrumentation done by EXPLAIN
ANALYZE, and simply provides a lower accuracy version of the total time for
each plan node.

In EXPLAIN output this is marked as "sampled time", and scaled to the total
wall clock time (to adjust for the sampling undercounting):

=# EXPLAIN (ANALYZE, BUFFERS, TIMING SAMPLING, SAMPLEFREQ 100) SELECT ...;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=201747.90..201748.00 rows=10 width=12) (actual sampled time=5490.974 rows=9 loops=1)
   ...
   ->  Hash Join  (cost=0.23..199247.90 rows=499999 width=4) (actual sampled time=3738.619 rows=9000000 loops=1)
         ...
         ->  Seq Scan on large  (cost=0.00..144247.79 rows=9999979 width=4) (actual sampled time=1004.671 rows=10000001 loops=1)
         ...
         ->  Hash  (cost=0.10..0.10 rows=10 width=4) (actual sampled time=0.000 rows=10 loops=1)
         ...
 Execution Time: 5491.475 ms
---

In simple query tests like this on my local machine, this shows a
consistent benefit over TIMING ON (and behaves close to ANALYZE with TIMING
OFF), whilst providing "good enough" accuracy to identify which part of the
query was problematic.

Attached is a prototype patch for early feedback on the concept, with tests
and documentation to come in a follow-up. Since the January commitfest is
still marked as open I'll register it there, but note that my assumption is
this is *not* Postgres 16 material.

As an open item, note that in the patch the requested sampling frequency is
not yet passed to parallel workers (it always defaults to 1000 Hz when
sampling is enabled). Also, note the timing frequency is limited to a
maximum of 1000 Hz (1ms) due to current limitations of the timeout
infrastructure.

With thanks to Andres Freund for help on refining the idea, collaborating
on early code and finding the approach to hook into the timeout API.

Thanks,
Lukas

--
Lukas Fittl

Attachments:

v1-0001-Add-TIMING-SAMPLING-option-for-EXPLAIN-ANALYZE.patch (application/octet-stream, +320/-10)
#2 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Lukas Fittl (#1)
Re: Sampling-based timing for EXPLAIN ANALYZE

Nice addition! And the code looks pretty straightforward.

The current patch triggers warnings:
https://cirrus-ci.com/task/6016013976731648 Looks like you need to add void
as the argument.

Do you have some performance comparison between TIMING ON and TIMING
SAMPLING?

In InstrStartSampling there's logic to increase/decrease the frequency of
an already existing timer. It's not clear to me when this can occur. I'd
expect sampling frequency to remain constant throughout an explain plan. If
it's indeed needed, I think a code comment would be useful to explain why
this edge case is necessary.

On Fri, 6 Jan 2023 at 09:41, Lukas Fittl <lukas@fittl.com> wrote:


#3 David Geier
geidav.pg@gmail.com
In reply to: Jelte Fennema-Nio (#2)
Re: Sampling-based timing for EXPLAIN ANALYZE

Nice idea.

On 1/6/23 10:19, Jelte Fennema wrote:

Do you have some performance comparison between TIMING ON and TIMING
SAMPLING?

+1 to see some numbers compared to TIMING ON.

Mostly I'm wondering if the sampling-based approach gains us enough to
be worth it, once the patch to use RDTSC has hopefully landed (see [1]). I
believe that with the RDTSC patch the overhead of TIMING ON is lower
than the overhead of using ANALYZE with TIMING OFF in the first place.
Hence, to be really useful, it would be great if we could on top of
TIMING SAMPLING also lower the overhead of ANALYZE itself further (e.g.
by using a fast path for the default EXPLAIN (ANALYZE, TIMING ON /
SAMPLING)). Currently, InstrStartNode() and InstrStopNode() have a ton
of branches and without all the typically deactivated code the
implementation would be very small and could be placed in an inlinable
function.

[1]: /messages/by-id/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de

--
David Geier
(ServiceNow)

#4 Andres Freund
andres@anarazel.de
In reply to: David Geier (#3)
Re: Sampling-based timing for EXPLAIN ANALYZE

Hi,

On 2023-01-13 09:11:06 +0100, David Geier wrote:

Mostly I'm wondering if the sampling based approach gains us enough to be
worth it, once the patch to use RDTSC hopefully landed (see [1]).

Well, I'm not sure we have a path forward on it. There's portability and
accuracy concerns. But more importantly:

I believe that with the RDTSC patch the overhead of TIMING ON is lower than
the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
be really useful, it would be great if we could on top of TIMING SAMPLING
also lower the overhead of ANALYZE itself further (e.g. by using a fast path
for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
InstrStartNode() and InstrStopNode() have a ton of branches and without all
the typically deactivated code the implementation would be very small and
could be placed in an inlinable function.

Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
at the cost of a bit less detail in the explain, of course. Which could make
it a lot more feasible to enable something like auto_explain.log_timing in
busy workloads.

For the sampling mode we don't really need something like
InstrStart/StopNode. We just need a pointer to the node currently executing - not
free to set, but still a heck of a lot cheaper than InstrStopNode(), even
without ->need_timer etc. Then the timer just needs to do
instr->sampled_total += (now - last_sample)
last_sample = now

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

Separately, I think we should consider re-ordering Instrumentation so that
bufusage_start, walusage_start are after the much more commonly used
elements. We're forcing ntuples, nloops, .. onto separate cachelines, even
though they're accounted for unconditionally.

Greetings,

Andres Freund

#5 Lukas Fittl
lukas@fittl.com
In reply to: Jelte Fennema-Nio (#2)
Re: Sampling-based timing for EXPLAIN ANALYZE

On Fri, Jan 6, 2023 at 1:19 AM Jelte Fennema <me@jeltef.nl> wrote:

Nice addition! And the code looks pretty straight forward.

Thanks for reviewing!

The current patch triggers warnings:
https://cirrus-ci.com/task/6016013976731648 Looks like you need to add
void as the argument.

Fixed in v2 attached. This also adds a simple regression test, as well as
fixes the parallel worker handling.

Do you have some performance comparison between TIMING ON and TIMING
SAMPLING?

Here are some benchmarks of auto_explain overhead on my ARM-based M1
Macbook for the following query run with pgbench on a scale factor 100 data
set:

SELECT COUNT(*) FROM pgbench_branches JOIN pgbench_accounts USING (bid)
JOIN pgbench_tellers USING (bid) WHERE bid = 42;

(the motivation is to use a query that is more complex than the standard
pgbench select-only test query)

avg latency (best of 3), -T 300, -c 4, -s 100, shared_buffers 2GB, fsync
off, max_parallel_workers_per_gather 0:

master, log_timing = off: 871 ms (878 / 877 / 871)
patch, log_timing = off: 869 ms (882 / 880 / 869)
patch, log_timing = on: 890 ms (917 / 930 / 890)
patch, log_timing = sampling, samplefreq = 1000: 869 ms (887 / 869 / 894)

Additionally, here is Andres' benchmark from [1], with the sampling option
added:

% psql -Xc 'DROP TABLE IF EXISTS t; CREATE TABLE t AS SELECT * FROM
generate_series(1, 100000) g(i);' postgres && pgbench -n -r -t 100 -f
<(echo -e "SELECT COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING OFF) SELECT
COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT COUNT(*) FROM
t;EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;") postgres |grep '^ '
DROP TABLE
SELECT 100000
3.507 0 SELECT COUNT(*) FROM t;
3.476 0 EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*)
FROM t;
3.576 0 EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT
COUNT(*) FROM t;
5.096 0 EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*)
FROM t;

My pg_test_timing data for reference:

% pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 23.65 ns
Histogram of timing durations:
  < us   % of total      count
     1     97.64472  123876325
     2      2.35421    2986658
     4      0.00022        277
     8      0.00016        202
    16      0.00064        815
    32      0.00005         64

In InstrStartSampling there's logic to increase/decrease the frequency of
an already existing timer. It's not clear to me when this can occur. I'd
expect sampling frequency to remain constant throughout an explain plan. If
it's indeed needed, I think a code comment would be useful to explain why
this edge case is necessary.

Clarified in a code comment in v2. This is needed for handling nested
statements, which could have different sampling frequencies for each nesting
level, i.e. a function might want to sample its queries at a higher
frequency than its caller.

Thanks,
Lukas

[1]: /messages/by-id/20230116213913.4oseovlzvc2674z7@awork3.anarazel.de

--
Lukas Fittl

Attachments:

v2-0001-Add-TIMING-SAMPLING-option-for-EXPLAIN-ANALYZE.patch (application/octet-stream, +419/-10)
#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#4)
Re: Sampling-based timing for EXPLAIN ANALYZE

On 1/15/23 21:22, Andres Freund wrote:

Hi,

On 2023-01-13 09:11:06 +0100, David Geier wrote:

Mostly I'm wondering if the sampling based approach gains us enough to be
worth it, once the patch to use RDTSC hopefully landed (see [1]).

Well, I'm not sure we have a path forward on it. There's portability and
accuracy concerns. But more importantly:

I believe that with the RDTSC patch the overhead of TIMING ON is lower than
the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
be really useful, it would be great if we could on top of TIMING SAMPLING
also lower the overhead of ANALYZE itself further (e.g. by using a fast path
for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
InstrStartNode() and InstrStopNode() have a ton of branches and without all
the typically deactivated code the implementation would be very small and
could be placed in an inlinable function.

Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
at the cost of a bit less detail in the explain, of course. Which could make
it a lot more feasible to enable something like auto_explain.log_timing in
busy workloads.

For the sampling mode we don't really need something like
InstrStart/StopNode. We just need a pointer to node currently executing - not
free to set, but still a heck of a lot cheaper than InstrStopNode(), even
without ->need_timer etc. Then the timer just needs to do
instr->sampled_total += (now - last_sample)
last_sample = now

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

Separately, I think we should consider re-ordering Instrumentation so that
bufusage_start, walusage_start are after the much more commonly used
elements. We're forcing ntuples, nloops, .. onto separate cachelines, even
though they're accounted for unconditionally.

+1 to that

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#6)
Re: Sampling-based timing for EXPLAIN ANALYZE

Hi,

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.

The timer interrupt distances aren't all that evenly spaced, particularly
under load, and are easily distorted by having to wait for IO, an lwlock ...

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

We already compute the timestamp inside timeout.c, but don't yet pass that to
timeout handlers. I think there's others re-computing timestamps.

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

That's just the scaling to the "actual time" that you're talking about above,
no?

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Greetings,

Andres Freund

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: Sampling-based timing for EXPLAIN ANALYZE

On 1/17/23 18:02, Andres Freund wrote:

Hi,

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.

The timer interrupt distances aren't all that evenly spaced, particularly
under load, and are easily distorted by having to wait for IO, an lwlock ...

OK, so the difference is that these events (I/O, lwlocks) may block
signals, and after signals get unblocked we only get a single event for
each signal. Yeah, the timestamp handles that case better.

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

Perhaps for many executions it works out, because some executions will
cross the boundary, and the average will converge to the right value.

We already compute the timestamp inside timeout.c, but don't yet pass that to
timeout handlers. I think there's others re-computing timestamps.

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

That's just the scaling to the "actual time" that you're talking about above,
no?

Maybe, not sure.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Why infeasible?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#8)
Re: Sampling-based timing for EXPLAIN ANALYZE

Hi,

On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:

On 1/17/23 18:02, Andres Freund wrote:

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

In that case there simply wasn't any sample for the node, and a non-timestamp
based sample counter wouldn't do anything different?

If you're worried about the case where a timer does fire during execution of
the node, but exactly once, that should provide a difference between the last
sampled timestamp and the current time. It'll attribute a bit too much to the
in-progress nodes, but well, that's sampling for you.

I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Why infeasible?

Primarily the scale of the change. We'd have to pass down the context into all
table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
which'd require their callers to have access to the context. That's hundreds
if not thousands places.

Adding that many function parameters might turn out to be noticeable runtime
wise, due to increased register movement. I think for a number of the places
where we currently don't, we ought to use by-reference struct for the
not-always-used parameters, that then also could contain this context.

Greetings,

Andres Freund

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#9)
Re: Sampling-based timing for EXPLAIN ANALYZE

On 1/17/23 19:46, Andres Freund wrote:

Hi,

On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:

On 1/17/23 18:02, Andres Freund wrote:

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

In that case there simply wasn't any sample for the node, and a non-timestamp
based sample counter wouldn't do anything different?

Yeah, you're right.

If you're worried about the case where a timer does fire during execution of
the node, but exactly once, that should provide a difference between the last
sampled timestamp and the current time. It'll attribute a bit too much to the
in-progress nodes, but well, that's sampling for you.

I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.

Yeah, this is essentially the sampling I imagined when I first read the
subject of this thread. It samples which node executions to measure (and
then measures those accurately), while these patches sample timestamps.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Why infeasible?

Primarily the scale of the change. We'd have to pass down the context into all
table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
which'd require their callers to have access to the context. That's hundreds
if not thousands places.

Adding that many function parameters might turn out to be noticeable runtime
wise, due to increased register movement. I think for a number of the places
where we currently don't, we ought to use by-reference struct for the
not-always-used parameters, that then also could contain this context.

OK, I haven't realized we'd have to pass it to that many places.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#10)
Re: Sampling-based timing for EXPLAIN ANALYZE

On Tue, 17 Jan 2023 at 14:52, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/17/23 19:46, Andres Freund wrote:

I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.

Yeah, this is essentially the sampling I imagined when I first read the
subject of this thread. It samples which node executions to measure (and
then measures those accurately), while these patches sample timestamps.

That sounds interesting. Fwiw my first thought would be to implement
it a bit differently. Always have a timer running sampling right
from the start, but also if there are less than, say, 1000 samples for
a node then measure the actual start/finish time.

So for any given node once you've hit enough samples to get a decent
estimate you stop checking the time. That way any fast or rarely
called nodes still have accurate measurements even if they get few
samples and any long or frequently called nodes stop getting
timestamps and just use timer counts.

--
greg

#12 Daniel Gustafsson
daniel@yesql.se
In reply to: Bruce Momjian (#11)
Re: Sampling-based timing for EXPLAIN ANALYZE

This thread has been stale since January with no movement at all during the
March CF, and according to the CFBot it stopped building ~14 weeks ago.

I'm marking this returned with feedback; it can be resubmitted for a future CF
if someone decides to pick it up.

--
Daniel Gustafsson