Re: [PATCH] Add extra statistics to explain for Nested Loop

Started by Ekaterina Sokolovaalmost 4 years ago8 messages

e.sokolova@postgrespro.ru

almost 4 years ago

3 attachment(s)

Hi, hackers. Thank you for your attention to this topic.

Julien Rouhaud wrote:

+static void show_loop_info(Instrumentation *instrument, bool isworker,
+                           ExplainState *es);
I think this should be done as a separate refactoring commit.

Sure. I divided the patch. Now Justin's refactor commit is separated.
Also I actualized it a bit.

Most of the comments I have are easy to fix. But I think that the real
problem
is the significant overhead shown by Ekaterina that for now would apply
even if
you don't consume the new stats, for instance if you have
pg_stat_statements.
And I'm still not sure of what is the best way to avoid that.

I took your advice about InstrumentOption. Now INSTRUMENT_EXTRA exists.
So currently it's no overheads during basic load. Operations using
INSTRUMENT_ALL contain overheads (because of INSTRUMENT_EXTRA is a part
of INSTRUMENT_ALL), but they are much less significant than before. I
apply new overhead statistics collected by pgbench with auto _explain
enabled.

Why do you need to initialize min_t and min_tuples but not max_t and
max_tuples while both will initially be 0 and possibly updated
afterwards?

We need this initialization for min values so comment about it located
above the block of code with initialization.

I am convinced that the latest changes have affected the patch in a
positive way. I'll be pleased to hear your thoughts on this.

--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-explain.c-refactor-ExplainNode_v2.patchtext/x-diff; name=0001-explain.c-refactor-ExplainNode_v2.patchDownload

From: Justin Pryzby <pryzbyj@telsasoft.com>
Subject: [PATCH 2/2] explain.c: refactor ExplainNode()

---
 src/backend/commands/explain.c | 110 +++++++++++++++++++++--------------------------
 1 file changed, 48 insertions(+), 62 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7dcffa13f2..61a8a67e2ad 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -118,6 +118,8 @@ static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
+static void show_loop_info(Instrumentation *instrument, bool isworker,
+						   ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 							  bool planning);
@@ -1618,36 +1620,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	if (es->analyze &&
 		planstate->instrument && planstate->instrument->nloops > 0)
-	{
-		double		nloops = planstate->instrument->nloops;
-		double		startup_ms = 1000.0 * planstate->instrument->startup / nloops;
-		double		total_ms = 1000.0 * planstate->instrument->total / nloops;
-		double		rows = planstate->instrument->ntuples / nloops;
-
-		if (es->format == EXPLAIN_FORMAT_TEXT)
-		{
-			if (es->timing)
-				appendStringInfo(es->str,
-								 " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
-								 startup_ms, total_ms, rows, nloops);
-			else
-				appendStringInfo(es->str,
-								 " (actual rows=%.0f loops=%.0f)",
-								 rows, nloops);
-		}
-		else
-		{
-			if (es->timing)
-			{
-				ExplainPropertyFloat("Actual Startup Time", "ms", startup_ms,
-									 3, es);
-				ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
-									 3, es);
-			}
-			ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-			ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-		}
-	}
+		show_loop_info(planstate->instrument, false, es);
 	else if (es->analyze)
 	{
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1677,44 +1650,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		for (int n = 0; n < w->num_workers; n++)
 		{
 			Instrumentation *instrument = &w->instrument[n];
-			double		nloops = instrument->nloops;
-			double		startup_ms;
-			double		total_ms;
-			double		rows;
 
-			if (nloops <= 0)
+			if (instrument->nloops <= 0)
 				continue;
-			startup_ms = 1000.0 * instrument->startup / nloops;
-			total_ms = 1000.0 * instrument->total / nloops;
-			rows = instrument->ntuples / nloops;
 
 			ExplainOpenWorker(n, es);
 
-			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-				ExplainIndentText(es);
-				if (es->timing)
-					appendStringInfo(es->str,
-									 "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
-									 startup_ms, total_ms, rows, nloops);
-				else
-					appendStringInfo(es->str,
-									 "actual rows=%.0f loops=%.0f\n",
-									 rows, nloops);
-			}
-			else
-			{
-				if (es->timing)
-				{
-					ExplainPropertyFloat("Actual Startup Time", "ms",
-										 startup_ms, 3, es);
-					ExplainPropertyFloat("Actual Total Time", "ms",
-										 total_ms, 3, es);
-				}
-				ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-				ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-			}
+			show_loop_info(instrument, true, es);
 
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfoChar(es->str, '\n');
 			ExplainCloseWorker(n, es);
 		}
 	}
@@ -4043,6 +3988,47 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
 		ExplainCloseGroup("Target Tables", "Target Tables", false, es);
 }
 
+void
+show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
+{
+	double		nloops = instrument->nloops;
+	double		startup_ms = 1000.0 * instrument->startup / nloops;
+	double		total_ms = 1000.0 * instrument->total / nloops;
+	double		rows = instrument->ntuples / nloops;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (isworker)
+			ExplainIndentText(es);
+		else
+			appendStringInfo(es->str, " (");
+
+		if (es->timing)
+			appendStringInfo(es->str,
+							 "actual time=%.3f..%.3f rows=%.0f loops=%.0f",
+							 startup_ms, total_ms, rows, nloops);
+		else
+			appendStringInfo(es->str,
+							 "actual rows=%.0f loops=%.0f",
+							 rows, nloops);
+
+		if (!isworker)
+			appendStringInfoChar(es->str, ')');
+	}
+	else
+	{
+		if (es->timing)
+		{
+			ExplainPropertyFloat("Actual Startup Time", "ms", startup_ms,
+								 3, es);
+			ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
+								 3, es);
+		}
+		ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+		ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
+	}
+}
+
 /*
  * Explain the constituent plans of an Append, MergeAppend,
  * BitmapAnd, or BitmapOr node.

0002-extra_statistics_v7.patchtext/x-diff; name=0002-extra_statistics_v7.patchDownload

From: Ekaterina Sokolova <e.sokolova@postgrespro.ru>
Subject: [PATCH 1/2] Add extra statistics to explain for Nested Loop

For some distributions of data in tables, different loops in nested loop
joins can take different time and process different amounts of entries.
It makes average statistics returned by explain analyze not very useful for DBA.

This patch add collecting of min, max and total statistics for time and rows
across all loops to EXPLAIN ANALYSE. You need to set the VERBOSE flag to display
this information. The patch contains regression tests.

Example of results in TEXT format:
   ->  Append (actual rows=5 loops=5)
         Loop Min Rows: 2  Max Rows: 6  Total Rows: 23

Reviewed-by: Lukas Fittl, Justin Pryzby, Yugo Nagata, Julien Rouhaud.

---
 src/backend/commands/explain.c                |  40 ++++++++++++++++++++++++++++++++++++++++
 src/backend/executor/instrument.c             |  39 ++++++++++++++++++++++++++++++++++++++-
 src/include/executor/instrument.h             |  10 ++++++++++
 src/test/regress/expected/explain.out         |  10 ++++++++++
 src/test/regress/expected/partition_prune.out | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/test/regress/sql/partition_prune.sql      |  32 ++++++++++++++++++++++++++++++++
 6 files changed, 247 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 98b44009591..285438cf18b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -541,6 +541,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	if (es->wal)
 		instrument_option |= INSTRUMENT_WAL;
 
+	if (es->verbose)
+		instrument_option |= INSTRUMENT_EXTRA;
+
 	/*
 	 * We always collect timing for the entire statement, even when node-level
 	 * timing is off, so we don't look at es->timing here.  (We could skip
@@ -1624,6 +1627,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			appendStringInfoString(es->str, " (never executed)");
 		else
 		{
+			/* without min and max values because actual result is 0 */
 			if (es->timing)
 			{
 				ExplainPropertyFloat("Actual Startup Time", "ms", 0.0, 3, es);
@@ -3991,6 +3995,11 @@ show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
 	double		startup_ms = 1000.0 * instrument->startup / nloops;
 	double		total_ms = 1000.0 * instrument->total / nloops;
 	double		rows = instrument->ntuples / nloops;
+	double		loop_total_rows = instrument->ntuples;
+	double		loop_min_r = instrument->min_tuples;
+	double		loop_max_r = instrument->max_tuples;
+	double		loop_min_t_ms = 1000.0 * instrument->min_t;
+	double		loop_max_t_ms = 1000.0 * instrument->max_t;
 
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
@@ -4010,6 +4019,21 @@ show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
 
 		if (!isworker)
 			appendStringInfoChar(es->str, ')');
+
+		if (nloops > 1 && es->verbose)
+		{
+			appendStringInfoChar(es->str, '\n');
+			ExplainIndentText(es);
+
+			if (es->timing)
+				appendStringInfo(es->str,
+								 "Loop Min Rows: %.0f  Max Rows: %.0f  Total Rows: %.0f  Min Time: %.3f  Max Time: %.3f",
+								 loop_min_r, loop_max_r, loop_total_rows, loop_min_t_ms, loop_max_t_ms);
+			else
+				appendStringInfo(es->str,
+								 "Loop Min Rows: %.0f  Max Rows: %.0f  Total Rows: %.0f",
+								 loop_min_r, loop_max_r, loop_total_rows);
+		}
 	}
 	else
 	{
@@ -4019,8 +4043,24 @@ show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
 								 3, es);
 			ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
 								 3, es);
+			if (nloops > 1 && es->verbose)
+			{
+				ExplainPropertyFloat("Loop Min Time", "ms", loop_min_t_ms,
+									 3, es);
+				ExplainPropertyFloat("Loop Max Time", "ms", loop_max_t_ms,
+									 3, es);
+			}
 		}
+
 		ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+
+		if (nloops > 1 && es->verbose)
+		{
+			ExplainPropertyFloat("Loop Min Rows", NULL, loop_min_r, 0, es);
+			ExplainPropertyFloat("Loop Max Rows", NULL, loop_max_r, 0, es);
+			ExplainPropertyFloat("Loop Total Rows", NULL, loop_total_rows, 0, es);
+		}
+
 		ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
 	}
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c5ff02a8424..ce5abe39b56 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -34,11 +34,12 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 
 	/* initialize all fields to zeroes, then modify as needed */
 	instr = palloc0(n * sizeof(Instrumentation));
-	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL | INSTRUMENT_EXTRA))
 	{
 		bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
 		bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
 		bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+		bool		need_extra = (instrument_options & INSTRUMENT_EXTRA) != 0;
 		int			i;
 
 		for (i = 0; i < n; i++)
@@ -46,6 +47,7 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 			instr[i].need_bufusage = need_buffers;
 			instr[i].need_walusage = need_wal;
 			instr[i].need_timer = need_timer;
+			instr[i].need_extra = need_extra;
 			instr[i].async_mode = async_mode;
 		}
 	}
@@ -61,6 +63,7 @@ InstrInit(Instrumentation *instr, int instrument_options)
 	instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
 	instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
 	instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+	instr->need_extra = (instrument_options & INSTRUMENT_EXTRA) != 0;
 }
 
 /* Entry to a plan node */
@@ -154,6 +157,36 @@ InstrEndLoop(Instrumentation *instr)
 	instr->startup += instr->firsttuple;
 	instr->total += totaltime;
 	instr->ntuples += instr->tuplecount;
+
+	/*
+	 * this is first loop
+	 *
+	 * We only initialize the min values. We don't need to bother with the
+	 * max, because those are 0 and the non-zero values will get updated a
+	 * couple lines later.
+	 */
+	if (instr->need_extra)
+	{
+		if (instr->nloops == 0)
+		{
+			instr->min_t = totaltime;
+			instr->min_tuples = instr->tuplecount;
+		}
+
+		if (instr->min_t > totaltime)
+			instr->min_t = totaltime;
+
+		if (instr->max_t < totaltime)
+			instr->max_t = totaltime;
+
+
+		if (instr->min_tuples > instr->tuplecount)
+			instr->min_tuples = instr->tuplecount;
+
+		if (instr->max_tuples < instr->tuplecount)
+			instr->max_tuples = instr->tuplecount;
+	}
+
 	instr->nloops += 1;
 
 	/* Reset for next cycle (if any) */
@@ -186,6 +219,10 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 	dst->nloops += add->nloops;
 	dst->nfiltered1 += add->nfiltered1;
 	dst->nfiltered2 += add->nfiltered2;
+	dst->min_t = Min(dst->min_t, add->min_t);
+	dst->max_t = Max(dst->max_t, add->max_t);
+	dst->min_tuples = Min(dst->min_tuples, add->min_tuples);
+	dst->max_tuples = Max(dst->max_tuples, add->max_tuples);
 
 	/* Add delta of buffer usage since entry to node's totals */
 	if (dst->need_bufusage)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1b7157bdd15..e6178e248dc 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -58,6 +58,8 @@ typedef enum InstrumentOption
 	INSTRUMENT_BUFFERS = 1 << 1,	/* needs buffer usage */
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
+	INSTRUMENT_EXTRA = 1 << 4,	/* needs counters for min,
+								 * max and total values */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
@@ -67,6 +69,8 @@ typedef struct Instrumentation
 	bool		need_timer;		/* true if we need timer data */
 	bool		need_bufusage;	/* true if we need buffer usage data */
 	bool		need_walusage;	/* true if we need WAL usage data */
+	bool		need_extra;		/* true if we need min, max and total
+								 * statistics for loops */
 	bool		async_mode;		/* true if node is in async mode */
 	/* Info about current plan cycle: */
 	bool		running;		/* true if we've completed first tuple */
@@ -79,7 +83,13 @@ typedef struct Instrumentation
 	/* Accumulated statistics across all completed cycles: */
 	double		startup;		/* total startup time (in seconds) */
 	double		total;			/* total time (in seconds) */
+	double		min_t;			/* time of fastest loop (in seconds) */
+	double		max_t;			/* time of slowest loop (in seconds) */
 	double		ntuples;		/* total tuples produced */
+	double		min_tuples;		/* min counter of produced tuples for all
+								 * loops */
+	double		max_tuples;		/* max counter of produced tuples for all
+								 * loops */
 	double		ntuples2;		/* secondary node-specific tuple counter */
 	double		nloops;			/* # of run cycles for this node */
 	double		nfiltered1;		/* # of tuples removed by scanqual or joinqual */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index bc361759219..c70d0e288da 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -357,8 +357,13 @@ select jsonb_pretty(
                              "Actual Loops": 0,             +
                              "Startup Cost": 0.0,           +
                              "Async Capable": false,        +
+                             "Loop Max Rows": 0,            +
+                             "Loop Max Time": 0.0,          +
+                             "Loop Min Rows": 0,            +
+                             "Loop Min Time": 0.0,          +
                              "Relation Name": "tenk1",      +
                              "Parallel Aware": true,        +
+                             "Loop Total Rows": 0,          +
                              "Local Hit Blocks": 0,         +
                              "Temp Read Blocks": 0,         +
                              "Actual Total Time": 0.0,      +
@@ -403,7 +408,12 @@ select jsonb_pretty(
                      "Actual Loops": 0,                     +
                      "Startup Cost": 0.0,                   +
                      "Async Capable": false,                +
+                     "Loop Max Rows": 0,                    +
+                     "Loop Max Time": 0.0,                  +
+                     "Loop Min Rows": 0,                    +
+                     "Loop Min Time": 0.0,                  +
                      "Parallel Aware": false,               +
+                     "Loop Total Rows": 0,                  +
                      "Sort Space Used": 0,                  +
                      "Local Hit Blocks": 0,                 +
                      "Temp Read Blocks": 0,                 +
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7555764c779..18617b9e206 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -1962,6 +1962,24 @@ begin
     end loop;
 end;
 $$;
+create function explain_verbose_parallel_append(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in
+        execute format('explain (analyze, verbose, costs off, summary off, timing off) %s',
+            $1)
+    loop
+        ln := regexp_replace(ln, 'Workers Launched: \d+', 'Workers Launched: N');
+        ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N');
+        ln := regexp_replace(ln, 'Loop Min Rows: \d+  Max Rows: \d+  Total Rows: \d+',
+                                 'Loop Min Rows: N  Max Rows: N  Total Rows: N');
+        return next ln;
+    end loop;
+end;
+$$;
 prepare ab_q4 (int, int) as
 select avg(a) from ab where a between $1 and $2 and b < 4;
 -- Encourage use of parallel plans
@@ -2218,6 +2236,39 @@ select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on
                                  Index Cond: (a = a.a)
 (28 rows)
 
+-- Tests for extra statistics
+create table lprt_b (b int not null);
+insert into lprt_b select generate_series(1,20);
+select explain_verbose_parallel_append('select * from lprt_a join lprt_b on a != b');
+                       explain_verbose_parallel_append                        
+------------------------------------------------------------------------------
+ Nested Loop (actual rows=N loops=N)
+   Output: lprt_a.a, lprt_b.b
+   Join Filter: (lprt_a.a <> lprt_b.b)
+   Rows Removed by Join Filter: 4
+   ->  Gather (actual rows=N loops=N)
+         Output: lprt_b.b
+         Workers Planned: 2
+         Workers Launched: N
+         ->  Parallel Seq Scan on public.lprt_b (actual rows=N loops=N)
+               Loop Min Rows: N  Max Rows: N  Total Rows: N
+               Output: lprt_b.b
+               Worker 0:  actual rows=N loops=N
+               Worker 1:  actual rows=N loops=N
+   ->  Materialize (actual rows=N loops=N)
+         Loop Min Rows: N  Max Rows: N  Total Rows: N
+         Output: lprt_a.a
+         ->  Gather (actual rows=N loops=N)
+               Output: lprt_a.a
+               Workers Planned: 1
+               Workers Launched: N
+               ->  Parallel Seq Scan on public.lprt_a (actual rows=N loops=N)
+                     Loop Min Rows: N  Max Rows: N  Total Rows: N
+                     Output: lprt_a.a
+                     Worker 0:  actual rows=N loops=N
+(24 rows)
+
+drop table lprt_b;
 delete from lprt_a where a = 1;
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
                                      explain_parallel_append                                     
@@ -2715,6 +2766,72 @@ order by tbl1.col1, tprt.col1;
  1001 | 1001
 (3 rows)
 
+-- Tests for extra statistics
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Nested Loop (actual rows=23 loops=1)
+   Output: tbl1.col1, tprt.col1
+   ->  Seq Scan on public.tbl1 (actual rows=5 loops=1)
+         Output: tbl1.col1
+   ->  Append (actual rows=5 loops=5)
+         Loop Min Rows: 2  Max Rows: 6  Total Rows: 23
+         ->  Index Scan using tprt1_idx on public.tprt_1 (actual rows=2 loops=5)
+               Loop Min Rows: 2  Max Rows: 2  Total Rows: 10
+               Output: tprt_1.col1
+               Index Cond: (tprt_1.col1 < tbl1.col1)
+         ->  Index Scan using tprt2_idx on public.tprt_2 (actual rows=3 loops=4)
+               Loop Min Rows: 2  Max Rows: 3  Total Rows: 11
+               Output: tprt_2.col1
+               Index Cond: (tprt_2.col1 < tbl1.col1)
+         ->  Index Scan using tprt3_idx on public.tprt_3 (actual rows=1 loops=2)
+               Loop Min Rows: 1  Max Rows: 1  Total Rows: 2
+               Output: tprt_3.col1
+               Index Cond: (tprt_3.col1 < tbl1.col1)
+         ->  Index Scan using tprt4_idx on public.tprt_4 (never executed)
+               Output: tprt_4.col1
+               Index Cond: (tprt_4.col1 < tbl1.col1)
+         ->  Index Scan using tprt5_idx on public.tprt_5 (never executed)
+               Output: tprt_5.col1
+               Index Cond: (tprt_5.col1 < tbl1.col1)
+         ->  Index Scan using tprt6_idx on public.tprt_6 (never executed)
+               Output: tprt_6.col1
+               Index Cond: (tprt_6.col1 < tbl1.col1)
+(27 rows)
+
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Nested Loop (actual rows=3 loops=1)
+   Output: tbl1.col1, tprt.col1
+   ->  Seq Scan on public.tbl1 (actual rows=5 loops=1)
+         Output: tbl1.col1
+   ->  Append (actual rows=1 loops=5)
+         Loop Min Rows: 0  Max Rows: 1  Total Rows: 3
+         ->  Index Scan using tprt1_idx on public.tprt_1 (never executed)
+               Output: tprt_1.col1
+               Index Cond: (tprt_1.col1 = tbl1.col1)
+         ->  Index Scan using tprt2_idx on public.tprt_2 (actual rows=1 loops=2)
+               Loop Min Rows: 1  Max Rows: 1  Total Rows: 2
+               Output: tprt_2.col1
+               Index Cond: (tprt_2.col1 = tbl1.col1)
+         ->  Index Scan using tprt3_idx on public.tprt_3 (actual rows=0 loops=3)
+               Loop Min Rows: 0  Max Rows: 1  Total Rows: 1
+               Output: tprt_3.col1
+               Index Cond: (tprt_3.col1 = tbl1.col1)
+         ->  Index Scan using tprt4_idx on public.tprt_4 (never executed)
+               Output: tprt_4.col1
+               Index Cond: (tprt_4.col1 = tbl1.col1)
+         ->  Index Scan using tprt5_idx on public.tprt_5 (never executed)
+               Output: tprt_5.col1
+               Index Cond: (tprt_5.col1 = tbl1.col1)
+         ->  Index Scan using tprt6_idx on public.tprt_6 (never executed)
+               Output: tprt_6.col1
+               Index Cond: (tprt_6.col1 = tbl1.col1)
+(26 rows)
+
 -- Last partition
 delete from tbl1;
 insert into tbl1 values (4400);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index d70bd8610cb..68110e20eeb 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -469,6 +469,25 @@ begin
 end;
 $$;
 
+create function explain_verbose_parallel_append(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in
+        execute format('explain (analyze, verbose, costs off, summary off, timing off) %s',
+            $1)
+    loop
+        ln := regexp_replace(ln, 'Workers Launched: \d+', 'Workers Launched: N');
+        ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N');
+        ln := regexp_replace(ln, 'Loop Min Rows: \d+  Max Rows: \d+  Total Rows: \d+',
+                                 'Loop Min Rows: N  Max Rows: N  Total Rows: N');
+        return next ln;
+    end loop;
+end;
+$$;
+
 prepare ab_q4 (int, int) as
 select avg(a) from ab where a between $1 and $2 and b < 4;
 
@@ -528,6 +547,12 @@ insert into lprt_a values(3),(3);
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 3)');
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
 
+-- Tests for extra statistics
+create table lprt_b (b int not null);
+insert into lprt_b select generate_series(1,20);
+select explain_verbose_parallel_append('select * from lprt_a join lprt_b on a != b');
+drop table lprt_b;
+
 delete from lprt_a where a = 1;
 
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
@@ -654,6 +679,13 @@ select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 = tprt.col1
 order by tbl1.col1, tprt.col1;
 
+-- Tests for extra statistics
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
+
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
+
 -- Last partition
 delete from tbl1;
 insert into tbl1 values (4400);

overhead_v1.txttext/plain; name=overhead_v1.txtDownload

Julien Rouhaud

rjuju123@gmail.com

almost 4 years ago

In reply to: Ekaterina Sokolova (#1)

Hi,

On Fri, Apr 01, 2022 at 11:46:47PM +0300, Ekaterina Sokolova wrote:

Most of the comments I have are easy to fix. But I think that the real
problem
is the significant overhead shown by Ekaterina that for now would apply
even if
you don't consume the new stats, for instance if you have
pg_stat_statements.
And I'm still not sure of what is the best way to avoid that.

I took your advice about InstrumentOption. Now INSTRUMENT_EXTRA exists.
So currently it's no overheads during basic load. Operations using
INSTRUMENT_ALL contain overheads (because of INSTRUMENT_EXTRA is a part of
INSTRUMENT_ALL), but they are much less significant than before. I apply new
overhead statistics collected by pgbench with auto _explain enabled.

Can you give a bit more details on your bench scenario? I see contradictory
results, where the patched version with more code is sometimes way faster,
sometimes way slower. If you're using pgbench
default queries (including write queries) I don't think that any of them will
hit the loop code, so it's really a best case scenario. Also write queries
will make tests less stable for no added value wrt. this code.

Ideally you would need a custom scenario with a single read-only query
involving a nested loop or something like that to check how much overhead you
really get when you cumulate those values. I will try to

Why do you need to initialize min_t and min_tuples but not max_t and
max_tuples while both will initially be 0 and possibly updated
afterwards?

We need this initialization for min values so comment about it located above
the block of code with initialization.

Sure, but if we're going to have a branch for nloops == 0, I think it would be
better to avoid redundant / useless instructions, something like:

if (nloops == 0)
{
min_t = totaltime;
min_tuple = tuplecount;
}
else
{
if (min_t...)
...
}

While on that part of the patch, there's an extra new line between max_t and
min_tuple processing.

Greg Stark

stark@mit.edu

almost 4 years ago

In reply to: Julien Rouhaud (#2)

This is not passing regression tests due to some details of the plan
output - marking Waiting on Author:

diff -w -U3 c:/cirrus/src/test/regress/expected/partition_prune.out
c:/cirrus/src/test/recovery/tmp_check/results/partition_prune.out
--- c:/cirrus/src/test/regress/expected/partition_prune.out 2022-04-05
17:00:25.433576100 +0000
+++ c:/cirrus/src/test/recovery/tmp_check/results/partition_prune.out
2022-04-05 17:18:30.092203500 +0000
@@ -2251,10 +2251,7 @@
          Workers Planned: 2
          Workers Launched: N
          ->  Parallel Seq Scan on public.lprt_b (actual rows=N loops=N)
-               Loop Min Rows: N  Max Rows: N  Total Rows: N
                Output: lprt_b.b
-               Worker 0:  actual rows=N loops=N
-               Worker 1:  actual rows=N loops=N
    ->  Materialize (actual rows=N loops=N)
          Loop Min Rows: N  Max Rows: N  Total Rows: N
          Output: lprt_a.a
@@ -2263,10 +2260,8 @@
                Workers Planned: 1
                Workers Launched: N
                ->  Parallel Seq Scan on public.lprt_a (actual rows=N loops=N)
-                     Loop Min Rows: N  Max Rows: N  Total Rows: N
                      Output: lprt_a.a
-                     Worker 0:  actual rows=N loops=N
-(24 rows)
+(19 rows)

drop table lprt_b;
delete from lprt_a where a = 1;

Justin Pryzby

pryzby@telsasoft.com

almost 4 years ago

In reply to: Greg Stark (#3)

On Tue, Apr 05, 2022 at 05:14:09PM -0400, Greg Stark wrote:

This is not passing regression tests due to some details of the plan
output - marking Waiting on Author:

It's unstable due to parallel workers.
I'm not sure what the usual workarounds here.
Maybe set parallel_leader_participation=no for this test.

Show quoted text

diff -w -U3 c:/cirrus/src/test/regress/expected/partition_prune.out
c:/cirrus/src/test/recovery/tmp_check/results/partition_prune.out
--- c:/cirrus/src/test/regress/expected/partition_prune.out 2022-04-05
17:00:25.433576100 +0000
+++ c:/cirrus/src/test/recovery/tmp_check/results/partition_prune.out
2022-04-05 17:18:30.092203500 +0000
@@ -2251,10 +2251,7 @@
Workers Planned: 2
Workers Launched: N
->  Parallel Seq Scan on public.lprt_b (actual rows=N loops=N)
-               Loop Min Rows: N  Max Rows: N  Total Rows: N
Output: lprt_b.b
-               Worker 0:  actual rows=N loops=N
-               Worker 1:  actual rows=N loops=N
->  Materialize (actual rows=N loops=N)
Loop Min Rows: N  Max Rows: N  Total Rows: N
Output: lprt_a.a
@@ -2263,10 +2260,8 @@
Workers Planned: 1
Workers Launched: N
->  Parallel Seq Scan on public.lprt_a (actual rows=N loops=N)
-                     Loop Min Rows: N  Max Rows: N  Total Rows: N
Output: lprt_a.a
-                     Worker 0:  actual rows=N loops=N
-(24 rows)
+(19 rows)

Ekaterina Sokolova

e.sokolova@postgrespro.ru

over 3 years ago

In reply to: Justin Pryzby (#4)

5 attachment(s)

Hi, hackers!

We started discussion about overheads and how to calculate it correctly.

Julien Rouhaud wrote:

Can you give a bit more details on your bench scenario? I see
contradictory
results, where the patched version with more code is sometimes way
faster,
sometimes way slower. If you're using pgbench
default queries (including write queries) I don't think that any of
them will
hit the loop code, so it's really a best case scenario. Also write
queries
will make tests less stable for no added value wrt. this code.

Ideally you would need a custom scenario with a single read-only query
involving a nested loop or something like that to check how much
overhead you
really get when you cumulate those values.

I created 2 custom scenarios. First one contains VERBOSE flag so this
scenario uses extra statistics. Second one doesn't use new feature and
doesn't disable its use (therefore still collect data).
I attach scripts for pgbench to this letter.

Main conclusions are:
1) the use of additional statistics affects no more than 4.5%;
2) data collection affects no more than 1.5%.
I think testing on another machine would be very helpful, so if you get
a chance, I'd be happy if you share your observations.

Some fixes:

Sure, but if we're going to have a branch for nloops == 0, I think it
would be
better to avoid redundant / useless instructions

Right. I done it.

Justin Pryzby wrote:

Maybe set parallel_leader_participation=no for this test.

Thanks for reporting the issue and advice. I set
parallel_leader_participation = off. I hope this helps to solve the
problem of inconsistencies in the outputs.

If you have any comments on this topic or want to share your
impressions, please write to me.
Thank you very much for your contribution to the development of this
patch.

--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

0001-explain.c-refactor-ExplainNode_v3.patchtext/x-diff; name=0001-explain.c-refactor-ExplainNode_v3.patchDownload

From: Justin Pryzby <pryzbyj@telsasoft.com>
Subject: [PATCH 1/2] explain.c-refactor-ExplainNode_v3

---
 src/backend/commands/explain.c | 110 ++++++++++++++-------------------
 1 file changed, 48 insertions(+), 62 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d1f7089daf..bd785c6741d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -118,6 +118,8 @@ static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
+static void show_loop_info(Instrumentation *instrument, bool isworker,
+						   ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 							  bool planning);
@@ -1615,36 +1617,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	if (es->analyze &&
 		planstate->instrument && planstate->instrument->nloops > 0)
-	{
-		double		nloops = planstate->instrument->nloops;
-		double		startup_ms = 1000.0 * planstate->instrument->startup / nloops;
-		double		total_ms = 1000.0 * planstate->instrument->total / nloops;
-		double		rows = planstate->instrument->ntuples / nloops;
-
-		if (es->format == EXPLAIN_FORMAT_TEXT)
-		{
-			if (es->timing)
-				appendStringInfo(es->str,
-								 " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
-								 startup_ms, total_ms, rows, nloops);
-			else
-				appendStringInfo(es->str,
-								 " (actual rows=%.0f loops=%.0f)",
-								 rows, nloops);
-		}
-		else
-		{
-			if (es->timing)
-			{
-				ExplainPropertyFloat("Actual Startup Time", "ms", startup_ms,
-									 3, es);
-				ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
-									 3, es);
-			}
-			ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-			ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-		}
-	}
+		show_loop_info(planstate->instrument, false, es);
 	else if (es->analyze)
 	{
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1673,44 +1646,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		for (int n = 0; n < w->num_workers; n++)
 		{
 			Instrumentation *instrument = &w->instrument[n];
-			double		nloops = instrument->nloops;
-			double		startup_ms;
-			double		total_ms;
-			double		rows;
 
-			if (nloops <= 0)
+			if (instrument->nloops <= 0)
 				continue;
-			startup_ms = 1000.0 * instrument->startup / nloops;
-			total_ms = 1000.0 * instrument->total / nloops;
-			rows = instrument->ntuples / nloops;
 
 			ExplainOpenWorker(n, es);
 
-			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-				ExplainIndentText(es);
-				if (es->timing)
-					appendStringInfo(es->str,
-									 "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
-									 startup_ms, total_ms, rows, nloops);
-				else
-					appendStringInfo(es->str,
-									 "actual rows=%.0f loops=%.0f\n",
-									 rows, nloops);
-			}
-			else
-			{
-				if (es->timing)
-				{
-					ExplainPropertyFloat("Actual Startup Time", "ms",
-										 startup_ms, 3, es);
-					ExplainPropertyFloat("Actual Total Time", "ms",
-										 total_ms, 3, es);
-				}
-				ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-				ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-			}
+			show_loop_info(instrument, true, es);
 
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+				appendStringInfoChar(es->str, '\n');
 			ExplainCloseWorker(n, es);
 		}
 	}
@@ -4099,6 +4044,47 @@ show_modifytable_info(ModifyTableState *mtstate, List *ancestors,
 		ExplainCloseGroup("Target Tables", "Target Tables", false, es);
 }
 
+void
+show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
+{
+	double		nloops = instrument->nloops;
+	double		startup_ms = 1000.0 * instrument->startup / nloops;
+	double		total_ms = 1000.0 * instrument->total / nloops;
+	double		rows = instrument->ntuples / nloops;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (isworker)
+			ExplainIndentText(es);
+		else
+			appendStringInfo(es->str, " (");
+
+		if (es->timing)
+			appendStringInfo(es->str,
+							 "actual time=%.3f..%.3f rows=%.0f loops=%.0f",
+							 startup_ms, total_ms, rows, nloops);
+		else
+			appendStringInfo(es->str,
+							 "actual rows=%.0f loops=%.0f",
+							 rows, nloops);
+
+		if (!isworker)
+			appendStringInfoChar(es->str, ')');
+	}
+	else
+	{
+		if (es->timing)
+		{
+			ExplainPropertyFloat("Actual Startup Time", "ms", startup_ms,
+								 3, es);
+			ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
+								 3, es);
+		}
+		ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+		ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
+	}
+}
+
 /*
  * Explain the constituent plans of an Append, MergeAppend,
  * BitmapAnd, or BitmapOr node.
-- 
2.30.1 (Apple Git-130)

0002-extra_statistics_v8.patchtext/x-diff; name=0002-extra_statistics_v8.patchDownload

From: Ekaterina Sokolova <e.sokolova@postgrespro.ru>
Subject: [PATCH 2/2] Add extra statistics to explain for Nested Loop

For some distributions of data in tables, different loops in nested loop
joins can take different time and process different amounts of entries.
It makes average statistics returned by explain analyze not very useful for DBA.

This patch add collecting of min, max and total statistics for time and rows
across all loops to EXPLAIN ANALYSE. You need to set the VERBOSE flag to display
this information. The patch contains regression tests.

Example of results in TEXT format:
   ->  Append (actual rows=5 loops=5)
         Loop Min Rows: 2  Max Rows: 6  Total Rows: 23

Reviewed-by: Lukas Fittl, Justin Pryzby, Yugo Nagata, Julien Rouhaud, Greg Stark.

---
 src/backend/commands/explain.c                |  40 +++++++
 src/backend/executor/instrument.c             |  40 ++++++-
 src/include/executor/instrument.h             |  10 ++
 src/test/regress/expected/explain.out         |  10 ++
 src/test/regress/expected/partition_prune.out | 105 ++++++++++++++++++
 src/test/regress/sql/partition_prune.sql      |  34 ++++++
 6 files changed, 238 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index bd785c6741d..04e66a67cdb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -541,6 +541,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	if (es->wal)
 		instrument_option |= INSTRUMENT_WAL;
 
+	if (es->verbose)
+		instrument_option |= INSTRUMENT_EXTRA;
+
 	/*
 	 * We always collect timing for the entire statement, even when node-level
 	 * timing is off, so we don't look at es->timing here.  (We could skip
@@ -1624,6 +1627,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			appendStringInfoString(es->str, " (never executed)");
 		else
 		{
+			/* without min and max values because actual result is 0 */
 			if (es->timing)
 			{
 				ExplainPropertyFloat("Actual Startup Time", "ms", 0.0, 3, es);
@@ -4051,6 +4055,11 @@ show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
 	double		startup_ms = 1000.0 * instrument->startup / nloops;
 	double		total_ms = 1000.0 * instrument->total / nloops;
 	double		rows = instrument->ntuples / nloops;
+	double		loop_total_rows = instrument->ntuples;
+	double		loop_min_r = instrument->min_tuples;
+	double		loop_max_r = instrument->max_tuples;
+	double		loop_min_t_ms = 1000.0 * instrument->min_t;
+	double		loop_max_t_ms = 1000.0 * instrument->max_t;
 
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
@@ -4070,6 +4079,21 @@ show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
 
 		if (!isworker)
 			appendStringInfoChar(es->str, ')');
+
+		if (nloops > 1 && es->verbose)
+		{
+			appendStringInfoChar(es->str, '\n');
+			ExplainIndentText(es);
+
+			if (es->timing)
+				appendStringInfo(es->str,
+								 "Loop Min Rows: %.0f  Max Rows: %.0f  Total Rows: %.0f  Min Time: %.3f  Max Time: %.3f",
+								 loop_min_r, loop_max_r, loop_total_rows, loop_min_t_ms, loop_max_t_ms);
+			else
+				appendStringInfo(es->str,
+								 "Loop Min Rows: %.0f  Max Rows: %.0f  Total Rows: %.0f",
+								 loop_min_r, loop_max_r, loop_total_rows);
+		}
 	}
 	else
 	{
@@ -4079,8 +4103,24 @@ show_loop_info(Instrumentation *instrument, bool isworker, ExplainState *es)
 								 3, es);
 			ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
 								 3, es);
+			if (nloops > 1 && es->verbose)
+			{
+				ExplainPropertyFloat("Loop Min Time", "ms", loop_min_t_ms,
+									 3, es);
+				ExplainPropertyFloat("Loop Max Time", "ms", loop_max_t_ms,
+									 3, es);
+			}
 		}
+
 		ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
+
+		if (nloops > 1 && es->verbose)
+		{
+			ExplainPropertyFloat("Loop Min Rows", NULL, loop_min_r, 0, es);
+			ExplainPropertyFloat("Loop Max Rows", NULL, loop_max_r, 0, es);
+			ExplainPropertyFloat("Loop Total Rows", NULL, loop_total_rows, 0, es);
+		}
+
 		ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
 	}
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ceff4727d4a..e64ae77f005 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -34,11 +34,12 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 
 	/* initialize all fields to zeroes, then modify as needed */
 	instr = palloc0(n * sizeof(Instrumentation));
-	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL | INSTRUMENT_EXTRA))
 	{
 		bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
 		bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
 		bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+		bool		need_extra = (instrument_options & INSTRUMENT_EXTRA) != 0;
 		int			i;
 
 		for (i = 0; i < n; i++)
@@ -46,6 +47,7 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 			instr[i].need_bufusage = need_buffers;
 			instr[i].need_walusage = need_wal;
 			instr[i].need_timer = need_timer;
+			instr[i].need_extra = need_extra;
 			instr[i].async_mode = async_mode;
 		}
 	}
@@ -61,6 +63,7 @@ InstrInit(Instrumentation *instr, int instrument_options)
 	instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
 	instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
 	instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+	instr->need_extra = (instrument_options & INSTRUMENT_EXTRA) != 0;
 }
 
 /* Entry to a plan node */
@@ -154,6 +157,37 @@ InstrEndLoop(Instrumentation *instr)
 	instr->startup += instr->firsttuple;
 	instr->total += totaltime;
 	instr->ntuples += instr->tuplecount;
+
+	/*
+	 * this is first loop
+	 *
+	 * We only initialize the min values. We don't need to bother with the
+	 * max, because those are 0 and the non-zero values will get updated a
+	 * couple lines later.
+	 */
+	if (instr->need_extra)
+	{
+		if (instr->nloops == 0)
+		{
+			instr->min_t = totaltime;
+			instr->min_tuples = instr->tuplecount;
+		}
+		else
+		{
+			if (instr->min_t > totaltime)
+				instr->min_t = totaltime;
+
+			if (instr->min_tuples > instr->tuplecount)
+				instr->min_tuples = instr->tuplecount;
+		}
+
+		if (instr->max_t < totaltime)
+			instr->max_t = totaltime;
+
+		if (instr->max_tuples < instr->tuplecount)
+			instr->max_tuples = instr->tuplecount;
+	}
+
 	instr->nloops += 1;
 
 	/* Reset for next cycle (if any) */
@@ -186,6 +220,10 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 	dst->nloops += add->nloops;
 	dst->nfiltered1 += add->nfiltered1;
 	dst->nfiltered2 += add->nfiltered2;
+	dst->min_t = Min(dst->min_t, add->min_t);
+	dst->max_t = Max(dst->max_t, add->max_t);
+	dst->min_tuples = Min(dst->min_tuples, add->min_tuples);
+	dst->max_tuples = Max(dst->max_tuples, add->max_tuples);
 
 	/* Add delta of buffer usage since entry to node's totals */
 	if (dst->need_bufusage)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 2945cce3a97..1d7a5f9c177 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -60,6 +60,8 @@ typedef enum InstrumentOption
 	INSTRUMENT_BUFFERS = 1 << 1,	/* needs buffer usage */
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
+	INSTRUMENT_EXTRA = 1 << 4,	/* needs counters for min,
+								 * max and total values */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
@@ -69,6 +71,8 @@ typedef struct Instrumentation
 	bool		need_timer;		/* true if we need timer data */
 	bool		need_bufusage;	/* true if we need buffer usage data */
 	bool		need_walusage;	/* true if we need WAL usage data */
+	bool		need_extra;		/* true if we need min, max and total
+								 * statistics for loops */
 	bool		async_mode;		/* true if node is in async mode */
 	/* Info about current plan cycle: */
 	bool		running;		/* true if we've completed first tuple */
@@ -81,7 +85,13 @@ typedef struct Instrumentation
 	/* Accumulated statistics across all completed cycles: */
 	double		startup;		/* total startup time (in seconds) */
 	double		total;			/* total time (in seconds) */
+	double		min_t;			/* time of fastest loop (in seconds) */
+	double		max_t;			/* time of slowest loop (in seconds) */
 	double		ntuples;		/* total tuples produced */
+	double		min_tuples;		/* min counter of produced tuples for all
+								 * loops */
+	double		max_tuples;		/* max counter of produced tuples for all
+								 * loops */
 	double		ntuples2;		/* secondary node-specific tuple counter */
 	double		nloops;			/* # of run cycles for this node */
 	double		nfiltered1;		/* # of tuples removed by scanqual or joinqual */
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 48620edbc2b..957771be2ba 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -369,8 +369,13 @@ select jsonb_pretty(
                              "Actual Loops": 0,             +
                              "Startup Cost": 0.0,           +
                              "Async Capable": false,        +
+                             "Loop Max Rows": 0,            +
+                             "Loop Max Time": 0.0,          +
+                             "Loop Min Rows": 0,            +
+                             "Loop Min Time": 0.0,          +
                              "Relation Name": "tenk1",      +
                              "Parallel Aware": true,        +
+                             "Loop Total Rows": 0,          +
                              "Local Hit Blocks": 0,         +
                              "Temp Read Blocks": 0,         +
                              "Actual Total Time": 0.0,      +
@@ -415,7 +420,12 @@ select jsonb_pretty(
                      "Actual Loops": 0,                     +
                      "Startup Cost": 0.0,                   +
                      "Async Capable": false,                +
+                     "Loop Max Rows": 0,                    +
+                     "Loop Max Time": 0.0,                  +
+                     "Loop Min Rows": 0,                    +
+                     "Loop Min Time": 0.0,                  +
                      "Parallel Aware": false,               +
+                     "Loop Total Rows": 0,                  +
                      "Sort Space Used": 0,                  +
                      "Local Hit Blocks": 0,                 +
                      "Temp Read Blocks": 0,                 +
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 7555764c779..d8afab5025d 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -1962,6 +1962,23 @@ begin
     end loop;
 end;
 $$;
+create function explain_verbose_parallel_append(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in
+        execute format('explain (analyze, verbose, costs off, summary off, timing off) %s',
+            $1)
+    loop
+        ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N');
+        ln := regexp_replace(ln, 'Loop Min Rows: \d+  Max Rows: \d+  Total Rows: \d+',
+                                 'Loop Min Rows: N  Max Rows: N  Total Rows: N');
+        return next ln;
+    end loop;
+end;
+$$;
 prepare ab_q4 (int, int) as
 select avg(a) from ab where a between $1 and $2 and b < 4;
 -- Encourage use of parallel plans
@@ -2218,7 +2235,29 @@ select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on
                                  Index Cond: (a = a.a)
 (28 rows)
 
+-- Tests for extra statistics
+set parallel_leader_participation = off;
+create table lprt_b (b int not null);
+insert into lprt_b select generate_series(1,20);
+select explain_verbose_parallel_append('select * from lprt_a join lprt_b on a != b');
+                explain_verbose_parallel_append                
+---------------------------------------------------------------
+ Nested Loop (actual rows=N loops=N)
+   Output: lprt_a.a, lprt_b.b
+   Join Filter: (lprt_a.a <> lprt_b.b)
+   Rows Removed by Join Filter: 4
+   ->  Seq Scan on public.lprt_b (actual rows=N loops=N)
+         Output: lprt_b.b
+   ->  Materialize (actual rows=N loops=N)
+         Loop Min Rows: N  Max Rows: N  Total Rows: N
+         Output: lprt_a.a
+         ->  Seq Scan on public.lprt_a (actual rows=N loops=N)
+               Output: lprt_a.a
+(11 rows)
+
+drop table lprt_b;
 delete from lprt_a where a = 1;
+reset parallel_leader_participation;
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
                                      explain_parallel_append                                     
 -------------------------------------------------------------------------------------------------
@@ -2715,6 +2754,72 @@ order by tbl1.col1, tprt.col1;
  1001 | 1001
 (3 rows)
 
+-- Tests for extra statistics
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Nested Loop (actual rows=23 loops=1)
+   Output: tbl1.col1, tprt.col1
+   ->  Seq Scan on public.tbl1 (actual rows=5 loops=1)
+         Output: tbl1.col1
+   ->  Append (actual rows=5 loops=5)
+         Loop Min Rows: 2  Max Rows: 6  Total Rows: 23
+         ->  Index Scan using tprt1_idx on public.tprt_1 (actual rows=2 loops=5)
+               Loop Min Rows: 2  Max Rows: 2  Total Rows: 10
+               Output: tprt_1.col1
+               Index Cond: (tprt_1.col1 < tbl1.col1)
+         ->  Index Scan using tprt2_idx on public.tprt_2 (actual rows=3 loops=4)
+               Loop Min Rows: 2  Max Rows: 3  Total Rows: 11
+               Output: tprt_2.col1
+               Index Cond: (tprt_2.col1 < tbl1.col1)
+         ->  Index Scan using tprt3_idx on public.tprt_3 (actual rows=1 loops=2)
+               Loop Min Rows: 1  Max Rows: 1  Total Rows: 2
+               Output: tprt_3.col1
+               Index Cond: (tprt_3.col1 < tbl1.col1)
+         ->  Index Scan using tprt4_idx on public.tprt_4 (never executed)
+               Output: tprt_4.col1
+               Index Cond: (tprt_4.col1 < tbl1.col1)
+         ->  Index Scan using tprt5_idx on public.tprt_5 (never executed)
+               Output: tprt_5.col1
+               Index Cond: (tprt_5.col1 < tbl1.col1)
+         ->  Index Scan using tprt6_idx on public.tprt_6 (never executed)
+               Output: tprt_6.col1
+               Index Cond: (tprt_6.col1 < tbl1.col1)
+(27 rows)
+
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
+                                   QUERY PLAN                                    
+---------------------------------------------------------------------------------
+ Nested Loop (actual rows=3 loops=1)
+   Output: tbl1.col1, tprt.col1
+   ->  Seq Scan on public.tbl1 (actual rows=5 loops=1)
+         Output: tbl1.col1
+   ->  Append (actual rows=1 loops=5)
+         Loop Min Rows: 0  Max Rows: 1  Total Rows: 3
+         ->  Index Scan using tprt1_idx on public.tprt_1 (never executed)
+               Output: tprt_1.col1
+               Index Cond: (tprt_1.col1 = tbl1.col1)
+         ->  Index Scan using tprt2_idx on public.tprt_2 (actual rows=1 loops=2)
+               Loop Min Rows: 1  Max Rows: 1  Total Rows: 2
+               Output: tprt_2.col1
+               Index Cond: (tprt_2.col1 = tbl1.col1)
+         ->  Index Scan using tprt3_idx on public.tprt_3 (actual rows=0 loops=3)
+               Loop Min Rows: 0  Max Rows: 1  Total Rows: 1
+               Output: tprt_3.col1
+               Index Cond: (tprt_3.col1 = tbl1.col1)
+         ->  Index Scan using tprt4_idx on public.tprt_4 (never executed)
+               Output: tprt_4.col1
+               Index Cond: (tprt_4.col1 = tbl1.col1)
+         ->  Index Scan using tprt5_idx on public.tprt_5 (never executed)
+               Output: tprt_5.col1
+               Index Cond: (tprt_5.col1 = tbl1.col1)
+         ->  Index Scan using tprt6_idx on public.tprt_6 (never executed)
+               Output: tprt_6.col1
+               Index Cond: (tprt_6.col1 = tbl1.col1)
+(26 rows)
+
 -- Last partition
 delete from tbl1;
 insert into tbl1 values (4400);
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index d70bd8610cb..2f7927c7ec1 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -469,6 +469,24 @@ begin
 end;
 $$;
 
+create function explain_verbose_parallel_append(text) returns setof text
+language plpgsql as
+$$
+declare
+    ln text;
+begin
+    for ln in
+        execute format('explain (analyze, verbose, costs off, summary off, timing off) %s',
+            $1)
+    loop
+        ln := regexp_replace(ln, 'actual rows=\d+ loops=\d+', 'actual rows=N loops=N');
+        ln := regexp_replace(ln, 'Loop Min Rows: \d+  Max Rows: \d+  Total Rows: \d+',
+                                 'Loop Min Rows: N  Max Rows: N  Total Rows: N');
+        return next ln;
+    end loop;
+end;
+$$;
+
 prepare ab_q4 (int, int) as
 select avg(a) from ab where a between $1 and $2 and b < 4;
 
@@ -528,8 +546,17 @@ insert into lprt_a values(3),(3);
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 3)');
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
 
+-- Tests for extra statistics
+set parallel_leader_participation = off;
+create table lprt_b (b int not null);
+insert into lprt_b select generate_series(1,20);
+select explain_verbose_parallel_append('select * from lprt_a join lprt_b on a != b');
+drop table lprt_b;
+
 delete from lprt_a where a = 1;
 
+reset parallel_leader_participation;
+
 select explain_parallel_append('select avg(ab.a) from ab inner join lprt_a a on ab.a = a.a where a.a in(1, 0, 0)');
 
 reset enable_hashjoin;
@@ -654,6 +681,13 @@ select tbl1.col1, tprt.col1 from tbl1
 inner join tprt on tbl1.col1 = tprt.col1
 order by tbl1.col1, tprt.col1;
 
+-- Tests for extra statistics
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 > tprt.col1;
+
+explain (analyze, verbose, costs off, summary off, timing off)
+select * from tbl1 inner join tprt on tbl1.col1 = tprt.col1;
+
 -- Last partition
 delete from tbl1;
 insert into tbl1 values (4400);
-- 
2.30.1 (Apple Git-130)

pgbench_looptext/plain; name=pgbench_loopDownload

pgbench_loop_without_verbosetext/plain; name=pgbench_loop_without_verboseDownload

overhead_v3.txttext/plain; name=overhead_v3.txtDownload

Julien Rouhaud

rjuju123@gmail.com

over 3 years ago

In reply to: Ekaterina Sokolova (#5)

Hi,

On Fri, Jun 24, 2022 at 08:16:06PM +0300, Ekaterina Sokolova wrote:

We started discussion about overheads and how to calculate it correctly.

Julien Rouhaud wrote:

Can you give a bit more details on your bench scenario?
[...]
Ideally you would need a custom scenario with a single read-only query
involving a nested loop or something like that to check how much
overhead you
really get when you cumulate those values.

I created 2 custom scenarios. First one contains VERBOSE flag so this
scenario uses extra statistics. Second one doesn't use new feature and
doesn't disable its use (therefore still collect data).
I attach scripts for pgbench to this letter.

I don't think that this scenario is really representative for the problem I was
mentioning as you're only testing the overhead using the EXPLAIN (ANALYZE)
command, which doesn't say much about normal query execution.

I did a simple benchmark using a scale 50 pgbench on a pg_stat_statements
enabled instance, and the following scenario:

SET enable_mergejoin = off;
SELECT count(*) FROM pgbench_accounts JOIN pgbench_tellers on aid = tid;

(which forces a nested loop) and compared the result from this patch and fixing
pg_stat_statements to not request INSTRUMENT extra, something like:

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 049da9fe6d..9a2177e438 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -985,7 +985,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
            MemoryContext oldcxt;

            oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
-           queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+           queryDesc->totaltime = InstrAlloc(1, (INSTRUMENT_ALL & ~INSTRUMENT_EXTRA), false);
            MemoryContextSwitchTo(oldcxt);
        }
    }

It turns out that having pg_stat_statements with INSTRUMENT_EXTRA indirectly
requested by INSTRUMENT_ALL adds a ~27% overhead.

I'm not sure that I actually believe these results, but they're really
consistent, so maybe that's real.

Anyway, even if the overheadwas only 1.5% like in your own benchmark, that
still wouldn't be acceptable. Such a feature is in my opinion very welcome,
but it shouldn't add *any* overhead outside of EXPLAIN (ANALYZE, VERBOSE).

Note that this was done using a "production build" (so with -02, without assert
and such). Doing the same on a debug build (and a scale 20 pgbench), the
overhead is about 1.75%, which is closer to your result. What was the
configure option you used for your benchmark?

Also, I don't think it's not acceptable to ask every single extension that
currently relies on INSTRUMENT_ALL to be patched and drop some random
INSTRUMENT_XXX flags to avoid this overhead. So as I mentioned previously, I
think we should keep INSTRUMENT_ALL to mean something like "all instrumentation
that gives metrics at the statement level", and have INSTRUMENT_EXTRA be
outside of INSTRUMENT_ALL. Maybe this new category should have a global flag
to request all of them, and maybe there should be some additional alias to grab
all categories.

While at it, INSTRUMENT_EXTRA doesn't really seem like a nice name either since
there's no guarantee that the next time someone adds a new instrument option
for per-node information, she will want to combine it with this one. Maybe
INSTRUMENT_MINMAX_LOOPS or something like that?

Julien Rouhaud

rjuju123@gmail.com

over 3 years ago

In reply to: Julien Rouhaud (#6)

On Sat, Jul 30, 2022 at 08:54:33PM +0800, Julien Rouhaud wrote:

It turns out that having pg_stat_statements with INSTRUMENT_EXTRA indirectly
requested by INSTRUMENT_ALL adds a ~27% overhead.

I'm not sure that I actually believe these results, but they're really
consistent, so maybe that's real.

Anyway, even if the overheadwas only 1.5% like in your own benchmark, that
still wouldn't be acceptable. Such a feature is in my opinion very welcome,
but it shouldn't add *any* overhead outside of EXPLAIN (ANALYZE, VERBOSE).

I did the same benchmark this morning, although trying to stop all background
jobs and things on my machine that could interfere with the results, using
longer runs and more runs and I now get a reproducible ~1% overhead, which is
way more believable. Not sure what happened yesterday as I got reproducible
number doing the same benchmark twice, I guess that the fun doing performance
tests on a development machine.

Anyway, 1% is in my opinion still too much overhead for extensions that won't
get any extra information.

Andrey Lepikhov

a.lepikhov@postgrespro.ru

over 2 years ago

In reply to: Julien Rouhaud (#7)

On 31/7/2022 10:49, Julien Rouhaud wrote:

On Sat, Jul 30, 2022 at 08:54:33PM +0800, Julien Rouhaud wrote:
Anyway, 1% is in my opinion still too much overhead for extensions that won't
get any extra information.

I have read all the thread and still can't understand something. What
valuable data can I find with these extra statistics if no one
parameterized node in the plan exists?
Also, thinking about min/max time in the explain, I guess it would be
necessary in rare cases. Usually, the execution time will correlate to
the number of tuples scanned, won't it? So, maybe skip the time
boundaries in the instrument structure?
In my experience, it is enough to know the total number of tuples
bubbled up from a parameterized node to decide further optimizations.
Maybe simplify this feature up to the one total_rows field in the case
of nloops > 1 and in the presence of parameters?
And at the end. If someone wants a lot of additional statistics, why not
give them that by extension? It is only needed to add a hook into the
point of the node explanation and some efforts to make instrumentation
extensible. But here, honestly, I don't have code/ideas so far.

--
regards,
Andrey Lepikhov
Postgres Professional