Custom explain options

Started by Konstantin Knizhnikabout 2 years ago24 messages

knizhnik@garret.ru

about 2 years ago

1 attachment(s)

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching layer
above shared buffers which is used to somehow replace file system cache
in standalone Postgres).

I think that it will be nice to have a generic mechanism which allows
extensions to add its own options to EXPLAIN.
I have attached the patch with implementation of such mechanism (also
available as PR: https://github.com/knizhnik/postgres/pull/1 )

I have demonstrated this mechanism using Bloom extension - just to
report number of Bloom matches.
Not sure that it is really useful information but it is used mostly as
example:

explain (analyze,bloom) select * from t where pk=2000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=15348.00..15352.01 rows=1 width=4) (actual time=25.244..25.939 rows=1 loops=1)
Recheck Cond: (pk = 2000)
Rows Removed by Index Recheck: 292
Heap Blocks: exact=283
Bloom: matches=293
-> Bitmap Index Scan on t_pk_idx (cost=0.00..15348.00 rows=1 width=0) (actual time=25.147..25.147 rows=293 loops=1)
Index Cond: (pk = 2000)
Bloom: matches=293
Planning:
Bloom: matches=0
Planning Time: 0.387 ms
Execution Time: 26.053 ms
(12 rows)

There are two known issues with this proposal:

1. I have to limit total size of all custom metrics - right now it is
limited by 128 bytes. It is done to keep|Instrumentation|and some other
data structures fixes size. Otherwise maintaining varying parts of this
structure is ugly, especially in shared memory

2. Custom extension is added by means
of|RegisterCustomInsrumentation|function which is called from|_PG_init|
But|_PG_init|is called when extension is loaded and it is loaded on
demand when some of extension functions is called (except when extension
is included
in shared_preload_libraries list), Bloom extension doesn't require it.
So if your first statement executed in your session is:

explain (analyze,bloom) select * from t where pk=2000;

...you will get error:

ERROR: unrecognized EXPLAIN option "bloom"
LINE 1: explain (analyze,bloom) select * from t where pk=2000;

It happens because at the moment when explain statement parses options,
Bloom index is not yet selected and so bloom extension is not loaded
and|RegisterCustomInsrumentation|is not yet called. If we repeat the
query, then proper result will be displayed (see above).

Attachments:

custom_explain_options.patchtext/plain; charset=UTF-8; name=custom_explain_options.patchDownload

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 330811ec60..af23ffe821 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -174,6 +174,13 @@ typedef struct BloomScanOpaqueData
 
 typedef BloomScanOpaqueData *BloomScanOpaque;
 
+typedef struct
+{
+	uint64 matches;
+} BloomUsage;
+
+extern BloomUsage bloomUsage;
+
 /* blutils.c */
 extern void initBloomState(BloomState *state, Relation index);
 extern void BloomFillMetapage(Relation index, Page metaPage);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 61d1f66b38..8890951943 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -166,6 +166,6 @@ blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 		CHECK_FOR_INTERRUPTS();
 	}
 	FreeAccessStrategy(bas);
-
+	bloomUsage.matches += ntids;
 	return ntids;
 }
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index f23fbb1d9e..d795875920 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -18,6 +18,7 @@
 #include "access/reloptions.h"
 #include "bloom.h"
 #include "catalog/index.h"
+#include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -34,6 +35,55 @@
 
 PG_FUNCTION_INFO_V1(blhandler);
 
+BloomUsage bloomUsage;
+
+static void
+bloomUsageAdd(CustomResourceUsage* dst, CustomResourceUsage const* add)
+{
+	((BloomUsage*)dst)->matches += ((BloomUsage*)add)->matches;
+}
+
+static void
+bloomUsageAccum(CustomResourceUsage* acc, CustomResourceUsage const* end, CustomResourceUsage const* start)
+{
+	((BloomUsage*)acc)->matches += ((BloomUsage*)end)->matches - ((BloomUsage*)start)->matches;;
+}
+
+static void
+bloomUsageShow(ExplainState* es, CustomResourceUsage const* usage, bool planning)
+{
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (planning)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Planning:\n");
+			es->indent++;
+		}
+		ExplainIndentText(es);
+		appendStringInfoString(es->str, "Bloom:");
+		appendStringInfo(es->str, " matches=%lld",
+						 (long long) ((BloomUsage*)usage)->matches);
+		appendStringInfoChar(es->str, '\n');
+		if (planning)
+			es->indent--;
+	}
+	else
+	{
+		ExplainPropertyInteger("Bloom Matches", NULL,
+							   ((BloomUsage*)usage)->matches, es);
+	}
+}
+
+static CustomInstrumentation bloomInstr = {
+	"bloom",
+	sizeof(BloomUsage),
+	&bloomUsage,
+	bloomUsageAdd,
+	bloomUsageAccum,
+	bloomUsageShow
+};
+
 /* Kind of relation options for bloom index */
 static relopt_kind bl_relopt_kind;
 
@@ -78,6 +128,7 @@ _PG_init(void)
 		bl_relopt_tab[i + 1].opttype = RELOPT_TYPE_INT;
 		bl_relopt_tab[i + 1].offset = offsetof(BloomOptions, bitSize[0]) + sizeof(int) * i;
 	}
+	RegisterCustomInsrumentation(&bloomInstr);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce41..a4ab0d6ba0 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -71,6 +71,7 @@
 #define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xA000000000000004)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xA000000000000005)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_CUST_USAGE		    UINT64CONST(0xA000000000000007)
 
 /*
  * DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -197,6 +198,7 @@ typedef struct BTLeader
 	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char       *custusage;
 } BTLeader;
 
 /*
@@ -1467,6 +1469,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BTLeader   *btleader = (BTLeader *) palloc0(sizeof(BTLeader));
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char        *custusage;
 	bool		leaderparticipates = true;
 	int			querylen;
 
@@ -1532,6 +1535,9 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
 	if (debug_query_string)
@@ -1625,6 +1631,10 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	bufferusage = shm_toc_allocate(pcxt->toc,
 								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+	custusage = shm_toc_allocate(pcxt->toc,
+								 mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUST_USAGE, custusage);
+
 
 	/* Launch workers, saving status for leader/caller */
 	LaunchParallelWorkers(pcxt);
@@ -1638,6 +1648,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btleader->snapshot = snapshot;
 	btleader->walusage = walusage;
 	btleader->bufferusage = bufferusage;
+	btleader->custusage = custusage;
 
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
@@ -1676,7 +1687,7 @@ _bt_end_parallel(BTLeader *btleader)
 	 * or we might get incomplete data.)
 	 */
 	for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
-		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i], btleader->custusage + pgCustUsageSize*i);
 
 	/* Free last reference to MVCC snapshot, if one was used */
 	if (IsMVCCSnapshot(btleader->snapshot))
@@ -1811,6 +1822,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	LOCKMODE	indexLockmode;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char       *custusage;
 	int			sortmem;
 
 #ifdef BTREE_BUILD_STATS
@@ -1891,8 +1903,10 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	custusage = shm_toc_lookup(toc, PARALLEL_KEY_CUST_USAGE, false);
 	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
-						  &walusage[ParallelWorkerNumber]);
+						  &walusage[ParallelWorkerNumber],
+						  custusage + ParallelWorkerNumber*pgCustUsageSize);
 
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 13217807ee..3955590cb9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -119,6 +119,8 @@ static void show_instrumentation_count(const char *qlabel, int which,
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
+static void show_custom_usage(ExplainState *es, const char* usage,
+							  bool planning);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 							  bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
@@ -149,7 +151,6 @@ static void ExplainRestoreGroup(ExplainState *es, int depth, int *state_save);
 static void ExplainDummyGroup(const char *objtype, const char *labelname,
 							  ExplainState *es);
 static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
-static void ExplainIndentText(ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
@@ -170,6 +171,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 	Query	   *query;
 	List	   *rewritten;
 	ListCell   *lc;
+	ListCell   *c;
 	bool		timing_set = false;
 	bool		summary_set = false;
 
@@ -222,11 +224,28 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 						 parser_errposition(pstate, opt->location)));
 		}
 		else
-			ereport(ERROR,
+		{
+			bool found = false;
+			foreach (c, pgCustInstr)
+			{
+				CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(c);
+				if (strcmp(opt->defname, ci->name) == 0)
+				{
+					ci->selected = true;
+					es->custom = true;
+					found = true;
+					break;
+				}
+			}
+			if (!found)
+			{
+				ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
 					 errmsg("unrecognized EXPLAIN option \"%s\"",
 							opt->defname),
 					 parser_errposition(pstate, opt->location)));
+			}
+		}
 	}
 
 	/* check that WAL is used with EXPLAIN ANALYZE */
@@ -320,12 +339,19 @@ ExplainState *
 NewExplainState(void)
 {
 	ExplainState *es = (ExplainState *) palloc0(sizeof(ExplainState));
+	ListCell* lc;
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
+	/* Reset custom instrumentations selection flag */
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->selected = false;
+	}
 	return es;
 }
 
@@ -397,9 +423,14 @@ ExplainOneQuery(Query *query, int cursorOptions,
 					planduration;
 		BufferUsage bufusage_start,
 					bufusage;
+		CustomInstrumentationData custusage_start, custusage;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
+
+		if (es->custom)
+			GetCustomInstrumentationState(custusage_start.data);
+
 		INSTR_TIME_SET_CURRENT(planstart);
 
 		/* plan the query */
@@ -415,9 +446,14 @@ ExplainOneQuery(Query *query, int cursorOptions,
 			BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 		}
 
+		if (es->custom)
+			AccumulateCustomInstrumentationState(custusage.data, custusage_start.data);
+
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-					   &planduration, (es->buffers ? &bufusage : NULL));
+					   &planduration,
+					   (es->buffers ? &bufusage : NULL),
+					   (es->custom ? &custusage : NULL));
 	}
 }
 
@@ -527,7 +563,8 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage)
+			   const BufferUsage *bufusage,
+			   const CustomInstrumentationData *custusage)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -623,6 +660,13 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		ExplainCloseGroup("Planning", "Planning", true, es);
 	}
 
+	if (custusage)
+	{
+		ExplainOpenGroup("Planning", "Planning", true, es);
+		show_custom_usage(es, custusage->data, true);
+		ExplainCloseGroup("Planning", "Planning", true, es);
+	}
+
 	if (es->summary && planduration)
 	{
 		double		plantime = INSTR_TIME_GET_DOUBLE(*planduration);
@@ -2110,8 +2154,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->walusage);
 
+	/* Show custom instrumentation */
+	if (es->custom && planstate->instrument)
+		show_custom_usage(es, planstate->instrument->cust_usage.data, false);
+
 	/* Prepare per-worker buffer/WAL usage */
-	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
+	if (es->workers_state && (es->buffers || es->wal || es->custom) && es->verbose)
 	{
 		WorkerInstrumentation *w = planstate->worker_instrument;
 
@@ -2128,6 +2176,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_buffer_usage(es, &instrument->bufusage, false);
 			if (es->wal)
 				show_wal_usage(es, &instrument->walusage);
+			if (es->custom)
+				show_custom_usage(es, instrument->cust_usage.data, false);
 			ExplainCloseWorker(n, es);
 		}
 	}
@@ -3544,6 +3594,23 @@ explain_get_index_name(Oid indexId)
 	return result;
 }
 
+/*
+ * Show select custom usage details
+ */
+static void
+show_custom_usage(ExplainState *es, const char* usage, bool planning)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		if (ci->selected)
+			ci->show(es, usage, planning);
+		usage += ci->size;
+	}
+}
+
 /*
  * Show buffer usage details.
  */
@@ -4995,7 +5062,7 @@ ExplainXMLTag(const char *tagname, int flags, ExplainState *es)
  * data for a parallel worker there might already be data on the current line
  * (cf. ExplainOpenWorker); in that case, don't indent any more.
  */
-static void
+void
 ExplainIndentText(ExplainState *es)
 {
 	Assert(es->format == EXPLAIN_FORMAT_TEXT);
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 18f70319fc..7999980699 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,9 +583,13 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	instr_time	planduration;
 	BufferUsage bufusage_start,
 				bufusage;
+	CustomInstrumentationData custusage_start, custusage;
 
 	if (es->buffers)
 		bufusage_start = pgBufferUsage;
+	if (es->custom)
+		GetCustomInstrumentationState(custusage_start.data);
+
 	INSTR_TIME_SET_CURRENT(planstart);
 
 	/* Look it up in the hash table */
@@ -630,6 +634,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		memset(&bufusage, 0, sizeof(BufferUsage));
 		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 	}
+	if (es->custom)
+		AccumulateCustomInstrumentationState(custusage.data, custusage_start.data);
 
 	plan_list = cplan->stmt_list;
 
@@ -640,7 +646,9 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 
 		if (pstmt->commandType != CMD_UTILITY)
 			ExplainOnePlan(pstmt, into, es, query_string, paramLI, queryEnv,
-						   &planduration, (es->buffers ? &bufusage : NULL));
+						   &planduration,
+						   (es->buffers ? &bufusage : NULL),
+						   (es->custom ? &custusage : NULL));
 		else
 			ExplainOneUtility(pstmt->utilityStmt, into, es, query_string,
 							  paramLI, queryEnv);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 351ab4957a..5b7a081f78 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -50,6 +50,7 @@
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
 #define PARALLEL_VACUUM_KEY_INDEX_STATS		6
+#define PARALLEL_VACUUM_KEY_CUSTOM_USAGE	7
 
 /*
  * Shared information among parallel workers.  So this is allocated in the DSM
@@ -184,6 +185,9 @@ struct ParallelVacuumState
 	/* Points to WAL usage area in DSM */
 	WalUsage   *wal_usage;
 
+	/* Points to custom usage area in DSM */
+	char *custom_usage;
+
 	/*
 	 * False if the index is totally unsuitable target for all parallel
 	 * processing. For example, the index could be <
@@ -242,6 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 	PVIndStats *indstats;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char       *custom_usage;
 	bool	   *will_parallel_vacuum;
 	Size		est_indstats_len;
 	Size		est_shared_len;
@@ -313,6 +318,9 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 	/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
 	if (debug_query_string)
@@ -403,6 +411,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 								 mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
 	pvs->wal_usage = wal_usage;
+    custom_usage = shm_toc_allocate(pcxt->toc,
+									mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_CUSTOM_USAGE, custom_usage);
+	pvs->custom_usage = custom_usage;
 
 	/* Store query string for workers */
 	if (debug_query_string)
@@ -706,7 +718,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
 		WaitForParallelWorkersToFinish(pvs->pcxt);
 
 		for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
-			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i], pvs->custom_usage + pgCustUsageSize*i);
 	}
 
 	/*
@@ -964,6 +976,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacDeadItems *dead_items;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char       *custom_usage;
 	int			nindexes;
 	char	   *sharedquery;
 	ErrorContextCallback errcallback;
@@ -1053,8 +1066,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
+	custom_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_CUSTOM_USAGE, false);
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
-						  &wal_usage[ParallelWorkerNumber]);
+						  &wal_usage[ParallelWorkerNumber],
+						  custom_usage + pgCustUsageSize*ParallelWorkerNumber);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index cc2b8ccab7..39d9754a0f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -66,6 +66,7 @@
 #define PARALLEL_KEY_QUERY_TEXT		UINT64CONST(0xE000000000000008)
 #define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_CUSTOM_USAGE		UINT64CONST(0xE00000000000000B)
 
 #define PARALLEL_TUPLE_QUEUE_SIZE		65536
 
@@ -600,6 +601,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	char	   *paramlistinfo_space;
 	BufferUsage *bufusage_space;
 	WalUsage   *walusage_space;
+	char       *customusage_space;
 	SharedExecutorInstrumentation *instrumentation = NULL;
 	SharedJitInstrumentation *jit_instrumentation = NULL;
 	int			pstmt_len;
@@ -680,6 +682,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/*
+	 * Same thing for CustomUsage.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 	/* Estimate space for tuple queues. */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -768,6 +777,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
 	pei->wal_usage = walusage_space;
 
+	customusage_space = shm_toc_allocate(pcxt->toc,
+										 mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUSTOM_USAGE, customusage_space);
+	pei->custom_usage = customusage_space;
+
 	/* Set up the tuple queues that the workers will write into. */
 	pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
 
@@ -1164,7 +1178,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 	 * finish, or we might get incomplete data.)
 	 */
 	for (i = 0; i < nworkers; i++)
-		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i], pei->custom_usage + pgCustUsageSize*i);
 
 	pei->finished = true;
 }
@@ -1397,6 +1411,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	FixedParallelExecutorState *fpes;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char *custom_usage; 
 	DestReceiver *receiver;
 	QueryDesc  *queryDesc;
 	SharedExecutorInstrumentation *instrumentation;
@@ -1472,8 +1487,10 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	custom_usage = shm_toc_lookup(toc, PARALLEL_KEY_CUSTOM_USAGE, false);
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
-						  &wal_usage[ParallelWorkerNumber]);
+						  &wal_usage[ParallelWorkerNumber],
+						  custom_usage + ParallelWorkerNumber*pgCustUsageSize);
 
 	/* Report instrumentation data if any instrumentation options are set. */
 	if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d..d883c99874 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,15 +16,62 @@
 #include <unistd.h>
 
 #include "executor/instrument.h"
+#include "utils/memutils.h"
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
 static WalUsage save_pgWalUsage;
 
+List* pgCustInstr; /* description of custom instriumentations */
+Size pgCustUsageSize;
+static CustomInstrumentationData save_pgCustUsage; /* saved custom instrumentation state */
+
+
 static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
+void
+RegisterCustomInsrumentation(CustomInstrumentation* inst)
+{
+	MemoryContext oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+	pgCustInstr = lappend(pgCustInstr, inst);
+	pgCustUsageSize += inst->size;
+	MemoryContextSwitchTo(oldcontext);
+	if (pgCustUsageSize > MAX_CUSTOM_INSTR_SIZE)
+		elog(ERROR, "Total size of custom instrumentations exceed limit %d",  MAX_CUSTOM_INSTR_SIZE);
+}
+
+void
+GetCustomInstrumentationState(char* dst)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		memcpy(dst, ci->usage, ci->size);
+		dst += ci->size;
+	}
+}
+
+void
+AccumulateCustomInstrumentationState(char* dst, char const* before)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		if (ci->selected)
+		{
+			memset(dst, 0, ci->size);
+			ci->accum(dst, ci->usage, before);
+		}
+		dst += ci->size;
+		before += ci->size;
+	}
+}
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -49,7 +96,6 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 			instr[i].async_mode = async_mode;
 		}
 	}
-
 	return instr;
 }
 
@@ -67,6 +113,8 @@ InstrInit(Instrumentation *instr, int instrument_options)
 void
 InstrStartNode(Instrumentation *instr)
 {
+	ListCell *lc;
+	char* cust_start = instr->cust_usage_start.data;
 	if (instr->need_timer &&
 		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
 		elog(ERROR, "InstrStartNode called twice in a row");
@@ -77,6 +125,13 @@ InstrStartNode(Instrumentation *instr)
 
 	if (instr->need_walusage)
 		instr->walusage_start = pgWalUsage;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		memcpy(cust_start, ci->usage, ci->size);
+		cust_start += ci->size;
+	}
 }
 
 /* Exit from a plan node */
@@ -85,6 +140,9 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 {
 	double		save_tuplecount = instr->tuplecount;
 	instr_time	endtime;
+	ListCell *lc;
+	char *cust_start = instr->cust_usage_start.data;
+	char *cust_usage = instr->cust_usage.data;
 
 	/* count the returned tuples */
 	instr->tuplecount += nTuples;
@@ -110,7 +168,15 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 		WalUsageAccumDiff(&instr->walusage,
 						  &pgWalUsage, &instr->walusage_start);
 
-	/* Is this the first tuple of this cycle? */
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->accum(cust_usage, ci->usage, cust_start);
+		cust_start += ci->size;
+		cust_usage += ci->size;
+	}
+
+    /* Is this the first tuple of this cycle? */
 	if (!instr->running)
 	{
 		instr->running = true;
@@ -168,6 +234,10 @@ InstrEndLoop(Instrumentation *instr)
 void
 InstrAggNode(Instrumentation *dst, Instrumentation *add)
 {
+	ListCell *lc;
+	char *cust_dst = dst->cust_usage.data;
+	char *cust_add = add->cust_usage.data;
+
 	if (!dst->running && add->running)
 	{
 		dst->running = true;
@@ -193,32 +263,69 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 
 	if (dst->need_walusage)
 		WalUsageAdd(&dst->walusage, &add->walusage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->add(cust_dst, cust_add);
+		cust_dst += ci->size;
+		cust_add += ci->size;
+	}
 }
 
 /* note current values during parallel executor startup */
 void
 InstrStartParallelQuery(void)
 {
+	ListCell* lc;
+	char* cust_dst = save_pgCustUsage.data;
+
 	save_pgBufferUsage = pgBufferUsage;
 	save_pgWalUsage = pgWalUsage;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		memcpy(cust_dst, ci->usage, ci->size);
+		cust_dst += ci->size;
+	}
 }
 
 /* report usage after parallel executor shutdown */
 void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
+	ListCell *lc;
+	char* cust_save = save_pgCustUsage.data;
+
 	memset(bufusage, 0, sizeof(BufferUsage));
 	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
 	memset(walusage, 0, sizeof(WalUsage));
 	WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->accum(cust_usage, ci->usage, cust_save);
+		cust_usage += ci->size;
+		cust_save += ci->size;
+	}
 }
 
 /* accumulate work done by workers in leader's stats */
 void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
+	ListCell *lc;
 	BufferUsageAdd(&pgBufferUsage, bufusage);
 	WalUsageAdd(&pgWalUsage, walusage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->add(ci->usage, cust_usage);
+		cust_usage += ci->size;
+	}
 }
 
 /* dst += add */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 3d3e632a0c..d2639be8c4 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -41,6 +41,7 @@ typedef struct ExplainState
 	bool		verbose;		/* be verbose */
 	bool		analyze;		/* print actual times */
 	bool		costs;			/* print estimated costs */
+	bool		custom;		    /* print custom usage */
 	bool		buffers;		/* print buffer usage */
 	bool		wal;			/* print WAL usage */
 	bool		timing;			/* print detailed node timing */
@@ -92,7 +93,8 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
 						   ExplainState *es, const char *queryString,
 						   ParamListInfo params, QueryEnvironment *queryEnv,
 						   const instr_time *planduration,
-						   const BufferUsage *bufusage);
+						   const BufferUsage *bufusage,
+						   const CustomInstrumentationData *custusage);
 
 extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
 extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
@@ -125,5 +127,6 @@ extern void ExplainOpenGroup(const char *objtype, const char *labelname,
 							 bool labeled, ExplainState *es);
 extern void ExplainCloseGroup(const char *objtype, const char *labelname,
 							  bool labeled, ExplainState *es);
+extern void ExplainIndentText(ExplainState *es);
 
 #endif							/* EXPLAIN_H */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 39a8792a31..02699828a2 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -27,6 +27,7 @@ typedef struct ParallelExecutorInfo
 	ParallelContext *pcxt;		/* parallel context we're using */
 	BufferUsage *buffer_usage;	/* points to bufusage area in DSM */
 	WalUsage   *wal_usage;		/* walusage area in DSM */
+	char *custom_usage;         /* points to custiom usage area in DSM */
 	SharedExecutorInstrumentation *instrumentation; /* optional */
 	struct SharedJitInstrumentation *jit_instrumentation;	/* optional */
 	dsa_area   *area;			/* points to DSA area in DSM */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183b..91d85cf3c8 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,7 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
-
+#include "nodes/pg_list.h"
 
 /*
  * BufferUsage and WalUsage counters keep being incremented infinitely,
@@ -63,6 +63,36 @@ typedef enum InstrumentOption
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
+/*
+ * Maximal total size of all custom intrumentations
+ */
+#define MAX_CUSTOM_INSTR_SIZE 128
+
+typedef struct {
+	char data[MAX_CUSTOM_INSTR_SIZE];
+} CustomInstrumentationData;
+
+typedef void CustomResourceUsage;
+typedef struct ExplainState ExplainState;
+typedef void (*cust_instr_add_t)(CustomResourceUsage* dst, CustomResourceUsage const* add);
+typedef void (*cust_instr_accum_t)(CustomResourceUsage* acc, CustomResourceUsage const* end, CustomResourceUsage const* start);
+typedef void (*cust_instr_show_t)(ExplainState* es, CustomResourceUsage const* usage, bool planning);
+
+typedef struct
+{
+	char const*          name; /* instrumentation name (as will be recongnized in EXPLAIN options */
+	Size                 size;
+	CustomResourceUsage* usage;
+	cust_instr_add_t     add;
+	cust_instr_accum_t   accum;
+	cust_instr_show_t    show;
+	bool                 selected; /* selected in EXPLAIN options */
+} CustomInstrumentation;
+
+extern PGDLLIMPORT List* pgCustInstr; /* description of custom instrumentations */
+extern Size              pgCustUsageSize;
+
+
 typedef struct Instrumentation
 {
 	/* Parameters set at node creation: */
@@ -88,6 +118,8 @@ typedef struct Instrumentation
 	double		nfiltered2;		/* # of tuples removed by "other" quals */
 	BufferUsage bufusage;		/* total buffer usage */
 	WalUsage	walusage;		/* total WAL usage */
+	CustomInstrumentationData cust_usage_start; /* state of custom usage at start */
+	CustomInstrumentationData cust_usage; /* total custom  usage */
 } Instrumentation;
 
 typedef struct WorkerInstrumentation
@@ -99,6 +131,11 @@ typedef struct WorkerInstrumentation
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
 
+
+extern void RegisterCustomInsrumentation(CustomInstrumentation* inst);
+extern void GetCustomInstrumentationState(char* dst);
+extern void AccumulateCustomInstrumentationState(char* dst, char const* before);
+
 extern Instrumentation *InstrAlloc(int n, int instrument_options,
 								   bool async_mode);
 extern void InstrInit(Instrumentation *instr, int instrument_options);
@@ -108,8 +145,8 @@ extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
 extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
 extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);
 extern void BufferUsageAccumDiff(BufferUsage *dst,
 								 const BufferUsage *add, const BufferUsage *sub);
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,

Pavel Stehule

pavel.stehule@gmail.com

about 2 years ago

In reply to: Konstantin Knizhnik (#1)

2 attachment(s)

Re: Custom explain options

so 25. 11. 2023 v 8:23 odesílatel Konstantin Knizhnik <knizhnik@garret.ru>
napsal:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...)
which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but also
for seqscan, indexscan and indexonly scan). Another possible candidate for
explain options is local file cache (extra caching layer above shared
buffers which is used to somehow replace file system cache in standalone
Postgres).

I think that it will be nice to have a generic mechanism which allows
extensions to add its own options to EXPLAIN.
I have attached the patch with implementation of such mechanism (also
available as PR: https://github.com/knizhnik/postgres/pull/1 )

I have demonstrated this mechanism using Bloom extension - just to report
number of Bloom matches.
Not sure that it is really useful information but it is used mostly as
example:

explain (analyze,bloom) select * from t where pk=2000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=15348.00..15352.01 rows=1 width=4) (actual time=25.244..25.939 rows=1 loops=1)
Recheck Cond: (pk = 2000)
Rows Removed by Index Recheck: 292
Heap Blocks: exact=283
Bloom: matches=293
-> Bitmap Index Scan on t_pk_idx (cost=0.00..15348.00 rows=1 width=0) (actual time=25.147..25.147 rows=293 loops=1)
Index Cond: (pk = 2000)
Bloom: matches=293
Planning:
Bloom: matches=0
Planning Time: 0.387 ms
Execution Time: 26.053 ms
(12 rows)

There are two known issues with this proposal:

1. I have to limit total size of all custom metrics - right now it is
limited by 128 bytes. It is done to keep Instrumentation and some other
data structures fixes size. Otherwise maintaining varying parts of this
structure is ugly, especially in shared memory

2. Custom extension is added by means of RegisterCustomInsrumentation function
which is called from _PG_init
But _PG_init is called when extension is loaded and it is loaded on
demand when some of extension functions is called (except when extension is
included
in shared_preload_libraries list), Bloom extension doesn't require it. So
if your first statement executed in your session is:

explain (analyze,bloom) select * from t where pk=2000;

...you will get error:

ERROR: unrecognized EXPLAIN option "bloom"
LINE 1: explain (analyze,bloom) select * from t where pk=2000;

It happens because at the moment when explain statement parses options,
Bloom index is not yet selected and so bloom extension is not loaded and
RegisterCustomInsrumentation is not yet called. If we repeat the query,
then proper result will be displayed (see above).

This patch has a lot of whitespaces and formatting issues. I fixed some

I don't understand how selecting some custom instrumentation can be safe.

List *pgCustInstr is a global variable. The attribute selected is set by
NewExplainState routine

+   /* Reset custom instrumentations selection flag */
+   foreach (lc, pgCustInstr)
+   {
+       CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
+       ci->selected = false;
+   }

and this attribute is used more times. But the queries can be nested.
Theoretically EXPLAIN ANALYZE can run another EXPLAIN ANALYZE, and then
this attribute of the global list can be rewritten. The list of selected
custom instrumentations should be part of explain state, I think.

Regards

Pavel

Attachments:

v20231129-0002-fix-whitespaces-and-formatting.patchtext/x-patch; charset=US-ASCII; name=v20231129-0002-fix-whitespaces-and-formatting.patchDownload

From fcc48bee889e4a052ba20f10f08d7f5b5eeaa43f Mon Sep 17 00:00:00 2001
From: "okbob@github.com" <pavel.stehule@gmail.com>
Date: Wed, 29 Nov 2023 07:51:51 +0100
Subject: [PATCH 2/2] fix whitespaces and formatting

---
 contrib/bloom/blscan.c                |  1 +
 contrib/bloom/blutils.c               | 21 +++++---
 src/backend/access/nbtree/nbtsort.c   |  7 +--
 src/backend/commands/explain.c        | 17 ++++---
 src/backend/commands/vacuumparallel.c |  5 +-
 src/backend/executor/execParallel.c   |  5 +-
 src/backend/executor/instrument.c     | 72 ++++++++++++++++-----------
 7 files changed, 79 insertions(+), 49 deletions(-)

diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 8890951943..7fc43f174d 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -167,5 +167,6 @@ blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	}
 	FreeAccessStrategy(bas);
 	bloomUsage.matches += ntids;
+
 	return ntids;
 }
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d8fff4f057..1ba7afc607 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -38,19 +38,23 @@ PG_FUNCTION_INFO_V1(blhandler);
 BloomUsage bloomUsage;
 
 static void
-bloomUsageAdd(CustomResourceUsage* dst, CustomResourceUsage const* add)
+bloomUsageAdd(CustomResourceUsage *dst,
+			  CustomResourceUsage const *add)
 {
-	((BloomUsage*)dst)->matches += ((BloomUsage*)add)->matches;
+	((BloomUsage*) dst)->matches += ((BloomUsage*) add)->matches;
 }
 
 static void
-bloomUsageAccum(CustomResourceUsage* acc, CustomResourceUsage const* end, CustomResourceUsage const* start)
+bloomUsageAccum(CustomResourceUsage* acc,
+				CustomResourceUsage const *end,
+				CustomResourceUsage const *start)
 {
-	((BloomUsage*)acc)->matches += ((BloomUsage*)end)->matches - ((BloomUsage*)start)->matches;;
+	((BloomUsage*) acc)->matches +=
+		  ((BloomUsage*) end)->matches - ((BloomUsage*) start)->matches;
 }
 
 static void
-bloomUsageShow(ExplainState* es, CustomResourceUsage const* usage, bool planning)
+bloomUsageShow(ExplainState *es, CustomResourceUsage const *usage, bool planning)
 {
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
@@ -60,18 +64,20 @@ bloomUsageShow(ExplainState* es, CustomResourceUsage const* usage, bool planning
 			appendStringInfoString(es->str, "Planning:\n");
 			es->indent++;
 		}
+
 		ExplainIndentText(es);
 		appendStringInfoString(es->str, "Bloom:");
 		appendStringInfo(es->str, " matches=%lld",
-						 (long long) ((BloomUsage*)usage)->matches);
+						 (long long) ((BloomUsage*) usage)->matches);
 		appendStringInfoChar(es->str, '\n');
+
 		if (planning)
 			es->indent--;
 	}
 	else
 	{
 		ExplainPropertyInteger("Bloom Matches", NULL,
-							   ((BloomUsage*)usage)->matches, es);
+							   ((BloomUsage*) usage)->matches, es);
 	}
 }
 
@@ -128,6 +134,7 @@ _PG_init(void)
 		bl_relopt_tab[i + 1].opttype = RELOPT_TYPE_INT;
 		bl_relopt_tab[i + 1].offset = offsetof(BloomOptions, bitSize[0]) + sizeof(int) * i;
 	}
+
 	RegisterCustomInsrumentation(&bloomInstr);
 }
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index a4ab0d6ba0..e9dc1190af 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1635,7 +1635,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 								 mul_size(pgCustUsageSize, pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUST_USAGE, custusage);
 
-
 	/* Launch workers, saving status for leader/caller */
 	LaunchParallelWorkers(pcxt);
 	btleader->pcxt = pcxt;
@@ -1687,7 +1686,9 @@ _bt_end_parallel(BTLeader *btleader)
 	 * or we might get incomplete data.)
 	 */
 	for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
-		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i], btleader->custusage + pgCustUsageSize*i);
+		InstrAccumParallelQuery(&btleader->bufferusage[i],
+								&btleader->walusage[i],
+								btleader->custusage + pgCustUsageSize * i);
 
 	/* Free last reference to MVCC snapshot, if one was used */
 	if (IsMVCCSnapshot(btleader->snapshot))
@@ -1906,7 +1907,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	custusage = shm_toc_lookup(toc, PARALLEL_KEY_CUST_USAGE, false);
 	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
 						  &walusage[ParallelWorkerNumber],
-						  custusage + ParallelWorkerNumber*pgCustUsageSize);
+						  custusage + ParallelWorkerNumber * pgCustUsageSize);
 
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d3dc19f3ce..a5dc8451f5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -226,9 +226,11 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		else
 		{
 			bool found = false;
+
 			foreach (c, pgCustInstr)
 			{
-				CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(c);
+				CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(c);
+
 				if (strcmp(opt->defname, ci->name) == 0)
 				{
 					ci->selected = true;
@@ -339,7 +341,7 @@ ExplainState *
 NewExplainState(void)
 {
 	ExplainState *es = (ExplainState *) palloc0(sizeof(ExplainState));
-	ListCell* lc;
+	ListCell   *lc;
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
@@ -349,7 +351,8 @@ NewExplainState(void)
 	/* Reset custom instrumentations selection flag */
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		ci->selected = false;
 	}
 	return es;
@@ -3598,15 +3601,17 @@ explain_get_index_name(Oid indexId)
  * Show select custom usage details
  */
 static void
-show_custom_usage(ExplainState *es, const char* usage, bool planning)
+show_custom_usage(ExplainState *es, const char *usage, bool planning)
 {
-	ListCell* lc;
+	ListCell   *lc;
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		if (ci->selected)
 			ci->show(es, usage, planning);
+
 		usage += ci->size;
 	}
 }
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 299782b2f7..5afb31d145 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -411,7 +411,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 								 mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
 	pvs->wal_usage = wal_usage;
-    custom_usage = shm_toc_allocate(pcxt->toc,
+	custom_usage = shm_toc_allocate(pcxt->toc,
 									mul_size(pgCustUsageSize, pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_CUSTOM_USAGE, custom_usage);
 	pvs->custom_usage = custom_usage;
@@ -718,7 +718,8 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
 		WaitForParallelWorkersToFinish(pvs->pcxt);
 
 		for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
-			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i], pvs->custom_usage + pgCustUsageSize*i);
+			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i],
+									pvs->custom_usage + pgCustUsageSize * i);
 	}
 
 	/*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 39d9754a0f..cba8fe7795 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1178,7 +1178,8 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 	 * finish, or we might get incomplete data.)
 	 */
 	for (i = 0; i < nworkers; i++)
-		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i], pei->custom_usage + pgCustUsageSize*i);
+		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i],
+								pei->custom_usage + pgCustUsageSize * i);
 
 	pei->finished = true;
 }
@@ -1411,7 +1412,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	FixedParallelExecutorState *fpes;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
-	char *custom_usage; 
+	char *custom_usage;
 	DestReceiver *receiver;
 	QueryDesc  *queryDesc;
 	SharedExecutorInstrumentation *instrumentation;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ca3e7eff74..d37b783616 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -23,33 +23,36 @@ static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
 static WalUsage save_pgWalUsage;
 
-List* pgCustInstr; /* description of custom instriumentations */
+List *pgCustInstr; /* description of custom instriumentations */
 Size pgCustUsageSize;
 static CustomInstrumentationData save_pgCustUsage; /* saved custom instrumentation state */
 
-
 static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
 void
-RegisterCustomInsrumentation(CustomInstrumentation* inst)
+RegisterCustomInsrumentation(CustomInstrumentation *inst)
 {
 	MemoryContext oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
 	pgCustInstr = lappend(pgCustInstr, inst);
 	pgCustUsageSize += inst->size;
 	MemoryContextSwitchTo(oldcontext);
+
 	if (pgCustUsageSize > MAX_CUSTOM_INSTR_SIZE)
-		elog(ERROR, "Total size of custom instrumentations exceed limit %d",  MAX_CUSTOM_INSTR_SIZE);
+		elog(ERROR, "Total size of custom instrumentations exceed limit %d",
+			 MAX_CUSTOM_INSTR_SIZE);
 }
 
 void
-GetCustomInstrumentationState(char* dst)
+GetCustomInstrumentationState(char *dst)
 {
-	ListCell* lc;
+	ListCell   *lc;
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		memcpy(dst, ci->usage, ci->size);
 		dst += ci->size;
 	}
@@ -58,16 +61,18 @@ GetCustomInstrumentationState(char* dst)
 void
 AccumulateCustomInstrumentationState(char* dst, char const* before)
 {
-	ListCell* lc;
+	ListCell   *lc;
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		if (ci->selected)
 		{
 			memset(dst, 0, ci->size);
 			ci->accum(dst, ci->usage, before);
 		}
+
 		dst += ci->size;
 		before += ci->size;
 	}
@@ -96,6 +101,7 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 			instr[i].async_mode = async_mode;
 		}
 	}
+
 	return instr;
 }
 
@@ -113,8 +119,9 @@ InstrInit(Instrumentation *instr, int instrument_options)
 void
 InstrStartNode(Instrumentation *instr)
 {
-	ListCell *lc;
-	char* cust_start = instr->cust_usage_start.data;
+	ListCell   *lc;
+	char	   *cust_start = instr->cust_usage_start.data;
+
 	if (instr->need_timer &&
 		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
 		elog(ERROR, "InstrStartNode called twice in a row");
@@ -128,7 +135,8 @@ InstrStartNode(Instrumentation *instr)
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		memcpy(cust_start, ci->usage, ci->size);
 		cust_start += ci->size;
 	}
@@ -140,9 +148,9 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 {
 	double		save_tuplecount = instr->tuplecount;
 	instr_time	endtime;
-	ListCell *lc;
-	char *cust_start = instr->cust_usage_start.data;
-	char *cust_usage = instr->cust_usage.data;
+	ListCell  *lc;
+	char	   *cust_start = instr->cust_usage_start.data;
+	char	   *cust_usage = instr->cust_usage.data;
 
 	/* count the returned tuples */
 	instr->tuplecount += nTuples;
@@ -170,13 +178,14 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		ci->accum(cust_usage, ci->usage, cust_start);
 		cust_start += ci->size;
 		cust_usage += ci->size;
 	}
 
-    /* Is this the first tuple of this cycle? */
+	/* Is this the first tuple of this cycle? */
 	if (!instr->running)
 	{
 		instr->running = true;
@@ -234,9 +243,9 @@ InstrEndLoop(Instrumentation *instr)
 void
 InstrAggNode(Instrumentation *dst, Instrumentation *add)
 {
-	ListCell *lc;
-	char *cust_dst = dst->cust_usage.data;
-	char *cust_add = add->cust_usage.data;
+	ListCell   *lc;
+	char	   *cust_dst = dst->cust_usage.data;
+	char	   *cust_add = add->cust_usage.data;
 
 	if (!dst->running && add->running)
 	{
@@ -266,7 +275,8 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		ci->add(cust_dst, cust_add);
 		cust_dst += ci->size;
 		cust_add += ci->size;
@@ -277,15 +287,16 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 void
 InstrStartParallelQuery(void)
 {
-	ListCell* lc;
-	char* cust_dst = save_pgCustUsage.data;
+	ListCell   *lc;
+	char	   *cust_dst = save_pgCustUsage.data;
 
 	save_pgBufferUsage = pgBufferUsage;
 	save_pgWalUsage = pgWalUsage;
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		memcpy(cust_dst, ci->usage, ci->size);
 		cust_dst += ci->size;
 	}
@@ -295,8 +306,8 @@ InstrStartParallelQuery(void)
 void
 InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
-	ListCell *lc;
-	char* cust_save = save_pgCustUsage.data;
+	ListCell   *lc;
+	char	   *cust_save = save_pgCustUsage.data;
 
 	memset(bufusage, 0, sizeof(BufferUsage));
 	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
@@ -305,7 +316,8 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usag
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		ci->accum(cust_usage, ci->usage, cust_save);
 		cust_usage += ci->size;
 		cust_save += ci->size;
@@ -316,13 +328,15 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usag
 void
 InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
-	ListCell *lc;
+	ListCell   *lc;
+
 	BufferUsageAdd(&pgBufferUsage, bufusage);
 	WalUsageAdd(&pgWalUsage, walusage);
 
 	foreach (lc, pgCustInstr)
 	{
-		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
 		ci->add(ci->usage, cust_usage);
 		cust_usage += ci->size;
 	}
-- 
2.43.0

v20231129-0001-custom-explain-options.patchtext/x-patch; charset=US-ASCII; name=v20231129-0001-custom-explain-options.patchDownload

From 652244e518cc0a47c43955d335e3f9372d7925be Mon Sep 17 00:00:00 2001
From: "okbob@github.com" <pavel.stehule@gmail.com>
Date: Sat, 25 Nov 2023 08:21:24 +0100
Subject: [PATCH 1/2] custom explain options

---
 contrib/bloom/bloom.h                 |   7 ++
 contrib/bloom/blscan.c                |   2 +-
 contrib/bloom/blutils.c               |  51 ++++++++++++
 src/backend/access/nbtree/nbtsort.c   |  18 +++-
 src/backend/commands/explain.c        |  79 ++++++++++++++++--
 src/backend/commands/prepare.c        |  10 ++-
 src/backend/commands/vacuumparallel.c |  19 ++++-
 src/backend/executor/execParallel.c   |  21 ++++-
 src/backend/executor/instrument.c     | 115 +++++++++++++++++++++++++-
 src/include/commands/explain.h        |   5 +-
 src/include/executor/execParallel.h   |   1 +
 src/include/executor/instrument.h     |  43 +++++++++-
 12 files changed, 349 insertions(+), 22 deletions(-)

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 330811ec60..af23ffe821 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -174,6 +174,13 @@ typedef struct BloomScanOpaqueData
 
 typedef BloomScanOpaqueData *BloomScanOpaque;
 
+typedef struct
+{
+	uint64 matches;
+} BloomUsage;
+
+extern BloomUsage bloomUsage;
+
 /* blutils.c */
 extern void initBloomState(BloomState *state, Relation index);
 extern void BloomFillMetapage(Relation index, Page metaPage);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 61d1f66b38..8890951943 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -166,6 +166,6 @@ blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 		CHECK_FOR_INTERRUPTS();
 	}
 	FreeAccessStrategy(bas);
-
+	bloomUsage.matches += ntids;
 	return ntids;
 }
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 4830cb3fee..d8fff4f057 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -18,6 +18,7 @@
 #include "access/reloptions.h"
 #include "bloom.h"
 #include "catalog/index.h"
+#include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -34,6 +35,55 @@
 
 PG_FUNCTION_INFO_V1(blhandler);
 
+BloomUsage bloomUsage;
+
+static void
+bloomUsageAdd(CustomResourceUsage* dst, CustomResourceUsage const* add)
+{
+	((BloomUsage*)dst)->matches += ((BloomUsage*)add)->matches;
+}
+
+static void
+bloomUsageAccum(CustomResourceUsage* acc, CustomResourceUsage const* end, CustomResourceUsage const* start)
+{
+	((BloomUsage*)acc)->matches += ((BloomUsage*)end)->matches - ((BloomUsage*)start)->matches;;
+}
+
+static void
+bloomUsageShow(ExplainState* es, CustomResourceUsage const* usage, bool planning)
+{
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (planning)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Planning:\n");
+			es->indent++;
+		}
+		ExplainIndentText(es);
+		appendStringInfoString(es->str, "Bloom:");
+		appendStringInfo(es->str, " matches=%lld",
+						 (long long) ((BloomUsage*)usage)->matches);
+		appendStringInfoChar(es->str, '\n');
+		if (planning)
+			es->indent--;
+	}
+	else
+	{
+		ExplainPropertyInteger("Bloom Matches", NULL,
+							   ((BloomUsage*)usage)->matches, es);
+	}
+}
+
+static CustomInstrumentation bloomInstr = {
+	"bloom",
+	sizeof(BloomUsage),
+	&bloomUsage,
+	bloomUsageAdd,
+	bloomUsageAccum,
+	bloomUsageShow
+};
+
 /* Kind of relation options for bloom index */
 static relopt_kind bl_relopt_kind;
 
@@ -78,6 +128,7 @@ _PG_init(void)
 		bl_relopt_tab[i + 1].opttype = RELOPT_TYPE_INT;
 		bl_relopt_tab[i + 1].offset = offsetof(BloomOptions, bitSize[0]) + sizeof(int) * i;
 	}
+	RegisterCustomInsrumentation(&bloomInstr);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index c2665fce41..a4ab0d6ba0 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -71,6 +71,7 @@
 #define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xA000000000000004)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xA000000000000005)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_CUST_USAGE		    UINT64CONST(0xA000000000000007)
 
 /*
  * DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -197,6 +198,7 @@ typedef struct BTLeader
 	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char       *custusage;
 } BTLeader;
 
 /*
@@ -1467,6 +1469,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BTLeader   *btleader = (BTLeader *) palloc0(sizeof(BTLeader));
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char        *custusage;
 	bool		leaderparticipates = true;
 	int			querylen;
 
@@ -1532,6 +1535,9 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
 	if (debug_query_string)
@@ -1625,6 +1631,10 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	bufferusage = shm_toc_allocate(pcxt->toc,
 								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+	custusage = shm_toc_allocate(pcxt->toc,
+								 mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUST_USAGE, custusage);
+
 
 	/* Launch workers, saving status for leader/caller */
 	LaunchParallelWorkers(pcxt);
@@ -1638,6 +1648,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btleader->snapshot = snapshot;
 	btleader->walusage = walusage;
 	btleader->bufferusage = bufferusage;
+	btleader->custusage = custusage;
 
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
@@ -1676,7 +1687,7 @@ _bt_end_parallel(BTLeader *btleader)
 	 * or we might get incomplete data.)
 	 */
 	for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
-		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i], btleader->custusage + pgCustUsageSize*i);
 
 	/* Free last reference to MVCC snapshot, if one was used */
 	if (IsMVCCSnapshot(btleader->snapshot))
@@ -1811,6 +1822,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	LOCKMODE	indexLockmode;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char       *custusage;
 	int			sortmem;
 
 #ifdef BTREE_BUILD_STATS
@@ -1891,8 +1903,10 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	custusage = shm_toc_lookup(toc, PARALLEL_KEY_CUST_USAGE, false);
 	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
-						  &walusage[ParallelWorkerNumber]);
+						  &walusage[ParallelWorkerNumber],
+						  custusage + ParallelWorkerNumber*pgCustUsageSize);
 
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f1d71bc54e..d3dc19f3ce 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -119,6 +119,8 @@ static void show_instrumentation_count(const char *qlabel, int which,
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
+static void show_custom_usage(ExplainState *es, const char* usage,
+							  bool planning);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 							  bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
@@ -149,7 +151,6 @@ static void ExplainRestoreGroup(ExplainState *es, int depth, int *state_save);
 static void ExplainDummyGroup(const char *objtype, const char *labelname,
 							  ExplainState *es);
 static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
-static void ExplainIndentText(ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
@@ -170,6 +171,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 	Query	   *query;
 	List	   *rewritten;
 	ListCell   *lc;
+	ListCell   *c;
 	bool		timing_set = false;
 	bool		summary_set = false;
 
@@ -222,11 +224,28 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 						 parser_errposition(pstate, opt->location)));
 		}
 		else
-			ereport(ERROR,
+		{
+			bool found = false;
+			foreach (c, pgCustInstr)
+			{
+				CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(c);
+				if (strcmp(opt->defname, ci->name) == 0)
+				{
+					ci->selected = true;
+					es->custom = true;
+					found = true;
+					break;
+				}
+			}
+			if (!found)
+			{
+				ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
 					 errmsg("unrecognized EXPLAIN option \"%s\"",
 							opt->defname),
 					 parser_errposition(pstate, opt->location)));
+			}
+		}
 	}
 
 	/* check that WAL is used with EXPLAIN ANALYZE */
@@ -320,12 +339,19 @@ ExplainState *
 NewExplainState(void)
 {
 	ExplainState *es = (ExplainState *) palloc0(sizeof(ExplainState));
+	ListCell* lc;
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
+	/* Reset custom instrumentations selection flag */
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->selected = false;
+	}
 	return es;
 }
 
@@ -397,9 +423,14 @@ ExplainOneQuery(Query *query, int cursorOptions,
 					planduration;
 		BufferUsage bufusage_start,
 					bufusage;
+		CustomInstrumentationData custusage_start, custusage;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
+
+		if (es->custom)
+			GetCustomInstrumentationState(custusage_start.data);
+
 		INSTR_TIME_SET_CURRENT(planstart);
 
 		/* plan the query */
@@ -415,9 +446,14 @@ ExplainOneQuery(Query *query, int cursorOptions,
 			BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 		}
 
+		if (es->custom)
+			AccumulateCustomInstrumentationState(custusage.data, custusage_start.data);
+
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-					   &planduration, (es->buffers ? &bufusage : NULL));
+					   &planduration,
+					   (es->buffers ? &bufusage : NULL),
+					   (es->custom ? &custusage : NULL));
 	}
 }
 
@@ -527,7 +563,8 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage)
+			   const BufferUsage *bufusage,
+			   const CustomInstrumentationData *custusage)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -623,6 +660,13 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		ExplainCloseGroup("Planning", "Planning", true, es);
 	}
 
+	if (custusage)
+	{
+		ExplainOpenGroup("Planning", "Planning", true, es);
+		show_custom_usage(es, custusage->data, true);
+		ExplainCloseGroup("Planning", "Planning", true, es);
+	}
+
 	if (es->summary && planduration)
 	{
 		double		plantime = INSTR_TIME_GET_DOUBLE(*planduration);
@@ -2110,8 +2154,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->walusage);
 
+	/* Show custom instrumentation */
+	if (es->custom && planstate->instrument)
+		show_custom_usage(es, planstate->instrument->cust_usage.data, false);
+
 	/* Prepare per-worker buffer/WAL usage */
-	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
+	if (es->workers_state && (es->buffers || es->wal || es->custom) && es->verbose)
 	{
 		WorkerInstrumentation *w = planstate->worker_instrument;
 
@@ -2128,6 +2176,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_buffer_usage(es, &instrument->bufusage, false);
 			if (es->wal)
 				show_wal_usage(es, &instrument->walusage);
+			if (es->custom)
+				show_custom_usage(es, instrument->cust_usage.data, false);
 			ExplainCloseWorker(n, es);
 		}
 	}
@@ -3544,6 +3594,23 @@ explain_get_index_name(Oid indexId)
 	return result;
 }
 
+/*
+ * Show select custom usage details
+ */
+static void
+show_custom_usage(ExplainState *es, const char* usage, bool planning)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		if (ci->selected)
+			ci->show(es, usage, planning);
+		usage += ci->size;
+	}
+}
+
 /*
  * Show buffer usage details.
  */
@@ -5017,7 +5084,7 @@ ExplainXMLTag(const char *tagname, int flags, ExplainState *es)
  * data for a parallel worker there might already be data on the current line
  * (cf. ExplainOpenWorker); in that case, don't indent any more.
  */
-static void
+void
 ExplainIndentText(ExplainState *es)
 {
 	Assert(es->format == EXPLAIN_FORMAT_TEXT);
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 18f70319fc..7999980699 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,9 +583,13 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	instr_time	planduration;
 	BufferUsage bufusage_start,
 				bufusage;
+	CustomInstrumentationData custusage_start, custusage;
 
 	if (es->buffers)
 		bufusage_start = pgBufferUsage;
+	if (es->custom)
+		GetCustomInstrumentationState(custusage_start.data);
+
 	INSTR_TIME_SET_CURRENT(planstart);
 
 	/* Look it up in the hash table */
@@ -630,6 +634,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		memset(&bufusage, 0, sizeof(BufferUsage));
 		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 	}
+	if (es->custom)
+		AccumulateCustomInstrumentationState(custusage.data, custusage_start.data);
 
 	plan_list = cplan->stmt_list;
 
@@ -640,7 +646,9 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 
 		if (pstmt->commandType != CMD_UTILITY)
 			ExplainOnePlan(pstmt, into, es, query_string, paramLI, queryEnv,
-						   &planduration, (es->buffers ? &bufusage : NULL));
+						   &planduration,
+						   (es->buffers ? &bufusage : NULL),
+						   (es->custom ? &custusage : NULL));
 		else
 			ExplainOneUtility(pstmt->utilityStmt, into, es, query_string,
 							  paramLI, queryEnv);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 176c555d63..299782b2f7 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -50,6 +50,7 @@
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
 #define PARALLEL_VACUUM_KEY_INDEX_STATS		6
+#define PARALLEL_VACUUM_KEY_CUSTOM_USAGE	7
 
 /*
  * Shared information among parallel workers.  So this is allocated in the DSM
@@ -184,6 +185,9 @@ struct ParallelVacuumState
 	/* Points to WAL usage area in DSM */
 	WalUsage   *wal_usage;
 
+	/* Points to custom usage area in DSM */
+	char *custom_usage;
+
 	/*
 	 * False if the index is totally unsuitable target for all parallel
 	 * processing. For example, the index could be <
@@ -242,6 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 	PVIndStats *indstats;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char       *custom_usage;
 	bool	   *will_parallel_vacuum;
 	Size		est_indstats_len;
 	Size		est_shared_len;
@@ -313,6 +318,9 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 	/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
 	if (debug_query_string)
@@ -403,6 +411,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 								 mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
 	pvs->wal_usage = wal_usage;
+    custom_usage = shm_toc_allocate(pcxt->toc,
+									mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_CUSTOM_USAGE, custom_usage);
+	pvs->custom_usage = custom_usage;
 
 	/* Store query string for workers */
 	if (debug_query_string)
@@ -706,7 +718,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
 		WaitForParallelWorkersToFinish(pvs->pcxt);
 
 		for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
-			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i], pvs->custom_usage + pgCustUsageSize*i);
 	}
 
 	/*
@@ -964,6 +976,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacDeadItems *dead_items;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char       *custom_usage;
 	int			nindexes;
 	char	   *sharedquery;
 	ErrorContextCallback errcallback;
@@ -1053,8 +1066,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
+	custom_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_CUSTOM_USAGE, false);
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
-						  &wal_usage[ParallelWorkerNumber]);
+						  &wal_usage[ParallelWorkerNumber],
+						  custom_usage + pgCustUsageSize*ParallelWorkerNumber);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index cc2b8ccab7..39d9754a0f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -66,6 +66,7 @@
 #define PARALLEL_KEY_QUERY_TEXT		UINT64CONST(0xE000000000000008)
 #define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_CUSTOM_USAGE		UINT64CONST(0xE00000000000000B)
 
 #define PARALLEL_TUPLE_QUEUE_SIZE		65536
 
@@ -600,6 +601,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	char	   *paramlistinfo_space;
 	BufferUsage *bufusage_space;
 	WalUsage   *walusage_space;
+	char       *customusage_space;
 	SharedExecutorInstrumentation *instrumentation = NULL;
 	SharedJitInstrumentation *jit_instrumentation = NULL;
 	int			pstmt_len;
@@ -680,6 +682,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/*
+	 * Same thing for CustomUsage.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 	/* Estimate space for tuple queues. */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -768,6 +777,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
 	pei->wal_usage = walusage_space;
 
+	customusage_space = shm_toc_allocate(pcxt->toc,
+										 mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUSTOM_USAGE, customusage_space);
+	pei->custom_usage = customusage_space;
+
 	/* Set up the tuple queues that the workers will write into. */
 	pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
 
@@ -1164,7 +1178,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 	 * finish, or we might get incomplete data.)
 	 */
 	for (i = 0; i < nworkers; i++)
-		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i], pei->custom_usage + pgCustUsageSize*i);
 
 	pei->finished = true;
 }
@@ -1397,6 +1411,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	FixedParallelExecutorState *fpes;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char *custom_usage; 
 	DestReceiver *receiver;
 	QueryDesc  *queryDesc;
 	SharedExecutorInstrumentation *instrumentation;
@@ -1472,8 +1487,10 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	custom_usage = shm_toc_lookup(toc, PARALLEL_KEY_CUSTOM_USAGE, false);
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
-						  &wal_usage[ParallelWorkerNumber]);
+						  &wal_usage[ParallelWorkerNumber],
+						  custom_usage + ParallelWorkerNumber*pgCustUsageSize);
 
 	/* Report instrumentation data if any instrumentation options are set. */
 	if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c383f34c06..ca3e7eff74 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,15 +16,62 @@
 #include <unistd.h>
 
 #include "executor/instrument.h"
+#include "utils/memutils.h"
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
 static WalUsage save_pgWalUsage;
 
+List* pgCustInstr; /* description of custom instriumentations */
+Size pgCustUsageSize;
+static CustomInstrumentationData save_pgCustUsage; /* saved custom instrumentation state */
+
+
 static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
+void
+RegisterCustomInsrumentation(CustomInstrumentation* inst)
+{
+	MemoryContext oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+	pgCustInstr = lappend(pgCustInstr, inst);
+	pgCustUsageSize += inst->size;
+	MemoryContextSwitchTo(oldcontext);
+	if (pgCustUsageSize > MAX_CUSTOM_INSTR_SIZE)
+		elog(ERROR, "Total size of custom instrumentations exceed limit %d",  MAX_CUSTOM_INSTR_SIZE);
+}
+
+void
+GetCustomInstrumentationState(char* dst)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		memcpy(dst, ci->usage, ci->size);
+		dst += ci->size;
+	}
+}
+
+void
+AccumulateCustomInstrumentationState(char* dst, char const* before)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		if (ci->selected)
+		{
+			memset(dst, 0, ci->size);
+			ci->accum(dst, ci->usage, before);
+		}
+		dst += ci->size;
+		before += ci->size;
+	}
+}
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -49,7 +96,6 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 			instr[i].async_mode = async_mode;
 		}
 	}
-
 	return instr;
 }
 
@@ -67,6 +113,8 @@ InstrInit(Instrumentation *instr, int instrument_options)
 void
 InstrStartNode(Instrumentation *instr)
 {
+	ListCell *lc;
+	char* cust_start = instr->cust_usage_start.data;
 	if (instr->need_timer &&
 		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
 		elog(ERROR, "InstrStartNode called twice in a row");
@@ -77,6 +125,13 @@ InstrStartNode(Instrumentation *instr)
 
 	if (instr->need_walusage)
 		instr->walusage_start = pgWalUsage;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		memcpy(cust_start, ci->usage, ci->size);
+		cust_start += ci->size;
+	}
 }
 
 /* Exit from a plan node */
@@ -85,6 +140,9 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 {
 	double		save_tuplecount = instr->tuplecount;
 	instr_time	endtime;
+	ListCell *lc;
+	char *cust_start = instr->cust_usage_start.data;
+	char *cust_usage = instr->cust_usage.data;
 
 	/* count the returned tuples */
 	instr->tuplecount += nTuples;
@@ -110,7 +168,15 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 		WalUsageAccumDiff(&instr->walusage,
 						  &pgWalUsage, &instr->walusage_start);
 
-	/* Is this the first tuple of this cycle? */
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->accum(cust_usage, ci->usage, cust_start);
+		cust_start += ci->size;
+		cust_usage += ci->size;
+	}
+
+    /* Is this the first tuple of this cycle? */
 	if (!instr->running)
 	{
 		instr->running = true;
@@ -168,6 +234,10 @@ InstrEndLoop(Instrumentation *instr)
 void
 InstrAggNode(Instrumentation *dst, Instrumentation *add)
 {
+	ListCell *lc;
+	char *cust_dst = dst->cust_usage.data;
+	char *cust_add = add->cust_usage.data;
+
 	if (!dst->running && add->running)
 	{
 		dst->running = true;
@@ -193,32 +263,69 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 
 	if (dst->need_walusage)
 		WalUsageAdd(&dst->walusage, &add->walusage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->add(cust_dst, cust_add);
+		cust_dst += ci->size;
+		cust_add += ci->size;
+	}
 }
 
 /* note current values during parallel executor startup */
 void
 InstrStartParallelQuery(void)
 {
+	ListCell* lc;
+	char* cust_dst = save_pgCustUsage.data;
+
 	save_pgBufferUsage = pgBufferUsage;
 	save_pgWalUsage = pgWalUsage;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		memcpy(cust_dst, ci->usage, ci->size);
+		cust_dst += ci->size;
+	}
 }
 
 /* report usage after parallel executor shutdown */
 void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
+	ListCell *lc;
+	char* cust_save = save_pgCustUsage.data;
+
 	memset(bufusage, 0, sizeof(BufferUsage));
 	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
 	memset(walusage, 0, sizeof(WalUsage));
 	WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->accum(cust_usage, ci->usage, cust_save);
+		cust_usage += ci->size;
+		cust_save += ci->size;
+	}
 }
 
 /* accumulate work done by workers in leader's stats */
 void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
+	ListCell *lc;
 	BufferUsageAdd(&pgBufferUsage, bufusage);
 	WalUsageAdd(&pgWalUsage, walusage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->add(ci->usage, cust_usage);
+		cust_usage += ci->size;
+	}
 }
 
 /* dst += add */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index f9525fb572..9c25a85e92 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -41,6 +41,7 @@ typedef struct ExplainState
 	bool		verbose;		/* be verbose */
 	bool		analyze;		/* print actual times */
 	bool		costs;			/* print estimated costs */
+	bool		custom;		    /* print custom usage */
 	bool		buffers;		/* print buffer usage */
 	bool		wal;			/* print WAL usage */
 	bool		timing;			/* print detailed node timing */
@@ -92,7 +93,8 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
 						   ExplainState *es, const char *queryString,
 						   ParamListInfo params, QueryEnvironment *queryEnv,
 						   const instr_time *planduration,
-						   const BufferUsage *bufusage);
+						   const BufferUsage *bufusage,
+						   const CustomInstrumentationData *custusage);
 
 extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
 extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
@@ -125,5 +127,6 @@ extern void ExplainOpenGroup(const char *objtype, const char *labelname,
 							 bool labeled, ExplainState *es);
 extern void ExplainCloseGroup(const char *objtype, const char *labelname,
 							  bool labeled, ExplainState *es);
+extern void ExplainIndentText(ExplainState *es);
 
 #endif							/* EXPLAIN_H */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 39a8792a31..02699828a2 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -27,6 +27,7 @@ typedef struct ParallelExecutorInfo
 	ParallelContext *pcxt;		/* parallel context we're using */
 	BufferUsage *buffer_usage;	/* points to bufusage area in DSM */
 	WalUsage   *wal_usage;		/* walusage area in DSM */
+	char *custom_usage;         /* points to custiom usage area in DSM */
 	SharedExecutorInstrumentation *instrumentation; /* optional */
 	struct SharedJitInstrumentation *jit_instrumentation;	/* optional */
 	dsa_area   *area;			/* points to DSA area in DSM */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d5d69941c5..2e6a330f6f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,7 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
-
+#include "nodes/pg_list.h"
 
 /*
  * BufferUsage and WalUsage counters keep being incremented infinitely,
@@ -65,6 +65,36 @@ typedef enum InstrumentOption
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
+/*
+ * Maximal total size of all custom intrumentations
+ */
+#define MAX_CUSTOM_INSTR_SIZE 128
+
+typedef struct {
+	char data[MAX_CUSTOM_INSTR_SIZE];
+} CustomInstrumentationData;
+
+typedef void CustomResourceUsage;
+typedef struct ExplainState ExplainState;
+typedef void (*cust_instr_add_t)(CustomResourceUsage* dst, CustomResourceUsage const* add);
+typedef void (*cust_instr_accum_t)(CustomResourceUsage* acc, CustomResourceUsage const* end, CustomResourceUsage const* start);
+typedef void (*cust_instr_show_t)(ExplainState* es, CustomResourceUsage const* usage, bool planning);
+
+typedef struct
+{
+	char const*          name; /* instrumentation name (as will be recongnized in EXPLAIN options */
+	Size                 size;
+	CustomResourceUsage* usage;
+	cust_instr_add_t     add;
+	cust_instr_accum_t   accum;
+	cust_instr_show_t    show;
+	bool                 selected; /* selected in EXPLAIN options */
+} CustomInstrumentation;
+
+extern PGDLLIMPORT List* pgCustInstr; /* description of custom instrumentations */
+extern Size              pgCustUsageSize;
+
+
 typedef struct Instrumentation
 {
 	/* Parameters set at node creation: */
@@ -90,6 +120,8 @@ typedef struct Instrumentation
 	double		nfiltered2;		/* # of tuples removed by "other" quals */
 	BufferUsage bufusage;		/* total buffer usage */
 	WalUsage	walusage;		/* total WAL usage */
+	CustomInstrumentationData cust_usage_start; /* state of custom usage at start */
+	CustomInstrumentationData cust_usage; /* total custom  usage */
 } Instrumentation;
 
 typedef struct WorkerInstrumentation
@@ -101,6 +133,11 @@ typedef struct WorkerInstrumentation
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
 
+
+extern void RegisterCustomInsrumentation(CustomInstrumentation* inst);
+extern void GetCustomInstrumentationState(char* dst);
+extern void AccumulateCustomInstrumentationState(char* dst, char const* before);
+
 extern Instrumentation *InstrAlloc(int n, int instrument_options,
 								   bool async_mode);
 extern void InstrInit(Instrumentation *instr, int instrument_options);
@@ -110,8 +147,8 @@ extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
 extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
 extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);
 extern void BufferUsageAccumDiff(BufferUsage *dst,
 								 const BufferUsage *add, const BufferUsage *sub);
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
-- 
2.43.0

Andrei Lepikhov

a.lepikhov@postgrespro.ru

about 2 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Custom explain options

On 21/10/2023 19:16, Konstantin Knizhnik wrote:

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching layer
above shared buffers which is used to somehow replace file system cache
in standalone Postgres).

I think that it will be nice to have a generic mechanism which allows
extensions to add its own options to EXPLAIN.

Generally, I welcome this idea: Extensions can already do a lot of work,
and they should have a tool to report their state, not only into the log.
But I think your approach needs to be elaborated. At first, it would be
better to allow registering extended instruments for specific node types
to avoid unneeded calls.
Secondly, looking into the Instrumentation usage, I don't see the reason
to limit the size: as I see everywhere it exists locally or in the DSA
where its size is calculated on the fly. So, by registering an extended
instrument, we can reserve a slot for the extension. The actual size of
underlying data can be provided by the extension routine.

--
regards,
Andrei Lepikhov
Postgres Professional

Konstantin Knizhnik

knizhnik@garret.ru

about 2 years ago

In reply to: Andrei Lepikhov (#3)

Re: Custom explain options

On 30/11/2023 5:59 am, Andrei Lepikhov wrote:

On 21/10/2023 19:16, Konstantin Knizhnik wrote:

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about
page prefetching during query execution (prefetching is more critical
for Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching
layer above shared buffers which is used to somehow replace file
system cache in standalone Postgres).

I think that it will be nice to have a generic mechanism which allows
extensions to add its own options to EXPLAIN.

Generally, I welcome this idea: Extensions can already do a lot of
work, and they should have a tool to report their state, not only into
the log.
But I think your approach needs to be elaborated. At first, it would
be better to allow registering extended instruments for specific node
types to avoid unneeded calls.
Secondly, looking into the Instrumentation usage, I don't see the
reason to limit the size: as I see everywhere it exists locally or in
the DSA where its size is calculated on the fly. So, by registering an
extended instrument, we can reserve a slot for the extension. The
actual size of underlying data can be provided by the extension routine.

Thank you for review.

I agree that support of extended instruments is desired. I just tried to
minimize number of changes to make this patch smaller.

Concerning limiting instrumentation size, may be I missed something, but
I do not see any goo way to handle this:

```

./src/backend/executor/nodeMemoize.c1106:        si =
&node->shared_info->sinstrument[ParallelWorkerNumber];
./src/backend/executor/nodeAgg.c4322:       si =
&node->shared_info->sinstrument[ParallelWorkerNumber];
./src/backend/executor/nodeIncrementalSort.c107:
instrumentSortedGroup(&(node)->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo,
\
./src/backend/executor/execParallel.c808: InstrInit(&instrument[i],
estate->es_instrument);
./src/backend/executor/execParallel.c1052:
InstrAggNode(planstate->instrument, &instrument[n]);
./src/backend/executor/execParallel.c1306:
InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
./src/backend/commands/explain.c1763:            Instrumentation
*instrument = &w->instrument[n];
./src/backend/commands/explain.c2168:            Instrumentation
*instrument = &w->instrument[n];
```

In all this cases we are using array of `Instrumentation` and if it
contains varying part, then it is not clear where to place it.
Yes, there is also code which serialize and sends instrumentations
between worker processes and I have updated this code in my PR to send
actual amount of custom instrumentation data. But it can not help with
the cases above.

Andrei Lepikhov

a.lepikhov@postgrespro.ru

about 2 years ago

In reply to: Konstantin Knizhnik (#4)

Re: Custom explain options

On 30/11/2023 22:40, Konstantin Knizhnik wrote:

On 30/11/2023 5:59 am, Andrei Lepikhov wrote:

On 21/10/2023 19:16, Konstantin Knizhnik wrote:

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about
page prefetching during query execution (prefetching is more critical
for Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching
layer above shared buffers which is used to somehow replace file
system cache in standalone Postgres).

I think that it will be nice to have a generic mechanism which allows
extensions to add its own options to EXPLAIN.

Generally, I welcome this idea: Extensions can already do a lot of
work, and they should have a tool to report their state, not only into
the log.
But I think your approach needs to be elaborated. At first, it would
be better to allow registering extended instruments for specific node
types to avoid unneeded calls.
Secondly, looking into the Instrumentation usage, I don't see the
reason to limit the size: as I see everywhere it exists locally or in
the DSA where its size is calculated on the fly. So, by registering an
extended instrument, we can reserve a slot for the extension. The
actual size of underlying data can be provided by the extension routine.

Thank you for review.

I agree that support of extended instruments is desired. I just tried to
minimize number of changes to make this patch smaller.

I got it. But having a substantial number of extensions in support, I
think the extension part of instrumentation could have advantages and be
worth elaborating on.

In all this cases we are using array of `Instrumentation` and if it
contains varying part, then it is not clear where to place it.
Yes, there is also code which serialize and sends instrumentations
between worker processes and I have updated this code in my PR to send
actual amount of custom instrumentation data. But it can not help with
the cases above.

I see next basic instruments in the code:
- Instrumentation (which should be named NodeInstrumentation)
- MemoizeInstrumentation
- JitInstrumentation
- AggregateInstrumentation
- HashInstrumentation
- TuplesortInstrumentation

As a variant, extensibility can be designed with parent
'AbstractInstrumentation' node, containing node type and link to
extensible part. sizeof(Instr) calls should be replaced with the
getInstrSize() call - not so much places in the code; memcpy() also can
be replaced with the copy_instr() routine.

--
regards,
Andrei Lepikhov
Postgres Professional

vignesh C

vignesh21@gmail.com

about 2 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Custom explain options

On Sat, 21 Oct 2023 at 18:34, Konstantin Knizhnik <knizhnik@garret.ru> wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is local file cache (extra caching layer above shared buffers which is used to somehow replace file system cache in standalone Postgres).

I think that it will be nice to have a generic mechanism which allows extensions to add its own options to EXPLAIN.
I have attached the patch with implementation of such mechanism (also available as PR: https://github.com/knizhnik/postgres/pull/1 )

I have demonstrated this mechanism using Bloom extension - just to report number of Bloom matches.
Not sure that it is really useful information but it is used mostly as example:

explain (analyze,bloom) select * from t where pk=2000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=15348.00..15352.01 rows=1 width=4) (actual time=25.244..25.939 rows=1 loops=1)
Recheck Cond: (pk = 2000)
Rows Removed by Index Recheck: 292
Heap Blocks: exact=283
Bloom: matches=293
-> Bitmap Index Scan on t_pk_idx (cost=0.00..15348.00 rows=1 width=0) (actual time=25.147..25.147 rows=293 loops=1)
Index Cond: (pk = 2000)
Bloom: matches=293
Planning:
Bloom: matches=0
Planning Time: 0.387 ms
Execution Time: 26.053 ms
(12 rows)

There are two known issues with this proposal:

There are few compilation errors reported by CFBot at [1]https://cirrus-ci.com/task/5452124486631424?logs=build#L374 with:
[05:00:40.452] ../src/backend/access/brin/brin.c: In function
‘_brin_end_parallel’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2675:3: error: too
few arguments to function ‘InstrAccumParallelQuery’
[05:00:40.452] 2675 |
InstrAccumParallelQuery(&brinleader->bufferusage[i],
&brinleader->walusage[i]);
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452] from ../src/include/access/brin.h:13,
[05:00:40.452] from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:151:13: note: declared here
[05:00:40.452] 151 | extern void InstrAccumParallelQuery(BufferUsage
*bufusage, WalUsage *walusage, char* custusage);
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] ../src/backend/access/brin/brin.c: In function
‘_brin_parallel_build_main’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2873:2: error: too
few arguments to function ‘InstrEndParallelQuery’
[05:00:40.452] 2873 |
InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452] from ../src/include/access/brin.h:13,
[05:00:40.452] from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:150:13: note: declared here
[05:00:40.452] 150 | extern void InstrEndParallelQuery(BufferUsage
*bufusage, WalUsage *walusage, char* custusage);

[1]: https://cirrus-ci.com/task/5452124486631424?logs=build#L374

Regards,
Vignesh

Andrei Lepikhov

a.lepikhov@postgrespro.ru

about 2 years ago

In reply to: Konstantin Knizhnik (#4)

Re: Custom explain options

On 30/11/2023 22:40, Konstantin Knizhnik wrote:

In all this cases we are using array of `Instrumentation` and if it
contains varying part, then it is not clear where to place it.
Yes, there is also code which serialize and sends instrumentations
between worker processes and I have updated this code in my PR to send
actual amount of custom instrumentation data. But it can not help with
the cases above.

What do you think about this really useful feature? Do you wish to
develop it further?

--
regards,
Andrei Lepikhov
Postgres Professional

Michael Paquier

michael@paquier.xyz

about 2 years ago

In reply to: Andrei Lepikhov (#7)

Re: Custom explain options

On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:

What do you think about this really useful feature? Do you wish to develop
it further?

I am biased here. This seems like a lot of code for something we've
been delegating to the explain hook for ages. Even if I can see the
appeal of pushing that more into explain.c to get more data on a
per-node basis depending on the custom options given by the caller of
an EXPLAIN entry point, I cannot get really excited about the extra
maintenance this facility would involve compared to the potential
gains, knowing that there's a hook.
--
Michael

Konstantin Knizhnik

knizhnik@garret.ru

about 2 years ago

In reply to: Michael Paquier (#8)

Re: Custom explain options

On 10/01/2024 8:46 am, Michael Paquier wrote:

On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:

What do you think about this really useful feature? Do you wish to develop
it further?

I am biased here. This seems like a lot of code for something we've
been delegating to the explain hook for ages. Even if I can see the
appeal of pushing that more into explain.c to get more data on a
per-node basis depending on the custom options given by the caller of
an EXPLAIN entry point, I cannot get really excited about the extra
maintenance this facility would involve compared to the potential
gains, knowing that there's a hook.
--
Michael

Well, I am not sure that proposed patch is flexible enough to handle all
possible scenarios.
I just wanted to make it as simple as possible to leave some chances for
it to me merged.
But it is easy to answer the question why existed explain hook is not
enough:

1. It doesn't allow to add some extra options to EXPLAIN. My intention
was to be able to do something like this "explain
(analyze,buffers,prefetch) ...". It is completely not possible with
explain hook.
2. May be I wrong, but it is impossible now to collect and combine
instrumentation from all parallel workers without changing Postgres core

Explain hook can be useful if you add some custom node to query
execution plan and want to provide information about this node.
But if you are implementing some alternative storage mechanism or some
optimization for existed plan nodes, then it is very difficult to do it
using existed explain hook.

#10

Konstantin Knizhnik

knizhnik@garret.ru

about 2 years ago

In reply to: Andrei Lepikhov (#7)

Re: Custom explain options

On 10/01/2024 8:29 am, Andrei Lepikhov wrote:

On 30/11/2023 22:40, Konstantin Knizhnik wrote:

In all this cases we are using array of `Instrumentation` and if it
contains varying part, then it is not clear where to place it.
Yes, there is also code which serialize and sends instrumentations
between worker processes and I have updated this code in my PR to
send actual amount of custom instrumentation data. But it can not
help with the cases above.

What do you think about this really useful feature? Do you wish to
develop it further?

In Neon (cloud Postgres) we have changed Postgres core to include in
explain information about prefetch and local file cache.
EXPLAIN seems to be most convenient way for users to get this
information which can be very useful for investigation of query
execution speed.
So my intention was to make it possible to add extra information to
explain without patching Postgres core.
Existed explain hook is not enough for it.

I am not sure that the suggested approach is flexible enough. First of
all I tried to make it is simple as possible, minimizing changes in
Postgres core.

#11

Konstantin Knizhnik

knizhnik@garret.ru

about 2 years ago

In reply to: vignesh C (#6)

1 attachment(s)

Re: Custom explain options

On 09/01/2024 10:33 am, vignesh C wrote:

On Sat, 21 Oct 2023 at 18:34, Konstantin Knizhnik <knizhnik@garret.ru> wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is local file cache (extra caching layer above shared buffers which is used to somehow replace file system cache in standalone Postgres).

I think that it will be nice to have a generic mechanism which allows extensions to add its own options to EXPLAIN.
I have attached the patch with implementation of such mechanism (also available as PR: https://github.com/knizhnik/postgres/pull/1 )

I have demonstrated this mechanism using Bloom extension - just to report number of Bloom matches.
Not sure that it is really useful information but it is used mostly as example:

explain (analyze,bloom) select * from t where pk=2000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=15348.00..15352.01 rows=1 width=4) (actual time=25.244..25.939 rows=1 loops=1)
Recheck Cond: (pk = 2000)
Rows Removed by Index Recheck: 292
Heap Blocks: exact=283
Bloom: matches=293
-> Bitmap Index Scan on t_pk_idx (cost=0.00..15348.00 rows=1 width=0) (actual time=25.147..25.147 rows=293 loops=1)
Index Cond: (pk = 2000)
Bloom: matches=293
Planning:
Bloom: matches=0
Planning Time: 0.387 ms
Execution Time: 26.053 ms
(12 rows)

There are two known issues with this proposal:

There are few compilation errors reported by CFBot at [1] with:
[05:00:40.452] ../src/backend/access/brin/brin.c: In function
‘_brin_end_parallel’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2675:3: error: too
few arguments to function ‘InstrAccumParallelQuery’
[05:00:40.452] 2675 |
InstrAccumParallelQuery(&brinleader->bufferusage[i],
&brinleader->walusage[i]);
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452] from ../src/include/access/brin.h:13,
[05:00:40.452] from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:151:13: note: declared here
[05:00:40.452] 151 | extern void InstrAccumParallelQuery(BufferUsage
*bufusage, WalUsage *walusage, char* custusage);
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] ../src/backend/access/brin/brin.c: In function
‘_brin_parallel_build_main’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2873:2: error: too
few arguments to function ‘InstrEndParallelQuery’
[05:00:40.452] 2873 |
InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452] from ../src/include/access/brin.h:13,
[05:00:40.452] from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:150:13: note: declared here
[05:00:40.452] 150 | extern void InstrEndParallelQuery(BufferUsage
*bufusage, WalUsage *walusage, char* custusage);

[1] - https://cirrus-ci.com/task/5452124486631424?logs=build#L374

Regards,
Vignesh

Thank you for reporting the problem.
Rebased version of the patch is attached.

Attachments:

custom_explain_options-v2.patchtext/plain; charset=UTF-8; name=custom_explain_options-v2.patchDownload

diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index fba3ba7771..383b1fb8e3 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -174,6 +174,13 @@ typedef struct BloomScanOpaqueData
 
 typedef BloomScanOpaqueData *BloomScanOpaque;
 
+typedef struct
+{
+	uint64 matches;
+} BloomUsage;
+
+extern BloomUsage bloomUsage;
+
 /* blutils.c */
 extern void initBloomState(BloomState *state, Relation index);
 extern void BloomFillMetapage(Relation index, Page metaPage);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 0a303a49b2..da29dc7f68 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -166,6 +166,6 @@ blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 		CHECK_FOR_INTERRUPTS();
 	}
 	FreeAccessStrategy(bas);
-
+	bloomUsage.matches += ntids;
 	return ntids;
 }
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6836129c90..9120f96083 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -18,6 +18,7 @@
 #include "access/reloptions.h"
 #include "bloom.h"
 #include "catalog/index.h"
+#include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -34,6 +35,55 @@
 
 PG_FUNCTION_INFO_V1(blhandler);
 
+BloomUsage bloomUsage;
+
+static void
+bloomUsageAdd(CustomResourceUsage* dst, CustomResourceUsage const* add)
+{
+	((BloomUsage*)dst)->matches += ((BloomUsage*)add)->matches;
+}
+
+static void
+bloomUsageAccum(CustomResourceUsage* acc, CustomResourceUsage const* end, CustomResourceUsage const* start)
+{
+	((BloomUsage*)acc)->matches += ((BloomUsage*)end)->matches - ((BloomUsage*)start)->matches;;
+}
+
+static void
+bloomUsageShow(ExplainState* es, CustomResourceUsage const* usage, bool planning)
+{
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		if (planning)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Planning:\n");
+			es->indent++;
+		}
+		ExplainIndentText(es);
+		appendStringInfoString(es->str, "Bloom:");
+		appendStringInfo(es->str, " matches=%lld",
+						 (long long) ((BloomUsage*)usage)->matches);
+		appendStringInfoChar(es->str, '\n');
+		if (planning)
+			es->indent--;
+	}
+	else
+	{
+		ExplainPropertyInteger("Bloom Matches", NULL,
+							   ((BloomUsage*)usage)->matches, es);
+	}
+}
+
+static CustomInstrumentation bloomInstr = {
+	"bloom",
+	sizeof(BloomUsage),
+	&bloomUsage,
+	bloomUsageAdd,
+	bloomUsageAccum,
+	bloomUsageShow
+};
+
 /* Kind of relation options for bloom index */
 static relopt_kind bl_relopt_kind;
 
@@ -78,6 +128,7 @@ _PG_init(void)
 		bl_relopt_tab[i + 1].opttype = RELOPT_TYPE_INT;
 		bl_relopt_tab[i + 1].offset = offsetof(BloomOptions, bitSize[0]) + sizeof(int) * i;
 	}
+	RegisterCustomInsrumentation(&bloomInstr);
 }
 
 /*
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1087a9011e..67b9d8351e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -49,6 +49,7 @@
 #define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_CUST_USAGE		    UINT64CONST(0xA000000000000006)
 
 /*
  * Status for index builds performed in parallel.  This is allocated in a
@@ -143,6 +144,7 @@ typedef struct BrinLeader
 	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char        *custusage;
 } BrinLeader;
 
 /*
@@ -2342,6 +2344,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BrinLeader *brinleader = (BrinLeader *) palloc0(sizeof(BrinLeader));
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char        *custusage;
 	bool		leaderparticipates = true;
 	int			querylen;
 
@@ -2396,6 +2399,10 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 
 	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
 	if (debug_query_string)
@@ -2475,6 +2482,9 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	bufferusage = shm_toc_allocate(pcxt->toc,
 								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+	custusage = shm_toc_allocate(pcxt->toc,
+								 mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUST_USAGE, custusage);
 
 	/* Launch workers, saving status for leader/caller */
 	LaunchParallelWorkers(pcxt);
@@ -2487,6 +2497,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
+	brinleader->custusage = custusage;
 
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
@@ -2669,7 +2680,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	 * or we might get incomplete data.)
 	 */
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
-		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i], brinleader->custusage + pgCustUsageSize*i);
 
 cleanup:
 
@@ -2793,6 +2804,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	LOCKMODE	indexLockmode;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char		*custusage;
 	int			sortmem;
 
 	/*
@@ -2852,8 +2864,10 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	custusage = shm_toc_lookup(toc, PARALLEL_KEY_CUST_USAGE, false);
 	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
-						  &walusage[ParallelWorkerNumber]);
+						  &walusage[ParallelWorkerNumber],
+						  custusage + ParallelWorkerNumber*pgCustUsageSize);
 
 	index_close(indexRel, indexLockmode);
 	table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2011196579..3e7b567d37 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -71,6 +71,7 @@
 #define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xA000000000000004)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xA000000000000005)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_CUST_USAGE		    UINT64CONST(0xA000000000000007)
 
 /*
  * DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -197,6 +198,7 @@ typedef struct BTLeader
 	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char       *custusage;
 } BTLeader;
 
 /*
@@ -1467,6 +1469,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BTLeader   *btleader = (BTLeader *) palloc0(sizeof(BTLeader));
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char        *custusage;
 	bool		leaderparticipates = true;
 	int			querylen;
 
@@ -1532,6 +1535,9 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
 	if (debug_query_string)
@@ -1625,6 +1631,10 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	bufferusage = shm_toc_allocate(pcxt->toc,
 								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+	custusage = shm_toc_allocate(pcxt->toc,
+								 mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUST_USAGE, custusage);
+
 
 	/* Launch workers, saving status for leader/caller */
 	LaunchParallelWorkers(pcxt);
@@ -1638,6 +1648,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btleader->snapshot = snapshot;
 	btleader->walusage = walusage;
 	btleader->bufferusage = bufferusage;
+	btleader->custusage = custusage;
 
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
@@ -1676,7 +1687,7 @@ _bt_end_parallel(BTLeader *btleader)
 	 * or we might get incomplete data.)
 	 */
 	for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
-		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i], btleader->custusage + pgCustUsageSize*i);
 
 	/* Free last reference to MVCC snapshot, if one was used */
 	if (IsMVCCSnapshot(btleader->snapshot))
@@ -1811,6 +1822,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	LOCKMODE	indexLockmode;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
+	char       *custusage;
 	int			sortmem;
 
 #ifdef BTREE_BUILD_STATS
@@ -1891,8 +1903,10 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	custusage = shm_toc_lookup(toc, PARALLEL_KEY_CUST_USAGE, false);
 	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
-						  &walusage[ParallelWorkerNumber]);
+						  &walusage[ParallelWorkerNumber],
+						  custusage + ParallelWorkerNumber*pgCustUsageSize);
 
 #ifdef BTREE_BUILD_STATS
 	if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 3d590a6b9f..d3ebb2a381 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -119,6 +119,8 @@ static void show_instrumentation_count(const char *qlabel, int which,
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
+static void show_custom_usage(ExplainState *es, const char* usage,
+							  bool planning);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 							  bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
@@ -149,7 +151,6 @@ static void ExplainRestoreGroup(ExplainState *es, int depth, int *state_save);
 static void ExplainDummyGroup(const char *objtype, const char *labelname,
 							  ExplainState *es);
 static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
-static void ExplainIndentText(ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
@@ -170,6 +171,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 	Query	   *query;
 	List	   *rewritten;
 	ListCell   *lc;
+	ListCell   *c;
 	bool		timing_set = false;
 	bool		summary_set = false;
 
@@ -222,11 +224,28 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 						 parser_errposition(pstate, opt->location)));
 		}
 		else
-			ereport(ERROR,
+		{
+			bool found = false;
+			foreach (c, pgCustInstr)
+			{
+				CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(c);
+				if (strcmp(opt->defname, ci->name) == 0)
+				{
+					ci->selected = true;
+					es->custom = true;
+					found = true;
+					break;
+				}
+			}
+			if (!found)
+			{
+				ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
 					 errmsg("unrecognized EXPLAIN option \"%s\"",
 							opt->defname),
 					 parser_errposition(pstate, opt->location)));
+			}
+		}
 	}
 
 	/* check that WAL is used with EXPLAIN ANALYZE */
@@ -320,12 +339,19 @@ ExplainState *
 NewExplainState(void)
 {
 	ExplainState *es = (ExplainState *) palloc0(sizeof(ExplainState));
+	ListCell* lc;
 
 	/* Set default options (most fields can be left as zeroes). */
 	es->costs = true;
 	/* Prepare output buffer. */
 	es->str = makeStringInfo();
 
+	/* Reset custom instrumentations selection flag */
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->selected = false;
+	}
 	return es;
 }
 
@@ -397,9 +423,14 @@ ExplainOneQuery(Query *query, int cursorOptions,
 					planduration;
 		BufferUsage bufusage_start,
 					bufusage;
+		CustomInstrumentationData custusage_start, custusage;
 
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
+
+		if (es->custom)
+			GetCustomInstrumentationState(custusage_start.data);
+
 		INSTR_TIME_SET_CURRENT(planstart);
 
 		/* plan the query */
@@ -415,9 +446,14 @@ ExplainOneQuery(Query *query, int cursorOptions,
 			BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 		}
 
+		if (es->custom)
+			AccumulateCustomInstrumentationState(custusage.data, custusage_start.data);
+
 		/* run it (if needed) and produce output */
 		ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-					   &planduration, (es->buffers ? &bufusage : NULL));
+					   &planduration,
+					   (es->buffers ? &bufusage : NULL),
+					   (es->custom ? &custusage : NULL));
 	}
 }
 
@@ -527,7 +563,8 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage)
+			   const BufferUsage *bufusage,
+			   const CustomInstrumentationData *custusage)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -623,6 +660,13 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		ExplainCloseGroup("Planning", "Planning", true, es);
 	}
 
+	if (custusage)
+	{
+		ExplainOpenGroup("Planning", "Planning", true, es);
+		show_custom_usage(es, custusage->data, true);
+		ExplainCloseGroup("Planning", "Planning", true, es);
+	}
+
 	if (es->summary && planduration)
 	{
 		double		plantime = INSTR_TIME_GET_DOUBLE(*planduration);
@@ -2110,8 +2154,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->walusage);
 
+	/* Show custom instrumentation */
+	if (es->custom && planstate->instrument)
+		show_custom_usage(es, planstate->instrument->cust_usage.data, false);
+
 	/* Prepare per-worker buffer/WAL usage */
-	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
+	if (es->workers_state && (es->buffers || es->wal || es->custom) && es->verbose)
 	{
 		WorkerInstrumentation *w = planstate->worker_instrument;
 
@@ -2128,6 +2176,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_buffer_usage(es, &instrument->bufusage, false);
 			if (es->wal)
 				show_wal_usage(es, &instrument->walusage);
+			if (es->custom)
+				show_custom_usage(es, instrument->cust_usage.data, false);
 			ExplainCloseWorker(n, es);
 		}
 	}
@@ -3544,6 +3594,23 @@ explain_get_index_name(Oid indexId)
 	return result;
 }
 
+/*
+ * Show select custom usage details
+ */
+static void
+show_custom_usage(ExplainState *es, const char* usage, bool planning)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		if (ci->selected)
+			ci->show(es, usage, planning);
+		usage += ci->size;
+	}
+}
+
 /*
  * Show buffer usage details.
  */
@@ -5017,7 +5084,7 @@ ExplainXMLTag(const char *tagname, int flags, ExplainState *es)
  * data for a parallel worker there might already be data on the current line
  * (cf. ExplainOpenWorker); in that case, don't indent any more.
  */
-static void
+void
 ExplainIndentText(ExplainState *es)
 {
 	Assert(es->format == EXPLAIN_FORMAT_TEXT);
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 703976f633..806dd1e85c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,9 +583,13 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	instr_time	planduration;
 	BufferUsage bufusage_start,
 				bufusage;
+	CustomInstrumentationData custusage_start, custusage;
 
 	if (es->buffers)
 		bufusage_start = pgBufferUsage;
+	if (es->custom)
+		GetCustomInstrumentationState(custusage_start.data);
+
 	INSTR_TIME_SET_CURRENT(planstart);
 
 	/* Look it up in the hash table */
@@ -630,6 +634,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		memset(&bufusage, 0, sizeof(BufferUsage));
 		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 	}
+	if (es->custom)
+		AccumulateCustomInstrumentationState(custusage.data, custusage_start.data);
 
 	plan_list = cplan->stmt_list;
 
@@ -640,7 +646,9 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 
 		if (pstmt->commandType != CMD_UTILITY)
 			ExplainOnePlan(pstmt, into, es, query_string, paramLI, queryEnv,
-						   &planduration, (es->buffers ? &bufusage : NULL));
+						   &planduration,
+						   (es->buffers ? &bufusage : NULL),
+						   (es->custom ? &custusage : NULL));
 		else
 			ExplainOneUtility(pstmt->utilityStmt, into, es, query_string,
 							  paramLI, queryEnv);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index e087dfd72e..9b2add2b65 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -50,6 +50,7 @@
 #define PARALLEL_VACUUM_KEY_BUFFER_USAGE	4
 #define PARALLEL_VACUUM_KEY_WAL_USAGE		5
 #define PARALLEL_VACUUM_KEY_INDEX_STATS		6
+#define PARALLEL_VACUUM_KEY_CUSTOM_USAGE	7
 
 /*
  * Shared information among parallel workers.  So this is allocated in the DSM
@@ -184,6 +185,9 @@ struct ParallelVacuumState
 	/* Points to WAL usage area in DSM */
 	WalUsage   *wal_usage;
 
+	/* Points to custom usage area in DSM */
+	char *custom_usage;
+
 	/*
 	 * False if the index is totally unsuitable target for all parallel
 	 * processing. For example, the index could be <
@@ -242,6 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 	PVIndStats *indstats;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char       *custom_usage;
 	bool	   *will_parallel_vacuum;
 	Size		est_indstats_len;
 	Size		est_shared_len;
@@ -313,6 +318,9 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
 	/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
 	if (debug_query_string)
@@ -403,6 +411,10 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 								 mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
 	pvs->wal_usage = wal_usage;
+    custom_usage = shm_toc_allocate(pcxt->toc,
+									mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_CUSTOM_USAGE, custom_usage);
+	pvs->custom_usage = custom_usage;
 
 	/* Store query string for workers */
 	if (debug_query_string)
@@ -706,7 +718,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
 		WaitForParallelWorkersToFinish(pvs->pcxt);
 
 		for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
-			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+			InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i], pvs->custom_usage + pgCustUsageSize*i);
 	}
 
 	/*
@@ -964,6 +976,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	VacDeadItems *dead_items;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char       *custom_usage;
 	int			nindexes;
 	char	   *sharedquery;
 	ErrorContextCallback errcallback;
@@ -1053,8 +1066,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
+	custom_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_CUSTOM_USAGE, false);
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
-						  &wal_usage[ParallelWorkerNumber]);
+						  &wal_usage[ParallelWorkerNumber],
+						  custom_usage + pgCustUsageSize*ParallelWorkerNumber);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 540f8d21fd..f06553e5b6 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -66,6 +66,7 @@
 #define PARALLEL_KEY_QUERY_TEXT		UINT64CONST(0xE000000000000008)
 #define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_CUSTOM_USAGE		UINT64CONST(0xE00000000000000B)
 
 #define PARALLEL_TUPLE_QUEUE_SIZE		65536
 
@@ -600,6 +601,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	char	   *paramlistinfo_space;
 	BufferUsage *bufusage_space;
 	WalUsage   *walusage_space;
+	char       *customusage_space;
 	SharedExecutorInstrumentation *instrumentation = NULL;
 	SharedJitInstrumentation *jit_instrumentation = NULL;
 	int			pstmt_len;
@@ -680,6 +682,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/*
+	 * Same thing for CustomUsage.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 	/* Estimate space for tuple queues. */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -768,6 +777,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
 	pei->wal_usage = walusage_space;
 
+	customusage_space = shm_toc_allocate(pcxt->toc,
+										 mul_size(pgCustUsageSize, pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_CUSTOM_USAGE, customusage_space);
+	pei->custom_usage = customusage_space;
+
 	/* Set up the tuple queues that the workers will write into. */
 	pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
 
@@ -1164,7 +1178,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 	 * finish, or we might get incomplete data.)
 	 */
 	for (i = 0; i < nworkers; i++)
-		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i], pei->custom_usage + pgCustUsageSize*i);
 
 	pei->finished = true;
 }
@@ -1397,6 +1411,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	FixedParallelExecutorState *fpes;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	char *custom_usage; 
 	DestReceiver *receiver;
 	QueryDesc  *queryDesc;
 	SharedExecutorInstrumentation *instrumentation;
@@ -1472,8 +1487,10 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	custom_usage = shm_toc_lookup(toc, PARALLEL_KEY_CUSTOM_USAGE, false);
 	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
-						  &wal_usage[ParallelWorkerNumber]);
+						  &wal_usage[ParallelWorkerNumber],
+						  custom_usage + ParallelWorkerNumber*pgCustUsageSize);
 
 	/* Report instrumentation data if any instrumentation options are set. */
 	if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 268ae8a945..fd1d1c7b7e 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,15 +16,62 @@
 #include <unistd.h>
 
 #include "executor/instrument.h"
+#include "utils/memutils.h"
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
 static WalUsage save_pgWalUsage;
 
+List* pgCustInstr; /* description of custom instriumentations */
+Size pgCustUsageSize;
+static CustomInstrumentationData save_pgCustUsage; /* saved custom instrumentation state */
+
+
 static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
+void
+RegisterCustomInsrumentation(CustomInstrumentation* inst)
+{
+	MemoryContext oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+	pgCustInstr = lappend(pgCustInstr, inst);
+	pgCustUsageSize += inst->size;
+	MemoryContextSwitchTo(oldcontext);
+	if (pgCustUsageSize > MAX_CUSTOM_INSTR_SIZE)
+		elog(ERROR, "Total size of custom instrumentations exceed limit %d",  MAX_CUSTOM_INSTR_SIZE);
+}
+
+void
+GetCustomInstrumentationState(char* dst)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		memcpy(dst, ci->usage, ci->size);
+		dst += ci->size;
+	}
+}
+
+void
+AccumulateCustomInstrumentationState(char* dst, char const* before)
+{
+	ListCell* lc;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);	
+		if (ci->selected)
+		{
+			memset(dst, 0, ci->size);
+			ci->accum(dst, ci->usage, before);
+		}
+		dst += ci->size;
+		before += ci->size;
+	}
+}
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -49,7 +96,6 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 			instr[i].async_mode = async_mode;
 		}
 	}
-
 	return instr;
 }
 
@@ -67,6 +113,8 @@ InstrInit(Instrumentation *instr, int instrument_options)
 void
 InstrStartNode(Instrumentation *instr)
 {
+	ListCell *lc;
+	char* cust_start = instr->cust_usage_start.data;
 	if (instr->need_timer &&
 		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
 		elog(ERROR, "InstrStartNode called twice in a row");
@@ -77,6 +125,13 @@ InstrStartNode(Instrumentation *instr)
 
 	if (instr->need_walusage)
 		instr->walusage_start = pgWalUsage;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		memcpy(cust_start, ci->usage, ci->size);
+		cust_start += ci->size;
+	}
 }
 
 /* Exit from a plan node */
@@ -85,6 +140,9 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 {
 	double		save_tuplecount = instr->tuplecount;
 	instr_time	endtime;
+	ListCell *lc;
+	char *cust_start = instr->cust_usage_start.data;
+	char *cust_usage = instr->cust_usage.data;
 
 	/* count the returned tuples */
 	instr->tuplecount += nTuples;
@@ -110,7 +168,15 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 		WalUsageAccumDiff(&instr->walusage,
 						  &pgWalUsage, &instr->walusage_start);
 
-	/* Is this the first tuple of this cycle? */
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->accum(cust_usage, ci->usage, cust_start);
+		cust_start += ci->size;
+		cust_usage += ci->size;
+	}
+
+    /* Is this the first tuple of this cycle? */
 	if (!instr->running)
 	{
 		instr->running = true;
@@ -168,6 +234,10 @@ InstrEndLoop(Instrumentation *instr)
 void
 InstrAggNode(Instrumentation *dst, Instrumentation *add)
 {
+	ListCell *lc;
+	char *cust_dst = dst->cust_usage.data;
+	char *cust_add = add->cust_usage.data;
+
 	if (!dst->running && add->running)
 	{
 		dst->running = true;
@@ -193,32 +263,69 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 
 	if (dst->need_walusage)
 		WalUsageAdd(&dst->walusage, &add->walusage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->add(cust_dst, cust_add);
+		cust_dst += ci->size;
+		cust_add += ci->size;
+	}
 }
 
 /* note current values during parallel executor startup */
 void
 InstrStartParallelQuery(void)
 {
+	ListCell* lc;
+	char* cust_dst = save_pgCustUsage.data;
+
 	save_pgBufferUsage = pgBufferUsage;
 	save_pgWalUsage = pgWalUsage;
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		memcpy(cust_dst, ci->usage, ci->size);
+		cust_dst += ci->size;
+	}
 }
 
 /* report usage after parallel executor shutdown */
 void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
+	ListCell *lc;
+	char* cust_save = save_pgCustUsage.data;
+
 	memset(bufusage, 0, sizeof(BufferUsage));
 	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
 	memset(walusage, 0, sizeof(WalUsage));
 	WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->accum(cust_usage, ci->usage, cust_save);
+		cust_usage += ci->size;
+		cust_save += ci->size;
+	}
 }
 
 /* accumulate work done by workers in leader's stats */
 void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* cust_usage)
 {
+	ListCell *lc;
 	BufferUsageAdd(&pgBufferUsage, bufusage);
 	WalUsageAdd(&pgWalUsage, walusage);
+
+	foreach (lc, pgCustInstr)
+	{
+		CustomInstrumentation *ci = (CustomInstrumentation*)lfirst(lc);
+		ci->add(ci->usage, cust_usage);
+		cust_usage += ci->size;
+	}
 }
 
 /* dst += add */
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 1b44d483d6..c60f6ffbbe 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -41,6 +41,7 @@ typedef struct ExplainState
 	bool		verbose;		/* be verbose */
 	bool		analyze;		/* print actual times */
 	bool		costs;			/* print estimated costs */
+	bool		custom;		    /* print custom usage */
 	bool		buffers;		/* print buffer usage */
 	bool		wal;			/* print WAL usage */
 	bool		timing;			/* print detailed node timing */
@@ -92,7 +93,8 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
 						   ExplainState *es, const char *queryString,
 						   ParamListInfo params, QueryEnvironment *queryEnv,
 						   const instr_time *planduration,
-						   const BufferUsage *bufusage);
+						   const BufferUsage *bufusage,
+						   const CustomInstrumentationData *custusage);
 
 extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
 extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
@@ -125,5 +127,6 @@ extern void ExplainOpenGroup(const char *objtype, const char *labelname,
 							 bool labeled, ExplainState *es);
 extern void ExplainCloseGroup(const char *objtype, const char *labelname,
 							  bool labeled, ExplainState *es);
+extern void ExplainIndentText(ExplainState *es);
 
 #endif							/* EXPLAIN_H */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 6b8c00bb0f..8a828c04db 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -27,6 +27,7 @@ typedef struct ParallelExecutorInfo
 	ParallelContext *pcxt;		/* parallel context we're using */
 	BufferUsage *buffer_usage;	/* points to bufusage area in DSM */
 	WalUsage   *wal_usage;		/* walusage area in DSM */
+	char *custom_usage;         /* points to custiom usage area in DSM */
 	SharedExecutorInstrumentation *instrumentation; /* optional */
 	struct SharedJitInstrumentation *jit_instrumentation;	/* optional */
 	dsa_area   *area;			/* points to DSA area in DSM */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bfd7b6d844..99ba91f57d 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,7 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
-
+#include "nodes/pg_list.h"
 
 /*
  * BufferUsage and WalUsage counters keep being incremented infinitely,
@@ -65,6 +65,36 @@ typedef enum InstrumentOption
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
+/*
+ * Maximal total size of all custom intrumentations
+ */
+#define MAX_CUSTOM_INSTR_SIZE 128
+
+typedef struct {
+	char data[MAX_CUSTOM_INSTR_SIZE];
+} CustomInstrumentationData;
+
+typedef void CustomResourceUsage;
+typedef struct ExplainState ExplainState;
+typedef void (*cust_instr_add_t)(CustomResourceUsage* dst, CustomResourceUsage const* add);
+typedef void (*cust_instr_accum_t)(CustomResourceUsage* acc, CustomResourceUsage const* end, CustomResourceUsage const* start);
+typedef void (*cust_instr_show_t)(ExplainState* es, CustomResourceUsage const* usage, bool planning);
+
+typedef struct
+{
+	char const*          name; /* instrumentation name (as will be recongnized in EXPLAIN options */
+	Size                 size;
+	CustomResourceUsage* usage;
+	cust_instr_add_t     add;
+	cust_instr_accum_t   accum;
+	cust_instr_show_t    show;
+	bool                 selected; /* selected in EXPLAIN options */
+} CustomInstrumentation;
+
+extern PGDLLIMPORT List* pgCustInstr; /* description of custom instrumentations */
+extern Size              pgCustUsageSize;
+
+
 typedef struct Instrumentation
 {
 	/* Parameters set at node creation: */
@@ -90,6 +120,8 @@ typedef struct Instrumentation
 	double		nfiltered2;		/* # of tuples removed by "other" quals */
 	BufferUsage bufusage;		/* total buffer usage */
 	WalUsage	walusage;		/* total WAL usage */
+	CustomInstrumentationData cust_usage_start; /* state of custom usage at start */
+	CustomInstrumentationData cust_usage; /* total custom  usage */
 } Instrumentation;
 
 typedef struct WorkerInstrumentation
@@ -101,6 +133,11 @@ typedef struct WorkerInstrumentation
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
 
+
+extern void RegisterCustomInsrumentation(CustomInstrumentation* inst);
+extern void GetCustomInstrumentationState(char* dst);
+extern void AccumulateCustomInstrumentationState(char* dst, char const* before);
+
 extern Instrumentation *InstrAlloc(int n, int instrument_options,
 								   bool async_mode);
 extern void InstrInit(Instrumentation *instr, int instrument_options);
@@ -110,8 +147,8 @@ extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
 extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
 extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);
 extern void BufferUsageAccumDiff(BufferUsage *dst,
 								 const BufferUsage *add, const BufferUsage *sub);
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,

#12

Andrei Lepikhov

a.lepikhov@postgrespro.ru

about 2 years ago

In reply to: Konstantin Knizhnik (#9)

Re: Custom explain options

On 10/1/2024 20:27, Konstantin Knizhnik wrote:

On 10/01/2024 8:46 am, Michael Paquier wrote:

On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:

What do you think about this really useful feature? Do you wish to
develop
it further?

I am biased here. This seems like a lot of code for something we've
been delegating to the explain hook for ages. Even if I can see the
appeal of pushing that more into explain.c to get more data on a
per-node basis depending on the custom options given by the caller of
an EXPLAIN entry point, I cannot get really excited about the extra
maintenance this facility would involve compared to the potential
gains, knowing that there's a hook.
--
Michael

Well, I am not sure that proposed patch is flexible enough to handle all
possible scenarios.
I just wanted to make it as simple as possible to leave some chances for
it to me merged.
But it is easy to answer the question why existed explain hook is not
enough:

1. It doesn't allow to add some extra options to EXPLAIN. My intention
was to be able to do something like this "explain
(analyze,buffers,prefetch) ...". It is completely not possible with
explain hook.

I agree. Designing mostly planner-related extensions, I also wanted to
add some information to the explain of nodes. For example,
pg_query_state could add the status of the node at the time of
interruption of execution: started, stopped, or loop closed.
Maybe we should gather some statistics on how developers of extensions
deal with that issue ...

--
regards,
Andrei Lepikhov
Postgres Professional

#13

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 2 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Custom explain options

On 10/21/23 14:16, Konstantin Knizhnik wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching layer
above shared buffers which is used to somehow replace file system cache
in standalone Postgres).

Not quite related to this patch about EXPLAIN options, but can you share
some details how you implemented prefetching for the other nodes?

I'm asking because I've been working on prefetching for index scans, so
I'm wondering if there's a better way to do this, or how to do it in a
way that would allow neon to maybe leverage that too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14

Konstantin Knizhnik

knizhnik@garret.ru

almost 2 years ago

In reply to: Tomas Vondra (#13)

Re: Custom explain options

On 12/01/2024 7:03 pm, Tomas Vondra wrote:

On 10/21/23 14:16, Konstantin Knizhnik wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching layer
above shared buffers which is used to somehow replace file system cache
in standalone Postgres).

Not quite related to this patch about EXPLAIN options, but can you share
some details how you implemented prefetching for the other nodes?

I'm asking because I've been working on prefetching for index scans, so
I'm wondering if there's a better way to do this, or how to do it in a
way that would allow neon to maybe leverage that too.

regards

Yes, I am looking at your PR. What we have implemented in Neon is more
specific to Neon architecture where storage is separated from compute.
So each page not found in shared buffers has to be downloaded from page
server. It adds quite noticeable latency, because of network roundtrip.
While vanilla Postgres can rely on OS file system cache when page is not
found in shared buffer (access to OS file cache is certainly slower than
to shared buffers
because of syscall and copying of page, but performance penaly is not
very large - less than 15%), Neon has no local files and so has to send
request to the socket.

This is why we have to perform aggressive prefetching whenever it is
possible (when it it is possible to predict order of subsequent pages).
Unlike vanilla Postgres which implements prefetch only for bitmap heap
scan, we have implemented it for seqscan, index scan, indexonly scan,
bitmap heap scan, vacuum, pg_prewarm.
The main difference between Neon prefetch and vanilla Postgres prefetch
is that first one is backend specific. So each backend prefetches only
pages which it needs.
This is why we have to rewrite prefetch for bitmap heap scan, which is
using `fadvise` and assumes that pages prefetched by one backend in file
cache, can be used by any other backend.

Concerning index scan we have implemented two different approaches: for
index only scan we try to prefetch leave pages and for index scan we
prefetch referenced heap pages.
In both cases we start from prefetch distance 0 and increase it until it
reaches `effective_io_concurrency` for this relation. Doing so we try to
avoid prefetching of useless pages and slowdown of "point" lookups
returning one or few records.

If you are interested, you can look at our implementation in neon repo:
all source are available. But briefly speaking, each backend has its own
prefetch ring (prefetch requests which are waiting for response). The
key idea is that we can send several prefetch requests to page server
and then receive multiple replies. It allows to increased speed of OLAP
queries up to 10 times.

Heikki thinks that prefetch can be somehow combined with async-io
proposal (based on io_uring). But right now they have nothing in common.

#15

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 2 years ago

In reply to: Konstantin Knizhnik (#14)

Re: Custom explain options

On 1/12/24 20:30, Konstantin Knizhnik wrote:

On 12/01/2024 7:03 pm, Tomas Vondra wrote:

On 10/21/23 14:16, Konstantin Knizhnik wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for
Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching layer
above shared buffers which is used to somehow replace file system cache
in standalone Postgres).

Not quite related to this patch about EXPLAIN options, but can you share
some details how you implemented prefetching for the other nodes?

I'm asking because I've been working on prefetching for index scans, so
I'm wondering if there's a better way to do this, or how to do it in a
way that would allow neon to maybe leverage that too.

regards

Yes, I am looking at your PR. What we have implemented in Neon is more
specific to Neon architecture where storage is separated from compute.
So each page not found in shared buffers has to be downloaded from page
server. It adds quite noticeable latency, because of network roundtrip.
While vanilla Postgres can rely on OS file system cache when page is not
found in shared buffer (access to OS file cache is certainly slower than
to shared buffers
because of syscall and copying of page, but performance penaly is not
very large - less than 15%), Neon has no local files and so has to send
request to the socket.

This is why we have to perform aggressive prefetching whenever it is
possible (when it it is possible to predict order of subsequent pages).
Unlike vanilla Postgres which implements prefetch only for bitmap heap
scan, we have implemented it for seqscan, index scan, indexonly scan,
bitmap heap scan, vacuum, pg_prewarm.
The main difference between Neon prefetch and vanilla Postgres prefetch
is that first one is backend specific. So each backend prefetches only
pages which it needs.
This is why we have to rewrite prefetch for bitmap heap scan, which is
using `fadvise` and assumes that pages prefetched by one backend in file
cache, can be used by any other backend.

I do understand why prefetching is important in neon (likely more than
for core postgres). I'm interested in how it's actually implemented,
whether it's somehow similar to how my patch does things or in some
different (perhaps neon-specific way), and if the approaches are
different then what are the pros/cons. And so on.

So is it implemented in the neon-specific storage, somehow, or where/how
does neon issue the prefetch requests?

Concerning index scan we have implemented two different approaches: for
index only scan we try to prefetch leave pages and for index scan we
prefetch referenced heap pages.

In my experience the IOS handling (only prefetching leaf pages) is very
limiting, and may easily lead to index-only scans being way slower than
regular index scans. Which is super surprising for users. It's why I
ended up improving the prefetcher to optionally look at the VM etc.

In both cases we start from prefetch distance 0 and increase it until it
reaches `effective_io_concurrency` for this relation. Doing so we try to
avoid prefetching of useless pages and slowdown of "point" lookups
returning one or few records.

Right, the regular prefetch ramp-up. My patch does the same thing.

If you are interested, you can look at our implementation in neon repo:
all source are available. But briefly speaking, each backend has its own
prefetch ring (prefetch requests which are waiting for response). The
key idea is that we can send several prefetch requests to page server
and then receive multiple replies. It allows to increased speed of OLAP
queries up to 10 times.

Can you point me to the actual code / branch where it happens? I did
check the github repo, but I don't see anything relevant in the default
branch (REL_15_STABLE_neon). There are some "prefetch" branches, but
those seem abandoned.

Heikki thinks that prefetch can be somehow combined with async-io
proposal (based on io_uring). But right now they have nothing in common.

I can imagine async I/O being useful here, but I find the flow of I/O
requests is quite complex / goes through multiple layers. Or maybe I
just don't understand how it should work.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16

Konstantin Knizhnik

knizhnik@garret.ru

almost 2 years ago

In reply to: Tomas Vondra (#15)

Re: Custom explain options

On 13/01/2024 4:51 pm, Tomas Vondra wrote:

On 1/12/24 20:30, Konstantin Knizhnik wrote:

On 12/01/2024 7:03 pm, Tomas Vondra wrote:

On 10/21/23 14:16, Konstantin Knizhnik wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about page
prefetching during query execution (prefetching is more critical for
Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching layer
above shared buffers which is used to somehow replace file system cache
in standalone Postgres).

Not quite related to this patch about EXPLAIN options, but can you share
some details how you implemented prefetching for the other nodes?

I'm asking because I've been working on prefetching for index scans, so
I'm wondering if there's a better way to do this, or how to do it in a
way that would allow neon to maybe leverage that too.

regards

Yes, I am looking at your PR. What we have implemented in Neon is more
specific to Neon architecture where storage is separated from compute.
So each page not found in shared buffers has to be downloaded from page
server. It adds quite noticeable latency, because of network roundtrip.
While vanilla Postgres can rely on OS file system cache when page is not
found in shared buffer (access to OS file cache is certainly slower than
to shared buffers
because of syscall and copying of page, but performance penaly is not
very large - less than 15%), Neon has no local files and so has to send
request to the socket.

This is why we have to perform aggressive prefetching whenever it is
possible (when it it is possible to predict order of subsequent pages).
Unlike vanilla Postgres which implements prefetch only for bitmap heap
scan, we have implemented it for seqscan, index scan, indexonly scan,
bitmap heap scan, vacuum, pg_prewarm.
The main difference between Neon prefetch and vanilla Postgres prefetch
is that first one is backend specific. So each backend prefetches only
pages which it needs.
This is why we have to rewrite prefetch for bitmap heap scan, which is
using `fadvise` and assumes that pages prefetched by one backend in file
cache, can be used by any other backend.

I do understand why prefetching is important in neon (likely more than
for core postgres). I'm interested in how it's actually implemented,
whether it's somehow similar to how my patch does things or in some
different (perhaps neon-specific way), and if the approaches are
different then what are the pros/cons. And so on.

So is it implemented in the neon-specific storage, somehow, or where/how
does neon issue the prefetch requests?

Neon mostly preservers Postgres prefetch mechanism, so we are using
PrefetchBuffer which checks if page is present in shared buffers
and if not - calls smgrprefetch. We are using own storage manager
implementation which instead of reading pages from local disk, download
them from page server.
And prefetch implementation in Neon storager manager is obviously also
different from one in vanilla Postgres which uses posix_fadvise.
Neon prefetch implementation inserts prefetch request in ring buffer and
sends it to the server. When read operation is performed we check if
there is correspondent prefetch request in ring buffer and if so - waits
its completion.

As I already wrote - prefetch is done locally for each backend. And each
backend has its own connection with page server. It can be changed in
future when we implement multiplexing of page server connections. But
right now prefetch is local. And certainly prefetch can improve
performance only if we correctly predict subsequent page requests.
If not - then page server does useless jobs and backend has to waity and
consume all issues prefetch requests. This is why in prefetch
implementation for most of nodes we start with minimal prefetch
distance and then increase it. It allows to perform prefetch only for
such queries where it is really efficient (OLAP) and doesn't degrade
performance of simple OLTP queries.

Out prefetch implementation is also compatible with parallel plans, but
here we need to preserve some range of pages for each parallel workers
instead of picking page from some shared queue on demand. It is one of
the major difference with Postgres prefetch using posix_fadvise: each
backend shoudl prefetch only those pages which it will going to read.

Concerning index scan we have implemented two different approaches: for
index only scan we try to prefetch leave pages and for index scan we
prefetch referenced heap pages.

In my experience the IOS handling (only prefetching leaf pages) is very
limiting, and may easily lead to index-only scans being way slower than
regular index scans. Which is super surprising for users. It's why I
ended up improving the prefetcher to optionally look at the VM etc.

Well, my assumption was the following: prefetch is most efficient for
OLAP queries.
Although HTAP (hybrid transactional/analytical processing) is popular
trend now,
classical model is that analytic queries are performed on "historical"
data, which was already proceeded by vacuum and all-visible bits were
set in VM.
May be this assumption is wrong but it seems to me that if most heap
pages are not marked as all-visible, then optimizer should prefetch
bitmap scan to index-only scan.
And for combination of index and heap bitmap scans we can efficiently
prefetch both index and heap pages.

In both cases we start from prefetch distance 0 and increase it until it
reaches `effective_io_concurrency` for this relation. Doing so we try to
avoid prefetching of useless pages and slowdown of "point" lookups
returning one or few records.

Right, the regular prefetch ramp-up. My patch does the same thing.

If you are interested, you can look at our implementation in neon repo:
all source are available. But briefly speaking, each backend has its own
prefetch ring (prefetch requests which are waiting for response). The
key idea is that we can send several prefetch requests to page server
and then receive multiple replies. It allows to increased speed of OLAP
queries up to 10 times.

Can you point me to the actual code / branch where it happens? I did
check the github repo, but I don't see anything relevant in the default
branch (REL_15_STABLE_neon). There are some "prefetch" branches, but
those seem abandoned.

Implementation of prefetch mecnahism is in Neon extension:
https://github.com/neondatabase/neon/blob/60ced06586a6811470c16c6386daba79ffaeda13/pgxn/neon/pagestore_smgr.c#L205

But concrete implementation of prefetch for particular nodes is
certainly inside Postgres.
For example, if you are interested how it is implemented for index scan,
then please look at:
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L844
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1166
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1467
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1625
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L2629

Heikki thinks that prefetch can be somehow combined with async-io
proposal (based on io_uring). But right now they have nothing in common.

I can imagine async I/O being useful here, but I find the flow of I/O
requests is quite complex / goes through multiple layers. Or maybe I
just don't understand how it should work.

I also do not think that it will be possible to marry this two approaches.

#17

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 2 years ago

In reply to: Konstantin Knizhnik (#16)

Re: Custom explain options

On 1/13/24 17:13, Konstantin Knizhnik wrote:

On 13/01/2024 4:51 pm, Tomas Vondra wrote:

On 1/12/24 20:30, Konstantin Knizhnik wrote:

On 12/01/2024 7:03 pm, Tomas Vondra wrote:

On 10/21/23 14:16, Konstantin Knizhnik wrote:

Hi hackers,

EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
COST,...) which help to provide useful details of query execution.
In Neon we have added PREFETCH option which shows information about
page
prefetching during query execution (prefetching is more critical for
Neon
architecture because of separation of compute and storage, so it is
implemented not only for bitmap heap scan as in Vanilla Postgres, but
also for seqscan, indexscan and indexonly scan). Another possible
candidate for explain options is local file cache (extra caching
layer
above shared buffers which is used to somehow replace file system
cache
in standalone Postgres).

Not quite related to this patch about EXPLAIN options, but can you
share
some details how you implemented prefetching for the other nodes?

I'm asking because I've been working on prefetching for index scans, so
I'm wondering if there's a better way to do this, or how to do it in a
way that would allow neon to maybe leverage that too.

regards

Yes, I am looking at your PR. What we have implemented in Neon is more
specific to Neon architecture where storage is separated from compute.
So each page not found in shared buffers has to be downloaded from page
server. It adds quite noticeable latency, because of network roundtrip.
While vanilla Postgres can rely on OS file system cache when page is not
found in shared buffer (access to OS file cache is certainly slower than
to shared buffers
because of syscall and copying of page, but performance penaly is not
very large - less than 15%), Neon has no local files and so has to send
request to the socket.

This is why we have to perform aggressive prefetching whenever it is
possible (when it it is possible to predict order of subsequent pages).
Unlike vanilla Postgres which implements prefetch only for bitmap heap
scan, we have implemented it for seqscan, index scan, indexonly scan,
bitmap heap scan, vacuum, pg_prewarm.
The main difference between Neon prefetch and vanilla Postgres prefetch
is that first one is backend specific. So each backend prefetches only
pages which it needs.
This is why we have to rewrite prefetch for bitmap heap scan, which is
using `fadvise` and assumes that pages prefetched by one backend in file
cache, can be used by any other backend.

I do understand why prefetching is important in neon (likely more than
for core postgres). I'm interested in how it's actually implemented,
whether it's somehow similar to how my patch does things or in some
different (perhaps neon-specific way), and if the approaches are
different then what are the pros/cons. And so on.

So is it implemented in the neon-specific storage, somehow, or where/how
does neon issue the prefetch requests?

Neon mostly preservers Postgres prefetch mechanism, so we are using
PrefetchBuffer which checks if page is present in shared buffers
and if not - calls smgrprefetch. We are using own storage manager
implementation which instead of reading pages from local disk, download
them from page server.
And prefetch implementation in Neon storager manager is obviously also
different from one in vanilla Postgres which uses posix_fadvise.
Neon prefetch implementation inserts prefetch request in ring buffer and
sends it to the server. When read operation is performed we check if
there is correspondent prefetch request in ring buffer and if so - waits
its completion.

Thanks. Sure, neon has to use some custom prefetch implementation,
considering not posix_fadvise, considering there's no local page cache
in the architecture.

The thing that was not clear to me is who decides what to prefetch,
which code issues the prefetch requests etc. In the github links you
shared I see it happens in the index AM code (in nbtsearch.c).

That's interesting, because that's what my first prefetching patch did
too - not the same way, ofc, but in the same layer. Simply because it
seemed like the simplest way to do that. But the feedback was that's the
wrong layer, and that it should happen in the executor. And I agree with
that - the reasons are somewhere in the other thread.

Based on what I saw in the neon code, I think it should be possible for
neon to use "my" approach too, but that only works for the index scans,
ofc. Not sure what to do about the other places.

As I already wrote - prefetch is done locally for each backend. And each
backend has its own connection with page server. It can be changed in
future when we implement multiplexing of page server connections. But
right now prefetch is local. And certainly prefetch can improve
performance only if we correctly predict subsequent page requests.
If not - then page server does useless jobs and backend has to waity and
consume all issues prefetch requests. This is why in prefetch
implementation for most of nodes we start with minimal prefetch
distance and then increase it. It allows to perform prefetch only for
such queries where it is really efficient (OLAP) and doesn't degrade
performance of simple OLTP queries.

Not sure I understand what's so important about prefetches being "local"
for each backend. I mean even in postgres each backend prefetches it's
own buffers, no matter what the other backends do. Although, neon
probably doesn't have the cross-backend sharing through shared buffers
etc. right?

FWIW I certainly agree with the goal to not harm queries that can't
benefit from prefetching. Ramping-up the prefetch distance is something
my patch does too, for exactly this reason.

Out prefetch implementation is also compatible with parallel plans, but
here we need to preserve some range of pages for each parallel workers
instead of picking page from some shared queue on demand. It is one of
the major difference with Postgres prefetch using posix_fadvise: each
backend shoudl prefetch only those pages which it will going to read.

Understood. I have no opinion on this, though.

Concerning index scan we have implemented two different approaches: for
index only scan we try to prefetch leave pages and for index scan we
prefetch referenced heap pages.

In my experience the IOS handling (only prefetching leaf pages) is very
limiting, and may easily lead to index-only scans being way slower than
regular index scans. Which is super surprising for users. It's why I
ended up improving the prefetcher to optionally look at the VM etc.

Well, my assumption was the following: prefetch is most efficient for
OLAP queries.
Although HTAP (hybrid transactional/analytical processing) is popular
trend now,
classical model is that analytic queries are performed on "historical"
data, which was already proceeded by vacuum and all-visible bits were
set in VM.
May be this assumption is wrong but it seems to me that if most heap
pages are not marked as all-visible, then optimizer should prefetch
bitmap scan to index-only scan.

I think this assumption is generally reasonable, but it hinges on the
assumption that OLAP queries have most indexes recently vacuumed and
all-visible. I'm not sure it's wise to rely on that.

Without prefetching it's not that important - the worst thing that would
happen is that the IOS degrades into regular index-scan. But with
prefetching these plans can "invert" with respect to cost.

I'm not saying it's terrible or that IOS must have prefetching, but I
think it's something users may run into fairly often. And it led me to
rework the prefetching so that IOS can prefetch too ...

And for combination of index and heap bitmap scans we can efficiently
prefetch both index and heap pages.

In both cases we start from prefetch distance 0 and increase it until it
reaches `effective_io_concurrency` for this relation. Doing so we try to
avoid prefetching of useless pages and slowdown of "point" lookups
returning one or few records.

Right, the regular prefetch ramp-up. My patch does the same thing.

If you are interested, you can look at our implementation in neon repo:
all source are available. But briefly speaking, each backend has its own
prefetch ring (prefetch requests which are waiting for response). The
key idea is that we can send several prefetch requests to page server
and then receive multiple replies. It allows to increased speed of OLAP
queries up to 10 times.

Can you point me to the actual code / branch where it happens? I did
check the github repo, but I don't see anything relevant in the default
branch (REL_15_STABLE_neon). There are some "prefetch" branches, but
those seem abandoned.

Implementation of prefetch mecnahism is in Neon extension:
https://github.com/neondatabase/neon/blob/60ced06586a6811470c16c6386daba79ffaeda13/pgxn/neon/pagestore_smgr.c#L205

But concrete implementation of prefetch for particular nodes is
certainly inside Postgres.
For example, if you are interested how it is implemented for index scan,
then please look at:
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L844
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1166
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1467
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1625
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L2629

Thanks! Very helpful. As I said, I ended up moving the prefetching to
the executor. For indexscans I think it should be possible for neon to
benefit from that (in a way, it doesn't need to do anything except for
overriding what PrefetchBuffer does). Not sure about the other places
where neon needs to prefetch, I don't have ambition to rework those.

Heikki thinks that prefetch can be somehow combined with async-io
proposal (based on io_uring). But right now they have nothing in common.

I can imagine async I/O being useful here, but I find the flow of I/O
requests is quite complex / goes through multiple layers. Or maybe I
just don't understand how it should work.

I also do not think that it will be possible to marry this two approaches.

I didn't actually say it would be impossible - I think it seems like a
use case where async I/O should be a natural fit. But I'm not sure to do
that in a way that would not be super confusing and/or fragile when
something unexpected happens (like a rescan, or maybe some change to the
index structure - page split, etc.)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18

Konstantin Knizhnik

knizhnik@garret.ru

almost 2 years ago

In reply to: Tomas Vondra (#17)

Re: Custom explain options

On 14/01/2024 11:47 pm, Tomas Vondra wrote:

The thing that was not clear to me is who decides what to prefetch,
which code issues the prefetch requests etc. In the github links you
shared I see it happens in the index AM code (in nbtsearch.c).

It is up to the particular plan node (seqscan, indexscan,...) which
pages to prefetch.

That's interesting, because that's what my first prefetching patch did
too - not the same way, ofc, but in the same layer. Simply because it
seemed like the simplest way to do that. But the feedback was that's the
wrong layer, and that it should happen in the executor. And I agree with
that - the reasons are somewhere in the other thread.

I read the arguments in

/messages/by-id/8c86c3a6-074e-6c88-3e7e-9452b6a37b9b@enterprisedb.com

Separating prefetch info in index scan descriptor is really good idea.
It will be amazing to have generic prefetch mechanism for all indexes.
But unfortunately I do not understand how it is possible. The logic of
index traversal is implemented inside AM. Executor doesn't know it.
For example for B-Tree scan we can prefetch:

- intermediate pages
- leave pages
- referenced by TID heap pages

Before we load next intermediate page, we do not know next leave pages.
And before we load next leave page, we can not find out TIDs from this page.

Another challenge - is how far we should prefetch (as far as I
understand both your and our approach using dynamically extended
prefetch window)

Based on what I saw in the neon code, I think it should be possible for
neon to use "my" approach too, but that only works for the index scans,
ofc. Not sure what to do about the other places.

We definitely need prefetch for heap scan (it gives the most advantages
in performance), for vacuum and also for pg_prewarm. Also I tried to
implement it for custom indexes such as pg_vector. I still not sure
whether it is possible to create some generic solution which will work
for all indexes.

I have also tried to implement alternative approach for prefetch based
on access statistic.
It comes from use case of seqscan of table with larger toasted records.
So for each record we have to extract its TOAST data.
It is done using standard index scan, but unfortunately index prefetch
doesn't help much here: there is usually just one TOAST segment and so
prefetch just have no chance to do something useful. But as far as heap
records are accessed sequentially, there is good chance that toast table
will also be accessed mostly sequentially. So we just can count number
of sequential requests to each relation and if ratio or seq/rand
accesses is above some threshold we can prefetch next pages of this
relation. This is really universal approach but ... working mostly for
TOAST table.

As I already wrote - prefetch is done locally for each backend. And each
backend has its own connection with page server. It can be changed in
future when we implement multiplexing of page server connections. But
right now prefetch is local. And certainly prefetch can improve
performance only if we correctly predict subsequent page requests.
If not - then page server does useless jobs and backend has to waity and
consume all issues prefetch requests. This is why in prefetch
implementation for most of nodes we start with minimal prefetch
distance and then increase it. It allows to perform prefetch only for
such queries where it is really efficient (OLAP) and doesn't degrade
performance of simple OLTP queries.

Not sure I understand what's so important about prefetches being "local"
for each backend. I mean even in postgres each backend prefetches it's
own buffers, no matter what the other backends do. Although, neon
probably doesn't have the cross-backend sharing through shared buffers
etc. right?

Sorry if my explanation was not clear:(

I mean even in postgres each backend prefetches it's own buffers, no matter what the other backends do.

This is exactly the difference. In Neon such approach doesn't work.
Each backend maintains it's own prefetch ring. And if prefetched page was not actually received, then the whole pipe is lost.
I.e. backend prefetched pages 1,5,10. Then it need to read page 2. So it has to consume responses for 1,5,10 and issue another request for page 2.
Instead of improving speed we are just doing extra job.
So each backend should prefetch only those pages which it is actually going to read.
This is why prefetch approach used in Postgres for example for parallel bitmap heap scan doesn't work for Neon.
If you do `posic_fadvise` then prefetched page is placed in OS cache and can be used by any parallel worker.
But in Neon each parallel worker should be given its own range of pages to scan and prefetch only this pages.

Well, my assumption was the following: prefetch is most efficient forOLAP queries.
Although HTAP (hybrid transactional/analytical processing) is popular
trend now,
classical model is that analytic queries are performed on "historical"
data, which was already proceeded by vacuum and all-visible bits were
set in VM.
May be this assumption is wrong but it seems to me that if most heap
pages are not marked as all-visible, then optimizer should prefetch
bitmap scan to index-only scan.

I think this assumption is generally reasonable, but it hinges on the
assumption that OLAP queries have most indexes recently vacuumed and
all-visible. I'm not sure it's wise to rely on that.

Without prefetching it's not that important - the worst thing that would
happen is that the IOS degrades into regular index-scan.

I think that it is also problem without prefetch. There are cases where
seqscan or bitmap heap scan are really much faster then IOS because last
one has to perform a lot of visibility checks. Yes, certainly optimizer
takes in account percent of all-visible pages.But with it is not tricial
to adjust optimizer parameters so that it can really choose fastest plan.

But withprefetching these plans can "invert" with respect to cost.

I'm not saying it's terrible or that IOS must have prefetching, but I
think it's something users may run into fairly often. And it led me to
rework the prefetching so that IOS can prefetch too ...

I think that inspecting VM for prefetch is really good idea.

Thanks! Very helpful. As I said, I ended up moving the prefetching to
the executor. For indexscans I think it should be possible for neon to
benefit from that (in a way, it doesn't need to do anything except for
overriding what PrefetchBuffer does). Not sure about the other places
where neon needs to prefetch, I don't have ambition to rework those.

Once your PR will be merged, I will rewrite Neon prefetch implementation
fopr indexces using your approach.

#19

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 2 years ago

In reply to: Konstantin Knizhnik (#18)

Re: Custom explain options

On 1/15/24 15:22, Konstantin Knizhnik wrote:

On 14/01/2024 11:47 pm, Tomas Vondra wrote:

The thing that was not clear to me is who decides what to prefetch,
which code issues the prefetch requests etc. In the github links you
shared I see it happens in the index AM code (in nbtsearch.c).

It is up to the particular plan node (seqscan, indexscan,...) which
pages to prefetch.

That's interesting, because that's what my first prefetching patch did
too - not the same way, ofc, but in the same layer. Simply because it
seemed like the simplest way to do that. But the feedback was that's the
wrong layer, and that it should happen in the executor. And I agree with
that - the reasons are somewhere in the other thread.

I read the arguments in

/messages/by-id/8c86c3a6-074e-6c88-3e7e-9452b6a37b9b@enterprisedb.com

Separating prefetch info in index scan descriptor is really good idea.
It will be amazing to have generic prefetch mechanism for all indexes.
But unfortunately I do not understand how it is possible. The logic of
index traversal is implemented inside AM. Executor doesn't know it.
For example for B-Tree scan we can prefetch:

- intermediate pages
- leave pages
- referenced by TID heap pages

My patch does not care about prefetching internal index pages. Yes, it's
a limitation, but my assumption is the internal pages are maybe 0.1% of
the index, and typically very hot / cached. Yes, if the index is not
used very often, this may be untrue. But I consider it a possible future
improvement, for some other patch. FWIW there's a prefetching patch for
inserts into indexes (which only prefetches just the index leaf pages).

Before we load next intermediate page, we do not know next leave pages.
And before we load next leave page, we can not find out TIDs from this
page.

Not sure I understand what this is about. The patch simply calls the
index AM function index_getnext_tid() enough times to fill the prefetch
queue. It does not prefetch the next index leaf page, it however does
prefetch the heap pages. It does not "stall" at the boundary of the
index leaf page, or something.

Another challenge - is how far we should prefetch (as far as I
understand both your and our approach using dynamically extended
prefetch window)

By dynamic extension of prefetch window you mean the incremental growth
of the prefetch distance from 0 to effective_io_concurrency? I don't
think there's a better solution.

There might be additional information that we could consider (e.g.
expected number of rows for the plan, earlier executions of the scan,
...) but each of these has a failure more.

Based on what I saw in the neon code, I think it should be possible for
neon to use "my" approach too, but that only works for the index scans,
ofc. Not sure what to do about the other places.

We definitely need prefetch for heap scan (it gives the most advantages
in performance), for vacuum and also for pg_prewarm. Also I tried to
implement it for custom indexes such as pg_vector. I still not sure
whether it is possible to create some generic solution which will work
for all indexes.

I haven't tried with pgvector, but I don't see why my patch would not
work for all index AMs that cna return TID.

I have also tried to implement alternative approach for prefetch based
on access statistic.
It comes from use case of seqscan of table with larger toasted records.
So for each record we have to extract its TOAST data.
It is done using standard index scan, but unfortunately index prefetch
doesn't help much here: there is usually just one TOAST segment and so
prefetch just have no chance to do something useful. But as far as heap
records are accessed sequentially, there is good chance that toast table
will also be accessed mostly sequentially. So we just can count number
of sequential requests to each relation and if ratio or seq/rand
accesses is above some threshold we can prefetch next pages of this
relation. This is really universal approach but ... working mostly for
TOAST table.

Are you're talking about what works / doesn't work in neon, or about
postgres in general?

I'm not sure what you mean by "one TOAST segment" and I'd also guess
that if both tables are accessed mostly sequentially, the read-ahead
will do most of the work (in postgres).

It's probably true that as we do a separate index scan for each TOAST-ed
value, that can't really ramp-up the prefetch distance fast enough.
Maybe we could have a mode where we start with the full distance?

As I already wrote - prefetch is done locally for each backend. And each
backend has its own connection with page server. It can be changed in
future when we implement multiplexing of page server connections. But
right now prefetch is local. And certainly prefetch can improve
performance only if we correctly predict subsequent page requests.
If not - then page server does useless jobs and backend has to waity and
consume all issues prefetch requests. This is why in prefetch
implementation for most of nodes we start with minimal prefetch
distance and then increase it. It allows to perform prefetch only for
such queries where it is really efficient (OLAP) and doesn't degrade
performance of simple OLTP queries.

Not sure I understand what's so important about prefetches being "local"
for each backend. I mean even in postgres each backend prefetches it's
own buffers, no matter what the other backends do. Although, neon
probably doesn't have the cross-backend sharing through shared buffers
etc. right?

Sorry if my explanation was not clear:(

I mean even in postgres each backend prefetches it's own buffers, no
matter what the other backends do.

This is exactly the difference. In Neon such approach doesn't work.
Each backend maintains it's own prefetch ring. And if prefetched page
was not actually received, then the whole pipe is lost.
I.e. backend prefetched pages 1,5,10. Then it need to read page 2. So it
has to consume responses for 1,5,10 and issue another request for page 2.
Instead of improving speed we are just doing extra job.
So each backend should prefetch only those pages which it is actually
going to read.
This is why prefetch approach used in Postgres for example for parallel
bitmap heap scan doesn't work for Neon.
If you do `posic_fadvise` then prefetched page is placed in OS cache and
can be used by any parallel worker.
But in Neon each parallel worker should be given its own range of pages
to scan and prefetch only this pages.

I still don't quite see/understand the difference. I mean, even in
postgres each backend does it's own prefetches, using it's own prefetch
ring. But I'm not entirely sure about the neon architecture differences.

Does this mean neon can do prefetching from the executor in principle?

Could you perhaps describe a situation where the bitmap can prefetching
(as implemented in Postgres) does not work for neon?

Well, my assumption was the following: prefetch is most efficient
forOLAP queries.
Although HTAP (hybrid transactional/analytical processing) is popular
trend now,
classical model is that analytic queries are performed on "historical"
data, which was already proceeded by vacuum and all-visible bits were
set in VM.
May be this assumption is wrong but it seems to me that if most heap
pages are not marked as all-visible, then optimizer should prefetch
bitmap scan to index-only scan.

I think this assumption is generally reasonable, but it hinges on the
assumption that OLAP queries have most indexes recently vacuumed and
all-visible. I'm not sure it's wise to rely on that.

Without prefetching it's not that important - the worst thing that would
happen is that the IOS degrades into regular index-scan.

I think that it is also problem without prefetch. There are cases where
seqscan or bitmap heap scan are really much faster then IOS because last
one has to perform a lot of visibility checks. Yes, certainly optimizer
takes in account percent of all-visible pages.But with it is not tricial
to adjust optimizer parameters so that it can really choose fastest plan.

True. There's more cases where it can happen, no doubt about it. But I
think those cases are somewhat less likely.

But withprefetching these plans can "invert" with respect to cost.

I'm not saying it's terrible or that IOS must have prefetching, but I
think it's something users may run into fairly often. And it led me to
rework the prefetching so that IOS can prefetch too ...

I think that inspecting VM for prefetch is really good idea.

Thanks! Very helpful. As I said, I ended up moving the prefetching to
the executor. For indexscans I think it should be possible for neon to
benefit from that (in a way, it doesn't need to do anything except for
overriding what PrefetchBuffer does). Not sure about the other places
where neon needs to prefetch, I don't have ambition to rework those.

Once your PR will be merged, I will rewrite Neon prefetch implementation
fopr indexces using your approach.

Well, maybe you could try doing rewriting it now, so that you can give
some feedback to the patch. I'd appreciate that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20

Konstantin Knizhnik

knizhnik@garret.ru

almost 2 years ago

In reply to: Tomas Vondra (#19)

Re: Custom explain options

On 15/01/2024 5:08 pm, Tomas Vondra wrote:

My patch does not care about prefetching internal index pages. Yes, it's
a limitation, but my assumption is the internal pages are maybe 0.1% of
the index, and typically very hot / cached. Yes, if the index is not
used very often, this may be untrue. But I consider it a possible future
improvement, for some other patch. FWIW there's a prefetching patch for
inserts into indexes (which only prefetches just the index leaf pages).

We have to prefetch pages at height-1 level (parents of leave pages) for
IOS because otherwise prefetch pipeline is broken at each transition to
next leave page.
When we start with new leave patch we have to fill prefetch ring from
the scratch which certainly has negative impact on performance.

Not sure I understand what this is about. The patch simply calls the
index AM function index_getnext_tid() enough times to fill the prefetch
queue. It does not prefetch the next index leaf page, it however does
prefetch the heap pages. It does not "stall" at the boundary of the
index leaf page, or something.

Ok, now I fully understand your approach. Looks really elegant and works
for all indexes.
There is still issue with IOS and seqscan.

Another challenge - is how far we should prefetch (as far as I
understand both your and our approach using dynamically extended
prefetch window)

By dynamic extension of prefetch window you mean the incremental growth
of the prefetch distance from 0 to effective_io_concurrency?

Yes

I don't
think there's a better solution.

I tried one more solution: propagate information about expected number
of fetched rows to AM. Based on this information it is possible to
choose proper prefetch distance.
Certainly it is not quote precise: we can scan large number rows but
filter only few of them. This is why this approach was not committed in
Neon.
But I still think that using statistics for determining prefetch window
is not so bad idea. May be it needs better thinking.

There might be additional information that we could consider (e.g.
expected number of rows for the plan, earlier executions of the scan,
...) but each of these has a failure more.

I wrote reply above before reading next fragment:)
So I have already tried it.

I haven't tried with pgvector, but I don't see why my patch would not
work for all index AMs that cna return TID.

Yes, I agree. But it will be efficient only if getting next TIS is
cheap - it is located on the same leaf page.

I have also tried to implement alternative approach for prefetch based
on access statistic.
It comes from use case of seqscan of table with larger toasted records.
So for each record we have to extract its TOAST data.
It is done using standard index scan, but unfortunately index prefetch
doesn't help much here: there is usually just one TOAST segment and so
prefetch just have no chance to do something useful. But as far as heap
records are accessed sequentially, there is good chance that toast table
will also be accessed mostly sequentially. So we just can count number
of sequential requests to each relation and if ratio or seq/rand
accesses is above some threshold we can prefetch next pages of this
relation. This is really universal approach but ... working mostly for
TOAST table.

Are you're talking about what works / doesn't work in neon, or about
postgres in general?

I'm not sure what you mean by "one TOAST segment" and I'd also guess
that if both tables are accessed mostly sequentially, the read-ahead
will do most of the work (in postgres).

Yes, I agree: in case of vanilla Postgres OS will do read-ahead. But not
in Neon.
By one TOAST segment I mean "one TOAST record - 2kb.

It's probably true that as we do a separate index scan for each TOAST-ed
value, that can't really ramp-up the prefetch distance fast enough.
Maybe we could have a mode where we start with the full distance?

Sorry, I do not understand. Especially in this case large prefetch
window is undesired.
Most of records fits in 2kb, so we need to fetch onely one head (TOAST)
record per TOAST index search.

This is exactly the difference. In Neon such approach doesn't work.
Each backend maintains it's own prefetch ring. And if prefetched page
was not actually received, then the whole pipe is lost.
I.e. backend prefetched pages 1,5,10. Then it need to read page 2. So it
has to consume responses for 1,5,10 and issue another request for page 2.
Instead of improving speed we are just doing extra job.
So each backend should prefetch only those pages which it is actually
going to read.
This is why prefetch approach used in Postgres for example for parallel
bitmap heap scan doesn't work for Neon.
If you do `posic_fadvise` then prefetched page is placed in OS cache and
can be used by any parallel worker.
But in Neon each parallel worker should be given its own range of pages
to scan and prefetch only this pages.

I still don't quite see/understand the difference. I mean, even in
postgres each backend does it's own prefetches, using it's own prefetch
ring. But I'm not entirely sure about the neon architecture differences

I am not speaking about your approach. It will work with Neon as well.
I am describing why implementation of prefetch for heap bitmap scan
doesn't work for Neon:
it issues prefetch requests for pages which never accessed by this
parallel worker.

Does this mean neon can do prefetching from the executor in principle?

Could you perhaps describe a situation where the bitmap can prefetching
(as implemented in Postgres) does not work for neon?

I am speaking about prefetch implementation in nodeBitmpapHeapScan.
Prefetch iterator is not synced with normal iterator, i.e. they can
return different pages.

Well, maybe you could try doing rewriting it now, so that you can give
some feedback to the patch. I'd appreciate that.

I will try.

Best regards,
Konstantin

#21

Tomas Vondra

tomas.vondra@enterprisedb.com

almost 2 years ago

In reply to: Konstantin Knizhnik (#20)

Re: Custom explain options

On 1/15/24 21:42, Konstantin Knizhnik wrote:

On 15/01/2024 5:08 pm, Tomas Vondra wrote:

My patch does not care about prefetching internal index pages. Yes, it's
a limitation, but my assumption is the internal pages are maybe 0.1% of
the index, and typically very hot / cached. Yes, if the index is not
used very often, this may be untrue. But I consider it a possible future
improvement, for some other patch. FWIW there's a prefetching patch for
inserts into indexes (which only prefetches just the index leaf pages).

We have to prefetch pages at height-1 level (parents of leave pages) for
IOS because otherwise prefetch pipeline is broken at each transition to
next leave page.
When we start with new leave patch we have to fill prefetch ring from
the scratch which certainly has negative impact on performance.

By "broken" you mean that you prefetch items only from a single leaf
page, so immediately after reading the next one nothing is prefetched.
Correct? Yeah, I had this problem initially too, when I did the
prefetching in the index AM code. One of the reasons why it got moved to
the executor.

Not sure I understand what this is about. The patch simply calls the
index AM function index_getnext_tid() enough times to fill the prefetch
queue. It does not prefetch the next index leaf page, it however does
prefetch the heap pages. It does not "stall" at the boundary of the
index leaf page, or something.

Ok, now I fully understand your approach. Looks really elegant and works
for all indexes.
There is still issue with IOS and seqscan.

Not sure. For seqscan, I think this has nothing to do with it. Postgres
relies on read-ahad to do the work - of course, if that doesn't work
(e.g. for async/direct I/O that'd be the case), an improvement will be
needed. But it's unrelated to this patch, and I'm certainly not saying
this patch does that. I think Thomas/Andres did some work on that.

For IOS, I think the limitation that this does not prefetch any index
pages (especially the leafs) is there, and it'd be nice to do something
about it. But I see it as a separate thing, which I think does need to
happen in the index AM layer (not in the executor).

Another challenge - is how far we should prefetch (as far as I
understand both your and our approach using dynamically extended
prefetch window)

By dynamic extension of prefetch window you mean the incremental growth
of the prefetch distance from 0 to effective_io_concurrency?

Yes

I don't
think there's a better solution.

I tried one more solution: propagate information about expected number
of fetched rows to AM. Based on this information it is possible to
choose proper prefetch distance.
Certainly it is not quote precise: we can scan large number rows but
filter only few of them. This is why this approach was not committed in
Neon.
But I still think that using statistics for determining prefetch window
is not so bad idea. May be it needs better thinking.

I don't think we should rely on this information too much. It's far too
unreliable - especially the planner estimates. The run-time data may be
more accurate, but I'm worried it may be quite variable (e.g. for
different runs of the scan).

My position is to keep this as simple as possible, and prefer to be more
conservative when possible - that is, shorter prefetch distances. In my
experience the benefit of prefetching is subject to diminishing returns,
i.e. going from 0 => 16 is way bigger difference than 16 => 32. So
better to stick with lower value instead of wasting resources.

There might be additional information that we could consider (e.g.
expected number of rows for the plan, earlier executions of the scan,
...) but each of these has a failure more.

I wrote reply above before reading next fragment:)
So I have already tried it.

I haven't tried with pgvector, but I don't see why my patch would not
work for all index AMs that cna return TID.

Yes, I agree. But it will be efficient only if getting next TIS is
cheap - it is located on the same leaf page.

Maybe. I haven't tried/thought about it, but yes - if it requires doing
a lot of work in between the prefetches, the benefits of prefetching
will diminish naturally. Might be worth doing some experiments.

I have also tried to implement alternative approach for prefetch based
on access statistic.
It comes from use case of seqscan of table with larger toasted records.
So for each record we have to extract its TOAST data.
It is done using standard index scan, but unfortunately index prefetch
doesn't help much here: there is usually just one TOAST segment and so
prefetch just have no chance to do something useful. But as far as heap
records are accessed sequentially, there is good chance that toast table
will also be accessed mostly sequentially. So we just can count number
of sequential requests to each relation and if ratio or seq/rand
accesses is above some threshold we can prefetch next pages of this
relation. This is really universal approach but ... working mostly for
TOAST table.

Are you're talking about what works / doesn't work in neon, or about
postgres in general?

I'm not sure what you mean by "one TOAST segment" and I'd also guess
that if both tables are accessed mostly sequentially, the read-ahead
will do most of the work (in postgres).

Yes, I agree: in case of vanilla Postgres OS will do read-ahead. But not
in Neon.
By one TOAST segment I mean "one TOAST record - 2kb.

Ah, you mean "TOAST chunk". Yes, if a record fits into a single TOAST
chunk, my prefetch won't work. Not sure what to do for neon ...

It's probably true that as we do a separate index scan for each TOAST-ed
value, that can't really ramp-up the prefetch distance fast enough.
Maybe we could have a mode where we start with the full distance?

Sorry, I do not understand. Especially in this case large prefetch
window is undesired.
Most of records fits in 2kb, so we need to fetch onely one head (TOAST)
record per TOAST index search.

Yeah, I was confused what you mean by "segment". My point was that if a
value is TOAST-ed into multiple chunks, maybe we should allow more
aggressive prefetching instead of the slow ramp-up ...

But yeah, if there's just one TOAST chunk, that does not help.

This is exactly the difference. In Neon such approach doesn't work.
Each backend maintains it's own prefetch ring. And if prefetched page
was not actually received, then the whole pipe is lost.
I.e. backend prefetched pages 1,5,10. Then it need to read page 2. So it
has to consume responses for 1,5,10 and issue another request for
page 2.
Instead of improving speed we are just doing extra job.
So each backend should prefetch only those pages which it is actually
going to read.
This is why prefetch approach used in Postgres for example for parallel
bitmap heap scan doesn't work for Neon.
If you do `posic_fadvise` then prefetched page is placed in OS cache and
can be used by any parallel worker.
But in Neon each parallel worker should be given its own range of pages
to scan and prefetch only this pages.

I still don't quite see/understand the difference. I mean, even in
postgres each backend does it's own prefetches, using it's own prefetch
ring. But I'm not entirely sure about the neon architecture differences

I am not speaking about your approach. It will work with Neon as well.
I am describing why implementation of prefetch for heap bitmap scan
doesn't work for Neon:
it issues prefetch requests for pages which never accessed by this
parallel worker.

Does this mean neon can do prefetching from the executor in principle?

Could you perhaps describe a situation where the bitmap can prefetching
(as implemented in Postgres) does not work for neon?

I am speaking about prefetch implementation in nodeBitmpapHeapScan.
Prefetch iterator is not synced with normal iterator, i.e. they can
return different pages.

Ah, now I think I understand. The workers don't share memory, so the
pages prefetched by one worker are wasted if some other worker ends up
processing them.

Well, maybe you could try doing rewriting it now, so that you can give
some feedback to the patch. I'd appreciate that.

I will try.

Thanks!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22

Konstantin Knizhnik

knizhnik@garret.ru

almost 2 years ago

In reply to: Tomas Vondra (#21)

Re: Custom explain options

On 16/01/2024 5:38 pm, Tomas Vondra wrote:

By "broken" you mean that you prefetch items only from a single leaf
page, so immediately after reading the next one nothing is prefetched.
Correct?

Yes, exactly. It means that reading first heap page from next leaf page
will be done without prefetch which in case of Neon means roundtrip with
page server (~0.2msec within one data center).

Yeah, I had this problem initially too, when I did the
prefetching in the index AM code. One of the reasons why it got moved to
the executor.

Yeh, it works nice for vanilla Postgres. You call index_getnext_tid()
and when it reaches end of leaf page it reads next read page. Because of
OS read-ahead this read is expected to be fast even without prefetch.
But not in Neon case - we have to download this page from page server
(see above). So ideal solution for Neon will be to prefetch both leave
pages and referenced heap pages. And prefetch of last one should be
initiated as soon as leaf page is loaded. Unfortunately it is
non-trivial to implement and current index scan prefetch implementation
for Neon is not doing it.

#23

David G. Johnston

david.g.johnston@gmail.com

almost 2 years ago

In reply to: Konstantin Knizhnik (#22)

Re: Custom explain options

Came across this while looking for patches to review. IMO this thread has
been hijacked to the point of being not useful for the subject. I suggest
this discussion regarding prefetch move to its own thread and this thread
and commitfest entry be ended/returned with feedback.

Also IMO, the commitfest is not for early stage idea patches. The stuff on
there that is ready for review should at least be thought of by the
original author as something they would be willing to commit. I suggest
you post the most recent patch and a summary of the discussion to a new
thread that hopefully won't be hijacked. Consistent and on-topic replies
will keep the topic front-and-center on the lists until a patch is ready
for consideration.

David J.

#24

Pavel Stehule

pavel.stehule@gmail.com

over 1 year ago

In reply to: Konstantin Knizhnik (#22)

Re: Custom explain options

po 22. 7. 2024 v 17:08 odesílatel Konstantin Knizhnik <knizhnik@garret.ru>
napsal:

On 16/01/2024 5:38 pm, Tomas Vondra wrote:

By "broken" you mean that you prefetch items only from a single leaf

page, so immediately after reading the next one nothing is prefetched.
Correct?

Yes, exactly. It means that reading first heap page from next leaf page
will be done without prefetch which in case of Neon means roundtrip with
page server (~0.2msec within one data center).

Yeah, I had this problem initially too, when I did the
prefetching in the index AM code. One of the reasons why it got moved to
the executor.

Yeh, it works nice for vanilla Postgres. You call index_getnext_tid() and
when it reaches end of leaf page it reads next read page. Because of OS
read-ahead this read is expected to be fast even without prefetch. But not
in Neon case - we have to download this page from page server (see above).
So ideal solution for Neon will be to prefetch both leave pages and
referenced heap pages. And prefetch of last one should be initiated as soon
as leaf page is loaded. Unfortunately it is non-trivial to implement and
current index scan prefetch implementation for Neon is not doing it.

What is the current state of this patch - it is abandoned? It needs a
rebase.

Regards

Pavel