PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

Started by Tomas Vondra · over 1 year ago · 8 messages
#1 Tomas Vondra
tomas@vondra.me
4 attachment(s)

Hi,

I'm getting back to work on the index prefetching patch [1], but one
annoying aspect of that patch is that it's limited to the context of a
single executor node. It can be very effective when there's an index
scan with many matches for a key, but it's entirely useless for plans
with many tiny index scans.

For example, consider a plan like:

Nested Loop
-> ... some scan of a "fact" table ...
-> Index Scan on PK of a "dimension" table

For this the index prefetching is entirely useless - there'll be just
one match for each outer row.

But there still is opportunity for prefetching - we could look ahead in
the outer relation (which can be an arbitrary node feeding the nestloop),
and request the index scan to prefetch the matching tuples.

Of course, this is not something the nestloop can do on its own, it
would require support from the executor. For some nodes prefetching is
pretty straightforward (e.g. index scan), for other nodes it can be
impossible or at least very hard / too expensive.

I was a bit bored over the weekend, so I decided to experiment a bit and
see how difficult this would be, and how much it could gain. Attached is
an early PoC version of that patch - it's very limited (essentially just
NL + inner index scan), but it seems to be working. It's only about 250
insertions, so tiny.

The patch does this:
--------------------

1) ExecPrefetch executor callback

This call is meant to do the actual prefetching - the parent node sets
everything up almost as if for ExecProcNode(), but does not expect the
actual result. The child either does some prefetching or nothing.

2) ExecSupportsPrefetch to identify what nodes accept ExecPrefetch()

This simply says if a given node supports prefetching. The only place
calling this is the nested loop, to enable prefetching only for nested
loops with (parameterized) index scans.

3) ExecPrefetchIndexScan doing prefetching in index scans

This is just a trivial IndexNext() variant, getting TIDs and calling
PrefetchBuffer() on them. Right now it just prefetches everything, but
that seems wrong - this is where the original index prefetching patch
should kick in.

4) ExecNestLoop changes

This is where the actual magic happens - if the inner child knows how to
prefetch stuff (per ExecSupportsPrefetch), this switches to reading
batches of outer slots, and calls ExecPrefetch() on them. Right now the
batch size is hardcoded to 32, but it might use effective_io_concurrency
or something like that. It's a bit naive in other aspects too - it
always reads and prefetches the whole batch at once, instead of ramping
up and then consuming and prefetching slots one by one. Good enough for
PoC, but probably needs improvements.

5) adds enable_nestloop_prefetch to enable/disable this easily

benchmark
---------

Of course, the main promise of this is faster queries, so I did a simple
benchmark, with a query like this:

SELECT * FROM fact_table f JOIN dimension d ON (f.id = d.id)
WHERE f.r < 0.0001;

The "r" is simply a random value, which allows selecting an arbitrary
fraction of the large fact table "f". Here it selects 0.01%, so ~10k rows
from the 100M-row table. Dimensions have 10M rows. See the .sql file attached.

For a variable number of dimensions (1 to 4) the results look like this:

  prefetch       1       2       3       4
  ----------------------------------------
  off         3260    6193    8993   11980
  on          2145    3897    5531    7352
  ----------------------------------------
               66%     63%     62%     61%

This is on "cold" data, with a restart + drop caches between runs. The
results suggest the prefetching cuts query time to about 60-66% of the
baseline. I was hoping for more, but not bad for a PoC; chances are it can
be improved.

I just noticed there are a couple of failures in the regression tests if I
change the GUC to "true" by default. I haven't looked into that yet, but
I guess there's some mistake in resetting the child node, or something
like that. Will investigate.

What I find a bit annoying is the amount of processing that has to happen
twice - once for the prefetching, once for the actual execution.
In particular, this needs to set up all the PARAM_EXEC slots as if the
inner plan was to do regular execution.

The other thing is that while ExecPrefetchIndexScan() only prefetches
the heap page, it still needs to navigate the index to the leaf page. If
the index is huge, that may require I/O. But we'd have to pay that cost
shortly after anyway. It just isn't asynchronous.

One minor detail is that I ran into some issues with tuple slots. I need
a bunch of them to stash the slots received from the outer plan, so I
created a couple of slots with TTSOpsVirtual. And that mostly works, except
that later ExecInterpExpr() happens to call slot_getsomeattrs() and that
fails because tts_virtual_getsomeattrs() says:

elog(ERROR, "getsomeattrs is not required to be called on a virtual
tuple table slot");

OK, that call is not necessary for virtual slots, it's a no-op. But it's
not me calling it, the interpreter does that for some reason. I did
comment that error out in the patch, but I wonder what's the proper way
to make this work ...

regards

[1]: /messages/by-id/cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com

--
Tomas Vondra

Attachments:

0001-nestloop-prefetch-initial-PoC.patch (text/x-patch)
From a30e56a76e48c9ca69e7552a821e49bd0b6605a6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 26 Aug 2024 01:13:56 +0200
Subject: [PATCH] nestloop prefetch - initial PoC

---
 src/backend/executor/execAmi.c       |  41 +++++++++
 src/backend/executor/execTuples.c    |   3 +-
 src/backend/executor/nodeIndexscan.c |  67 ++++++++++++++
 src/backend/executor/nodeNestloop.c  | 131 ++++++++++++++++++++++++++-
 src/backend/utils/misc/guc_tables.c  |  10 ++
 src/include/executor/executor.h      |   2 +
 src/include/executor/nodeIndexscan.h |   1 +
 src/include/nodes/execnodes.h        |   7 ++
 src/include/optimizer/cost.h         |   1 +
 9 files changed, 257 insertions(+), 6 deletions(-)

diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 3289e3e0219..8d336c713a2 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -594,6 +594,28 @@ ExecSupportsBackwardScan(Plan *node)
 	}
 }
 
+/*
+ * ExecSupportsPrefetch - does a plan type support prefetching?
+ *
+ * For now only plain index scans do, we could extend that to IOS. Not sure
+ * about other plans.
+ */
+bool
+ExecSupportsPrefetch(Plan *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_IndexScan:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
 /*
  * An IndexScan or IndexOnlyScan node supports backward scan only if the
  * index's AM does.
@@ -651,3 +673,22 @@ ExecMaterializesOutput(NodeTag plantype)
 
 	return false;
 }
+
+
+/*
+ * ExecPrefetch
+ */
+void
+ExecPrefetch(PlanState *node)
+{
+	switch (nodeTag(node))
+	{
+		case T_IndexScanState:
+			ExecPrefetchIndexScan((IndexScanState *) node);
+			break;
+
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 00dc3396156..9337913d899 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -129,7 +129,8 @@ tts_virtual_clear(TupleTableSlot *slot)
 static void
 tts_virtual_getsomeattrs(TupleTableSlot *slot, int natts)
 {
-	elog(ERROR, "getsomeattrs is not required to be called on a virtual tuple table slot");
+	// we don't know if the slot is virtual or not
+	// elog(ERROR, "getsomeattrs is not required to be called on a virtual tuple table slot");
 }
 
 /*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 8000feff4c9..325230ae1d8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1729,3 +1729,70 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 					 node->iss_ScanKeys, node->iss_NumScanKeys,
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 }
+
+void
+ExecPrefetchIndexScan(IndexScanState *node)
+{
+	EState	   *estate;
+	ScanDirection direction;
+	IndexScanDesc scandesc;
+	ItemPointer tid;
+
+	/*
+	 * extract necessary information from index scan node
+	 */
+	estate = node->ss.ps.state;
+
+	/*
+	 * Determine which direction to scan the index in based on the plan's scan
+	 * direction and the current direction of execution.
+	 */
+	direction = ScanDirectionCombine(estate->es_direction,
+									 ((IndexScan *) node->ss.ps.plan)->indexorderdir);
+	scandesc = node->iss_ScanDesc;
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the index scan is not parallel, or if we're
+		 * serially executing an index scan that was planned to be parallel.
+		 */
+		scandesc = index_beginscan(node->ss.ss_currentRelation,
+								   node->iss_RelationDesc,
+								   estate->es_snapshot,
+								   node->iss_NumScanKeys,
+								   node->iss_NumOrderByKeys);
+
+		node->iss_ScanDesc = scandesc;
+
+		/*
+		 * If no run-time keys to calculate or they are ready, go ahead and
+		 * pass the scankeys to the index AM.
+		 */
+		if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+			index_rescan(scandesc,
+						 node->iss_ScanKeys, node->iss_NumScanKeys,
+						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	}
+
+	/*
+	 * XXX This should probably prefetch only a limited number of tuples for
+	 * each key, not all of them - the index can easily have millions of them
+	 * for some keys. And prefetching more of those items would be subject
+	 * to the "regular" index prefetching, if ever implemented.
+	 *
+	 * XXX The one question is how to communicate back how many items would
+	 * be prefetched. If we find a key with 1M TIDs, it probably does not
+	 * make much sense to prefetch further in nestloop, because the 1M will
+	 * likely thrash the cache anyway.
+	 *
+	 * XXX This should consider how many items we actually need. For semi
+	 * join we only need the first one, for example.
+	 */
+	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		PrefetchBuffer(scandesc->heapRelation, MAIN_FORKNUM, ItemPointerGetBlockNumber(tid));
+	}
+}
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 7f4bf6c4dbb..c40acc50b83 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -24,7 +24,9 @@
 #include "executor/execdebug.h"
 #include "executor/nodeNestloop.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 
+bool              enable_nestloop_prefetch = false;
 
 /* ----------------------------------------------------------------
  *		ExecNestLoop(node)
@@ -105,15 +107,110 @@ ExecNestLoop(PlanState *pstate)
 		if (node->nl_NeedNewOuter)
 		{
 			ENL1_printf("getting new outer tuple");
-			outerTupleSlot = ExecProcNode(outerPlan);
 
 			/*
-			 * if there are no more outer tuples, then the join is complete..
+			 * Without prefetching, get the next slot from the outer plan
+			 * directly. With prefetching get the next slot from the batch,
+			 * or fill the batch if needed.
 			 */
-			if (TupIsNull(outerTupleSlot))
+			if (node->nl_PrefetchCount == 0)	/* no prefetching */
 			{
-				ENL1_printf("no outer tuple, ending join");
-				return NULL;
+				outerTupleSlot = ExecProcNode(outerPlan);
+
+				/*
+				 * if there are no more outer tuples, then the join is complete.
+				 */
+				if (TupIsNull(outerTupleSlot))
+				{
+					ENL1_printf("no outer tuple, ending join");
+					return NULL;
+				}
+			}
+			else								/* prefetching */
+			{
+
+				/*
+				 * No more slots available in the queue - try to load from
+				 * the outer plan, unless we've already reached the end.
+				 */
+				if ((node->nl_PrefetchNext == node->nl_PrefetchUsed) &&
+					(!node->nl_PrefetchDone))
+				{
+					/* reset */
+					node->nl_PrefetchNext = 0;
+					node->nl_PrefetchUsed = 0;
+
+					while (node->nl_PrefetchUsed < node->nl_PrefetchCount)
+					{
+						outerTupleSlot = ExecProcNode(outerPlan);
+
+						/*
+						 * if there are no more outer tuples, then the join is complete.
+						 */
+						if (TupIsNull(outerTupleSlot))
+						{
+							node->nl_PrefetchDone = true;
+							break;
+						}
+
+						ExecClearTuple(node->nl_PrefetchSlots[node->nl_PrefetchUsed]);
+
+						ExecCopySlot(node->nl_PrefetchSlots[node->nl_PrefetchUsed],
+									 outerTupleSlot);
+
+						ENL1_printf("prefetching inner node");
+						econtext->ecxt_outertuple = node->nl_PrefetchSlots[node->nl_PrefetchUsed];
+
+						/*
+						 * fetch the values of any outer Vars that must be passed to the
+						 * inner scan, and store them in the appropriate PARAM_EXEC slots.
+						 */
+						foreach(lc, nl->nestParams)
+						{
+							NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
+							int			paramno = nlp->paramno;
+							ParamExecData *prm;
+
+							prm = &(econtext->ecxt_param_exec_vals[paramno]);
+							/* Param value should be an OUTER_VAR var */
+							Assert(IsA(nlp->paramval, Var));
+							Assert(nlp->paramval->varno == OUTER_VAR);
+							Assert(nlp->paramval->varattno > 0);
+							prm->value = slot_getattr(outerTupleSlot,
+													  nlp->paramval->varattno,
+													  &(prm->isnull));
+							/* Flag parameter value as changed */
+							innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
+																 paramno);
+						}
+
+						Assert(node->nl_PrefetchSlots[node->nl_PrefetchUsed]->tts_nvalid > 0);
+
+						/*
+						 * now prefetch the inner plan
+						 */
+						ENL1_printf("prefetching inner plan");
+						ExecReScan(innerPlan);
+						ExecPrefetch(innerPlan);
+
+						node->nl_PrefetchUsed++;
+					}
+				}
+
+				/*
+				 * Now we should either have a slot in the queue, or know
+				 * that we've exhausted the outer side.
+				 */
+				if (node->nl_PrefetchNext == node->nl_PrefetchUsed)
+				{
+					Assert(node->nl_PrefetchDone);
+					ENL1_printf("no outer tuple, ending join");
+					return NULL;
+				}
+
+				/* get the next slot from the queue */
+				outerTupleSlot = node->nl_PrefetchSlots[node->nl_PrefetchNext++];
+				Assert(outerTupleSlot->tts_nvalid > 0);
 			}
 
 			ENL1_printf("saving new outer tuple information");
@@ -345,6 +442,24 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	nlstate->nl_NeedNewOuter = true;
 	nlstate->nl_MatchedOuter = false;
 
+	/* with inner plan supporting prefetching, initialize the batch */
+	nlstate->nl_PrefetchCount = 0;
+	if (enable_nestloop_prefetch &&
+		ExecSupportsPrefetch(innerPlan(node)))
+	{
+		/* batch of 32 slots seems about right for now */
+#define NL_PREFETCH_BATCH_SIZE 32
+		nlstate->nl_PrefetchCount = NL_PREFETCH_BATCH_SIZE;
+		nlstate->nl_PrefetchSlots = palloc0(sizeof(TupleTableSlot *) * nlstate->nl_PrefetchCount);
+
+		for (int i = 0; i < nlstate->nl_PrefetchCount; i++)
+		{
+			nlstate->nl_PrefetchSlots[i]
+				= MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(nlstate)),
+										   &TTSOpsVirtual);
+		}
+	}
+
 	NL1_printf("ExecInitNestLoop: %s\n",
 			   "node initialized");
 
@@ -369,6 +484,12 @@ ExecEndNestLoop(NestLoopState *node)
 	ExecEndNode(outerPlanState(node));
 	ExecEndNode(innerPlanState(node));
 
+	/* cleanup batch of prefetch slots */
+	for (int i = 0; i < node->nl_PrefetchCount; i++)
+	{
+		ExecDropSingleTupleTableSlot(node->nl_PrefetchSlots[i]);
+	}
+
 	NL1_printf("ExecEndNestLoop: %s\n",
 			   "node processing ended");
 }
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index af227b1f248..0d5d1c2a600 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -879,6 +879,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_nestloop_prefetch", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables prefetching in nested-loop join plans."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_nestloop_prefetch,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_mergejoin", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of merge join plans."),
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 69c3ebff00a..05ab4cae4a0 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -107,7 +107,9 @@ extern void ExecMarkPos(PlanState *node);
 extern void ExecRestrPos(PlanState *node);
 extern bool ExecSupportsMarkRestore(struct Path *pathnode);
 extern bool ExecSupportsBackwardScan(Plan *node);
+extern bool ExecSupportsPrefetch(Plan *node);
 extern bool ExecMaterializesOutput(NodeTag plantype);
+extern void ExecPrefetch(PlanState *node);
 
 /*
  * prototypes from functions in execCurrent.c
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 3cddece67c8..114ca94352e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -28,6 +28,7 @@ extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pc
 extern void ExecIndexScanReInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
 extern void ExecIndexScanInitializeWorker(IndexScanState *node,
 										  ParallelWorkerContext *pwcxt);
+extern void ExecPrefetchIndexScan(IndexScanState *node);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index af7d8fd1e72..ddd9f1f3332 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2131,6 +2131,13 @@ typedef struct NestLoopState
 	bool		nl_NeedNewOuter;
 	bool		nl_MatchedOuter;
 	TupleTableSlot *nl_NullInnerTupleSlot;
+
+	/* for prefetch of batch items */
+	int			nl_PrefetchCount;		/* maximum number of queued slots */
+	int			nl_PrefetchUsed;		/* current number of queued slots */
+	int			nl_PrefetchNext;		/* next slot to return from queue */
+	bool		nl_PrefetchDone;		/* no more outer slots */
+	TupleTableSlot **nl_PrefetchSlots;	/* array of virtual slots */
 } NestLoopState;
 
 /* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 854a782944a..3303677c05b 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT bool enable_sort;
 extern PGDLLIMPORT bool enable_incremental_sort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
+extern PGDLLIMPORT bool enable_nestloop_prefetch;
 extern PGDLLIMPORT bool enable_material;
 extern PGDLLIMPORT bool enable_memoize;
 extern PGDLLIMPORT bool enable_mergejoin;
-- 
2.46.0

nestloop.sql (application/sql)
nestloop.csv (text/csv)
nestloop.sh (application/x-shellscript)
#2 Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#1)
1 attachment(s)
Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

On 8/26/24 18:06, Tomas Vondra wrote:

I just noticed there are a couple of failures in the regression tests if I
change the GUC to "true" by default. I haven't looked into that yet, but
I guess there's some mistake in resetting the child node, or something
like that. Will investigate.

Turned out to be a silly bug - not resetting the queue on rescan. Fixed,
and also removed two not-quite-correct asserts. No other changes.

regards

--
Tomas Vondra

Attachments:

v20240827-0001-nestloop-prefetch-initial-PoC.patch (text/x-patch)
From b8ac5834802e4eb621ed920e8e7dd306ee843995 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 26 Aug 2024 01:13:56 +0200
Subject: [PATCH v20240827] nestloop prefetch - initial PoC

---
 src/backend/executor/execAmi.c         |  41 ++++++++
 src/backend/executor/execTuples.c      |   3 +-
 src/backend/executor/nodeIndexscan.c   |  67 +++++++++++++
 src/backend/executor/nodeNestloop.c    | 133 ++++++++++++++++++++++++-
 src/backend/utils/misc/guc_tables.c    |  10 ++
 src/include/executor/executor.h        |   2 +
 src/include/executor/nodeIndexscan.h   |   1 +
 src/include/nodes/execnodes.h          |   7 ++
 src/include/optimizer/cost.h           |   1 +
 src/test/regress/expected/sysviews.out |   3 +-
 10 files changed, 261 insertions(+), 7 deletions(-)

diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 3289e3e0219..8d336c713a2 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -594,6 +594,28 @@ ExecSupportsBackwardScan(Plan *node)
 	}
 }
 
+/*
+ * ExecSupportsPrefetch - does a plan type support prefetching?
+ *
+ * For now only plain index scans do, we could extend that to IOS. Not sure
+ * about other plans.
+ */
+bool
+ExecSupportsPrefetch(Plan *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_IndexScan:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
 /*
  * An IndexScan or IndexOnlyScan node supports backward scan only if the
  * index's AM does.
@@ -651,3 +673,22 @@ ExecMaterializesOutput(NodeTag plantype)
 
 	return false;
 }
+
+
+/*
+ * ExecPrefetch
+ */
+void
+ExecPrefetch(PlanState *node)
+{
+	switch (nodeTag(node))
+	{
+		case T_IndexScanState:
+			ExecPrefetchIndexScan((IndexScanState *) node);
+			break;
+
+		default:
+			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			break;
+	}
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 00dc3396156..9337913d899 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -129,7 +129,8 @@ tts_virtual_clear(TupleTableSlot *slot)
 static void
 tts_virtual_getsomeattrs(TupleTableSlot *slot, int natts)
 {
-	elog(ERROR, "getsomeattrs is not required to be called on a virtual tuple table slot");
+	// we don't know if the slot is virtual or not
+	// elog(ERROR, "getsomeattrs is not required to be called on a virtual tuple table slot");
 }
 
 /*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 8000feff4c9..325230ae1d8 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1729,3 +1729,70 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
 					 node->iss_ScanKeys, node->iss_NumScanKeys,
 					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
 }
+
+void
+ExecPrefetchIndexScan(IndexScanState *node)
+{
+	EState	   *estate;
+	ScanDirection direction;
+	IndexScanDesc scandesc;
+	ItemPointer tid;
+
+	/*
+	 * extract necessary information from index scan node
+	 */
+	estate = node->ss.ps.state;
+
+	/*
+	 * Determine which direction to scan the index in based on the plan's scan
+	 * direction and the current direction of execution.
+	 */
+	direction = ScanDirectionCombine(estate->es_direction,
+									 ((IndexScan *) node->ss.ps.plan)->indexorderdir);
+	scandesc = node->iss_ScanDesc;
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the index scan is not parallel, or if we're
+		 * serially executing an index scan that was planned to be parallel.
+		 */
+		scandesc = index_beginscan(node->ss.ss_currentRelation,
+								   node->iss_RelationDesc,
+								   estate->es_snapshot,
+								   node->iss_NumScanKeys,
+								   node->iss_NumOrderByKeys);
+
+		node->iss_ScanDesc = scandesc;
+
+		/*
+		 * If no run-time keys to calculate or they are ready, go ahead and
+		 * pass the scankeys to the index AM.
+		 */
+		if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+			index_rescan(scandesc,
+						 node->iss_ScanKeys, node->iss_NumScanKeys,
+						 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	}
+
+	/*
+	 * XXX This should probably prefetch only a limited number of tuples for
+	 * each key, not all of them - the index can easily have millions of them
+	 * for some keys. And prefetching more of those items would be subject
+	 * to the "regular" index prefetching, if ever implemented.
+	 *
+	 * XXX The one question is how to communicate back how many items would
+	 * be prefetched. If we find a key with 1M TIDs, it probably does not
+	 * make much sense to prefetch further in nestloop, because the 1M will
+	 * likely thrash the cache anyway.
+	 *
+	 * XXX This should consider how many items we actually need. For semi
+	 * join we only need the first one, for example.
+	 */
+	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		PrefetchBuffer(scandesc->heapRelation, MAIN_FORKNUM, ItemPointerGetBlockNumber(tid));
+	}
+}
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 7f4bf6c4dbb..2349a6de085 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -24,7 +24,9 @@
 #include "executor/execdebug.h"
 #include "executor/nodeNestloop.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 
+bool              enable_nestloop_prefetch = false;
 
 /* ----------------------------------------------------------------
  *		ExecNestLoop(node)
@@ -105,15 +107,107 @@ ExecNestLoop(PlanState *pstate)
 		if (node->nl_NeedNewOuter)
 		{
 			ENL1_printf("getting new outer tuple");
-			outerTupleSlot = ExecProcNode(outerPlan);
 
 			/*
-			 * if there are no more outer tuples, then the join is complete..
+			 * Without prefetching, get the next slot from the outer plan
+			 * directly. With prefetching get the next slot from the batch,
+			 * or fill the batch if needed.
 			 */
-			if (TupIsNull(outerTupleSlot))
+			if (node->nl_PrefetchCount == 0)	/* no prefetching */
 			{
-				ENL1_printf("no outer tuple, ending join");
-				return NULL;
+				outerTupleSlot = ExecProcNode(outerPlan);
+
+				/*
+				 * if there are no more outer tuples, then the join is complete.
+				 */
+				if (TupIsNull(outerTupleSlot))
+				{
+					ENL1_printf("no outer tuple, ending join");
+					return NULL;
+				}
+			}
+			else								/* prefetching */
+			{
+
+				/*
+				 * No more slots available in the queue - try to load from
+				 * the outer plan, unless we've already reached the end.
+				 */
+				if ((node->nl_PrefetchNext == node->nl_PrefetchUsed) &&
+					(!node->nl_PrefetchDone))
+				{
+					/* reset */
+					node->nl_PrefetchNext = 0;
+					node->nl_PrefetchUsed = 0;
+
+					while (node->nl_PrefetchUsed < node->nl_PrefetchCount)
+					{
+						outerTupleSlot = ExecProcNode(outerPlan);
+
+						/*
+						 * if there are no more outer tuples, then the join is complete.
+						 */
+						if (TupIsNull(outerTupleSlot))
+						{
+							node->nl_PrefetchDone = true;
+							break;
+						}
+
+						ExecClearTuple(node->nl_PrefetchSlots[node->nl_PrefetchUsed]);
+
+						ExecCopySlot(node->nl_PrefetchSlots[node->nl_PrefetchUsed],
+									 outerTupleSlot);
+
+						ENL1_printf("prefetching inner node");
+						econtext->ecxt_outertuple = node->nl_PrefetchSlots[node->nl_PrefetchUsed];
+
+						/*
+						 * fetch the values of any outer Vars that must be passed to the
+						 * inner scan, and store them in the appropriate PARAM_EXEC slots.
+						 */
+						foreach(lc, nl->nestParams)
+						{
+							NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
+							int			paramno = nlp->paramno;
+							ParamExecData *prm;
+
+							prm = &(econtext->ecxt_param_exec_vals[paramno]);
+							/* Param value should be an OUTER_VAR var */
+							Assert(IsA(nlp->paramval, Var));
+							Assert(nlp->paramval->varno == OUTER_VAR);
+							Assert(nlp->paramval->varattno > 0);
+							prm->value = slot_getattr(node->nl_PrefetchSlots[node->nl_PrefetchUsed],
+													  nlp->paramval->varattno,
+													  &(prm->isnull));
+							/* Flag parameter value as changed */
+							innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
+																 paramno);
+						}
+
+						/*
+						 * now prefetch the inner plan
+						 */
+						ENL1_printf("prefetching inner plan");
+						ExecReScan(innerPlan);
+						ExecPrefetch(innerPlan);
+
+						node->nl_PrefetchUsed++;
+					}
+				}
+
+				/*
+				 * Now we should either have a slot in the queue, or know
+				 * that we've exhausted the outer side.
+				 */
+				if (node->nl_PrefetchNext == node->nl_PrefetchUsed)
+				{
+					Assert(node->nl_PrefetchDone);
+					ENL1_printf("no outer tuple, ending join");
+					return NULL;
+				}
+
+				/* get the next slot from the queue */
+				outerTupleSlot = node->nl_PrefetchSlots[node->nl_PrefetchNext++];
 			}
 
 			ENL1_printf("saving new outer tuple information");
@@ -345,6 +439,24 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	nlstate->nl_NeedNewOuter = true;
 	nlstate->nl_MatchedOuter = false;
 
+	/* with inner plan supporting prefetching, initialize the batch */
+	nlstate->nl_PrefetchCount = 0;
+	if (enable_nestloop_prefetch &&
+		ExecSupportsPrefetch(innerPlan(node)))
+	{
+		/* batch of 32 slots seems about right for now */
+#define NL_PREFETCH_BATCH_SIZE 32
+		nlstate->nl_PrefetchCount = NL_PREFETCH_BATCH_SIZE;
+		nlstate->nl_PrefetchSlots = palloc0(sizeof(TupleTableSlot *) * nlstate->nl_PrefetchCount);
+
+		for (int i = 0; i < nlstate->nl_PrefetchCount; i++)
+		{
+			nlstate->nl_PrefetchSlots[i]
+				= MakeSingleTupleTableSlot(ExecGetResultType(outerPlanState(nlstate)),
+										   &TTSOpsVirtual);
+		}
+	}
+
 	NL1_printf("ExecInitNestLoop: %s\n",
 			   "node initialized");
 
@@ -369,6 +481,12 @@ ExecEndNestLoop(NestLoopState *node)
 	ExecEndNode(outerPlanState(node));
 	ExecEndNode(innerPlanState(node));
 
+	/* cleanup batch of prefetch slots */
+	for (int i = 0; i < node->nl_PrefetchCount; i++)
+	{
+		ExecDropSingleTupleTableSlot(node->nl_PrefetchSlots[i]);
+	}
+
 	NL1_printf("ExecEndNestLoop: %s\n",
 			   "node processing ended");
 }
@@ -397,4 +515,9 @@ ExecReScanNestLoop(NestLoopState *node)
 
 	node->nl_NeedNewOuter = true;
 	node->nl_MatchedOuter = false;
+
+	/* reset the prefetch batch too */
+	node->nl_PrefetchNext = 0;
+	node->nl_PrefetchUsed = 0;
+	node->nl_PrefetchDone = false;
 }
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index af227b1f248..781c3d29b99 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -879,6 +879,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_nestloop_prefetch", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables prefetching in nested-loop join plans."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_nestloop_prefetch,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_mergejoin", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of merge join plans."),
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 69c3ebff00a..05ab4cae4a0 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -107,7 +107,9 @@ extern void ExecMarkPos(PlanState *node);
 extern void ExecRestrPos(PlanState *node);
 extern bool ExecSupportsMarkRestore(struct Path *pathnode);
 extern bool ExecSupportsBackwardScan(Plan *node);
+extern bool ExecSupportsPrefetch(Plan *node);
 extern bool ExecMaterializesOutput(NodeTag plantype);
+extern void ExecPrefetch(PlanState *node);
 
 /*
  * prototypes from functions in execCurrent.c
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 3cddece67c8..114ca94352e 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -28,6 +28,7 @@ extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pc
 extern void ExecIndexScanReInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
 extern void ExecIndexScanInitializeWorker(IndexScanState *node,
 										  ParallelWorkerContext *pwcxt);
+extern void ExecPrefetchIndexScan(IndexScanState *node);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index af7d8fd1e72..ddd9f1f3332 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2131,6 +2131,13 @@ typedef struct NestLoopState
 	bool		nl_NeedNewOuter;
 	bool		nl_MatchedOuter;
 	TupleTableSlot *nl_NullInnerTupleSlot;
+
+	/* for prefetch of batch items */
+	int			nl_PrefetchCount;		/* maximum number of queued slots */
+	int			nl_PrefetchUsed;		/* current number of queued slots */
+	int			nl_PrefetchNext;		/* next slot to return from queue */
+	bool		nl_PrefetchDone;		/* no more outer slots */
+	TupleTableSlot **nl_PrefetchSlots;	/* array of virtual slots */
 } NestLoopState;
 
 /* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 854a782944a..3303677c05b 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -58,6 +58,7 @@ extern PGDLLIMPORT bool enable_sort;
 extern PGDLLIMPORT bool enable_incremental_sort;
 extern PGDLLIMPORT bool enable_hashagg;
 extern PGDLLIMPORT bool enable_nestloop;
+extern PGDLLIMPORT bool enable_nestloop_prefetch;
 extern PGDLLIMPORT bool enable_material;
 extern PGDLLIMPORT bool enable_memoize;
 extern PGDLLIMPORT bool enable_mergejoin;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index fad7fc3a7e0..5ca6c16f3a0 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -161,6 +161,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_memoize                 | on
  enable_mergejoin               | on
  enable_nestloop                | on
+ enable_nestloop_prefetch       | on
  enable_parallel_append         | on
  enable_parallel_hash           | on
  enable_partition_pruning       | on
@@ -170,7 +171,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(22 rows)
+(23 rows)
 
 -- There are always wait event descriptions for various types.  InjectionPoint
 -- may be present or absent, depending on history since last postmaster start.
-- 
2.46.0

#3Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#1)
Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

Hi,

On 2024-08-26 18:06:04 +0200, Tomas Vondra wrote:

I'm getting back to work on the index prefetching patch [1], but one
annoying aspect of that patch is that it's limited to the context of a
single executor node. It can be very effective when there's an index
scan with many matches for a key, but it's entirely useless for plans
with many tiny index scans.

Right.

The patch does this:
--------------------

1) ExecPrefetch executor callback

This call is meant to do the actual prefetching - the parent node sets
everything up almost as if for ExecProcNode(), but does not expect the
actual result. The child either does some prefetching or nothing.

2) ExecSupportsPrefetch to identify what nodes accept ExecPrefetch()

This simply says if a given node supports prefetching. The only place
calling this is the nested loop, to enable prefetching only for nested
loops with (parameterized) index scans.

3) ExecPrefetchIndexScan doing prefetching in index scans

This is just a trivial IndexNext() variant, getting TIDs and calling
PrefetchBuffer() on them. Right now it just prefetches everything, but
that seems wrong - this is where the original index prefetching patch
should kick in.

4) ExecNestLoop changes

This is where the actual magic happens - if the inner child knows how to
prefetch stuff (per ExecSupportsPrefetch), this switches to reading
batches of outer slots, and calls ExecPrefetch() on them. Right now the
batch size is hardcoded to 32, but it might use effective_io_concurrency
or something like that. It's a bit naive in other aspects too - it
always reads and prefetches the whole batch at once, instead of ramping
up and then consuming and prefetching slots one by one. Good enough for
PoC, but probably needs improvements.

5) adds enable_nestloop_prefetch to enable/disable this easily

Hm. Doing this via more executor tree traversals doesn't seem optimal, that's
not exactly free. And because the prefetching can be beneficial even if there
are nodes above the inner parametrized index node, we IMO would want to
iterate through multiple node levels.

Have you considered instead expanding the parameterized scan logic? Right now
nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
that to allow nodes, e.g. nestloop in this case, to pass down multiple values
in one parameter? That'd e.g. allow passing down multiple rows to fetch from
nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
state tree. And it might be more powerful than just doing prefetching -
e.g. we could use one ScalarArrayOps scan in the index instead of doing a
separate scan for each of the to-be-prefetched values.

benchmark
---------

Of course, the main promise of this is faster queries, so I did a simple
benchmark, with a query like this:

SELECT * FROM fact_table f JOIN dimension d ON (f.id = d.id)
WHERE f.r < 0.0001;

The "r" is simply a random value, allowing to select arbitrary fraction
of the large fact table "f". Here it selects 0.01%, so ~10k rows from
100M table. Dimensions have 10M rows. See the .sql file attached.

For a variable number of dimensions (1 to 4) the results look like this:

prefetch        1       2       3       4
-----------------------------------------
off          3260    6193    8993   11980
on           2145    3897    5531    7352
-----------------------------------------
              66%     63%     62%     61%

This is on "cold" data, with a restart + drop caches between runs. The
results suggest the prefetching makes it roughly 1.5-1.6x faster. I was
hoping for more, but not bad for a PoC, chances are it can be improved.

I think that's indeed a pretty nice win.

One minor detail is that I ran into some issues with tuple slots. I need
a bunch of them to stash the slots received from the outer plan, so I
created a couple slots with TTSOpsVirtual. And that mostly works, except
that later ExecInterpExpr() happens to call slot_getsomeattrs() and that
fails because tts_virtual_getsomeattrs() says:

elog(ERROR, "getsomeattrs is not required to be called on a virtual
tuple table slot");

OK, that call is not necessary for virtual slots, it's a no-op. But it's
not me calling that, the interpreter does that for some reason. I did
comment that error out in the patch, but I wonder what's the proper way
to make this work ...

Hm, that's odd. slot_getsomeattrs() only calls tts_virtual_getsomeattrs() if
slot->tts_nvalid is smaller than the requested attnum. Which should never be
the case for a virtual slot. So I suspect there might be some bug leading to
wrong contents being stored in slots?

Greetings,

Andres Freund

#4Peter Geoghegan
In reply to: Andres Freund (#3)
Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

On Tue, Aug 27, 2024 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:

Have you considered instead expanding the parameterized scan logic? Right now
nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
that to allow nodes, e.g. nestloop in this case, to pass down multiple values
in one parameter? That'd e.g. allow passing down multiple rows to fetch from
nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
state tree.

This sounds a bit like block nested loop join.

And it might be more powerful than just doing prefetching -
e.g. we could use one ScalarArrayOps scan in the index instead of doing a
separate scan for each of the to-be-prefetched values.

ScalarArrayOps within nbtree are virtually the same thing as regular
index scans these days. That could make a big difference (perhaps this
is obvious).

One reason to do it this way is that it cuts down on index descent
costs and other executor overheads. But it is likely that it will
make prefetching itself more effective, too -- just because
prefetching will naturally end up with fewer, larger batches of
logically related work.

--
Peter Geoghegan

#5Tomas Vondra
tomas@vondra.me
In reply to: Andres Freund (#3)
Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

On 8/27/24 20:40, Andres Freund wrote:

Hi,

On 2024-08-26 18:06:04 +0200, Tomas Vondra wrote:

I'm getting back to work on the index prefetching patch [1], but one
annoying aspect of that patch is that it's limited to the context of a
single executor node. It can be very effective when there's an index
scan with many matches for a key, but it's entirely useless for plans
with many tiny index scans.

Right.

The patch does this:
--------------------

1) ExecPrefetch executor callback

This call is meant to do the actual prefetching - the parent node sets
everything up almost as if for ExecProcNode(), but does not expect the
actual result. The child either does some prefetching or nothing.

2) ExecSupportsPrefetch to identify what nodes accept ExecPrefetch()

This simply says if a given node supports prefetching. The only place
calling this is the nested loop, to enable prefetching only for nested
loops with (parameterized) index scans.

3) ExecPrefetchIndexScan doing prefetching in index scans

This is just a trivial IndexNext() variant, getting TIDs and calling
PrefetchBuffer() on them. Right now it just prefetches everything, but
that seems wrong - this is where the original index prefetching patch
should kick in.

4) ExecNestLoop changes

This is where the actual magic happens - if the inner child knows how to
prefetch stuff (per ExecSupportsPrefetch), this switches to reading
batches of outer slots, and calls ExecPrefetch() on them. Right now the
batch size is hardcoded to 32, but it might use effective_io_concurrency
or something like that. It's a bit naive in other aspects too - it
always reads and prefetches the whole batch at once, instead of ramping
up and then consuming and prefetching slots one by one. Good enough for
PoC, but probably needs improvements.

5) adds enable_nestloop_prefetch to enable/disable this easily

Hm. Doing this via more executor tree traversals doesn't seem optimal, that's
not exactly free.

Good point. I was wondering what the cost of the executor call might be
too, so I did a test with cached data (the results I presented in the
first message were with a restart + page cache drop before each query,
i.e. "best case" for prefetching).

If I run each query twice - uncached and cached - I get this:

           | prefetch=off   | prefetch=on    |   on / off
dimensions | cache  nocache | cache  nocache | cache  nocache
-------------------------------------------------------------
         1 |    61     3314 |    74     2172 |  121%     66%
         2 |   100     6327 |   129     3900 |  129%     62%
         3 |   137     9374 |   177     5637 |  129%     60%
         4 |   169    12211 |   225     7319 |  133%     60%

The columns at the end are (prefetch=on)/(prefetch=off). This shows that
for uncached data, we get ~40% speedup, while for cached it's ~30%
regression. That's not great, but where does the regression come from?

Per flamegraphs, the vast majority of that is due to doing btgettuple
twice - ExecPrefetchIndexScan simply does index_getnext_tid() too, just
like IndexNext().

If I remove that, leaving ExecPrefetchIndexScan() empty, the difference
entirely disappears. The difference is ~1%, maybe. So at least in this
case the overhead of traversal is quite negligible. I'm actually
surprised copying slots and building the parameters twice does not cause
a regression, but that's what I see.

And because the prefetching can be beneficial even if there
are nodes above the inner parametrized index node, we IMO would want
to iterate through multiple node levels.

I don't think we'd actually want that. It makes it very hard to
determine how far ahead to prefetch, because how would you know what the
child nodes are doing? I think it'd be easy to end up prefetching way
too much data. And the subnodes would likely need to do synchronous I/O
to do *their* prefetching anyway.

I mean, what if the inner path has another nestloop? Surely that needs
to get the outer tuple? If we get those tuples, should we prefetch the
matching inner tuples too? Wouldn't that mean we could be prefetching an
exponential number of tuples?

I honestly don't know - maybe there are cases where this makes sense,
but I'm not sure why that would be "incompatible" with something like
ExecutorPrefetch().

In any case, I think it'd be fine to have prefetching at least for
simple cases, where we know it can help. It wasn't my ambition to make
the whole executor somehow asynchronous ;-)

Have you considered instead expanding the parameterized scan logic? Right now
nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
that to allow nodes, e.g. nestloop in this case, to pass down multiple values
in one parameter? That'd e.g. allow passing down multiple rows to fetch from
nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
state tree. And it might be more powerful than just doing prefetching -
e.g. we could use one ScalarArrayOps scan in the index instead of doing a
separate scan for each of the to-be-prefetched values.

I have not, but it seems "batching" the prefetches in some way might be
a way to go. I'm not sure it'll be much more efficient (not just for
btree, what about other index AMs?).

But then that really starts to look like BNL - why would we even batch
prefetches and then do the rest row-by-row? We could just as well pass
down the batch to the index scan, and let it handle the prefetches.

That'd be much less about prefetching and more about allowing batching
for some nodes.

benchmark
---------

Of course, the main promise of this is faster queries, so I did a simple
benchmark, with a query like this:

SELECT * FROM fact_table f JOIN dimension d ON (f.id = d.id)
WHERE f.r < 0.0001;

The "r" is simply a random value, allowing to select arbitrary fraction
of the large fact table "f". Here it selects 0.01%, so ~10k rows from
100M table. Dimensions have 10M rows. See the .sql file attached.

For a variable number of dimensions (1 to 4) the results look like this:

prefetch        1       2       3       4
-----------------------------------------
off          3260    6193    8993   11980
on           2145    3897    5531    7352
-----------------------------------------
              66%     63%     62%     61%

This is on "cold" data, with a restart + drop caches between runs. The
results suggest the prefetching makes it roughly 1.5-1.6x faster. I was
hoping for more, but not bad for a PoC, chances are it can be improved.

I think that's indeed a pretty nice win.

Yeah. It's a bit of a "best case" workload, though.

One minor detail is that I ran into some issues with tuple slots. I need
a bunch of them to stash the slots received from the outer plan, so I
created a couple slots with TTSOpsVirtual. And that mostly works, except
that later ExecInterpExpr() happens to call slot_getsomeattrs() and that
fails because tts_virtual_getsomeattrs() says:

elog(ERROR, "getsomeattrs is not required to be called on a virtual
tuple table slot");

OK, that call is not necessary for virtual slots, it's a no-op. But it's
not me calling that, the interpreter does that for some reason. I did
comment that error out in the patch, but I wonder what's the proper way
to make this work ...

Hm, that's odd. slot_getsomeattrs() only calls tts_virtual_getsomeattrs() if
slot->tts_nvalid is smaller than the requested attnum. Which should never be
the case for a virtual slot. So I suspect there might be some bug leading to
wrong contents being stored in slots?

Hmmm, it seems I can no longer reproduce this :-( Chances are it was
happening because of some other bug that I fixed, and didn't realize it
was causing this too. Sorry :-/

regards

--
Tomas Vondra

#6Tomas Vondra
tomas@vondra.me
In reply to: Peter Geoghegan (#4)
Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

On 8/27/24 20:53, Peter Geoghegan wrote:

On Tue, Aug 27, 2024 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:

Have you considered instead expanding the parameterized scan logic? Right now
nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
that to allow nodes, e.g. nestloop in this case, to pass down multiple values
in one parameter? That'd e.g. allow passing down multiple rows to fetch from
nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
state tree.

This sounds a bit like block nested loop join.

Yeah.

And it might be more powerful than just doing prefetching -
e.g. we could use one ScalarArrayOps scan in the index instead of doing a
separate scan for each of the to-be-prefetched values.

ScalarArrayOps within nbtree are virtually the same thing as regular
index scans these days. That could make a big difference (perhaps this
is obvious).

One reason to do it this way is that it cuts down on index descent
costs and other executor overheads. But it is likely that it will
make prefetching itself more effective, too -- just because
prefetching will naturally end up with fewer, larger batches of
logically related work.

Perhaps. So nestloop would pass down multiple values, the inner subplan
would do whatever it wants (including prefetching), and then return the
matching rows, somehow? It's not very clear to me how we would return
the tuples for many matches, but it seems to shift the prefetching
closer to the "normal" index prefetching discussed elsewhere.

--
Tomas Vondra

#7Peter Geoghegan
In reply to: Tomas Vondra (#6)
Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

On Tue, Aug 27, 2024 at 6:44 PM Tomas Vondra <tomas@vondra.me> wrote:

One reason to do it this way is that it cuts down on index descent
costs and other executor overheads. But it is likely that it will
make prefetching itself more effective, too -- just because
prefetching will naturally end up with fewer, larger batches of
logically related work.

Perhaps.

I expect this to be particularly effective whenever there is naturally
occurring locality. I think that's fairly common. We'll sort the SAOP
array on the nbtree side, as we always do.

So nestloop would pass down multiple values, the inner subplan
would do whatever it wants (including prefetching), and then return the
matching rows, somehow?

Right.

It's not very clear to me how we would return
the tuples for many matches, but it seems to shift the prefetching
closer to the "normal" index prefetching discussed elsewhere.

It'll be necessary to keep track of which outer side rows relate to
which inner-side array values (within a given batch/block). Some new
data structure will be needed to manage that bookkeeping.

Currently, we deduplicate arrays for SAOP scans. I suppose that it
works that way because it's not really clear what it would mean for
the scan to have duplicate array keys. I don't see any need to change
that for block nested loop join/whatever this is. We would have to use
the new data structure to "pair up" outer side tuples with their
associated inner side result sets, at the end of processing each
batch/block. That way we avoid repeating the same inner index scan
within a given block/batch -- a little like with a memoize node.

Obviously, that's the case where we can exploit naturally occurring
locality most effectively -- the case where multiple duplicate inner
index scans are literally combined into only one. But, as I already
touched on, locality will be important in a variety of cases, not just
this one.

--
Peter Geoghegan

#8Tomas Vondra
tomas@vondra.me
In reply to: Peter Geoghegan (#7)
Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

FYI I've marked the CF entry as withdrawn.

I still think this (triggering prefetching in another executor node) has
a lot of potential, and not only in the context of a nested loop. But
the patch was just a quick & dirty experiment, I don't have time to work
on this right now. No point in keeping this in the CF app.

regards

--
Tomas Vondra