Shared detoast Datum proposal

Started by Andy Fanabout 2 years ago53 messages
#1Andy Fan
zhihuifan1213@163.com
1 attachment(s)

Problem:
--------
Toast works well for its claimed purposes. However the current detoast
infrastructure doesn't shared the detoast datum in the query's
lifespan, which will cause some inefficiency like this:

SELECT big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'
FROM t;

In the above case, the big_jsonb_col will be detoast 3 times for every
record.

a more common case maybe:

CREATE TABLE t(a numeric);
insert into t select i FROM generate_series(1, 10)i;
SELECT * FROM t WHERE a > 3;

Here we needs detoasted because of VARATT_CAN_MAKE_SHORT, and it needs to
be detoasted twice, one is in a > 3, the other one is in the targetlist,
where we need to call numeric_out.

Proposal
--------

When we access some desired toast column in EEOP_{SCAN|INNER_OUTER}_VAR
steps, we can detoast it immediately and save it back to
slot->tts_values. With this way, when any other expressions seeks for
the Datum, it get the detoast version. Planner decides which Vars
should use this feature, executor manages it detoast action and memory
management.

Planner design
---------------

1. This feature only happen at the Plan Node where the detoast would
happen in the previous topology, for example:

for example:

SELECT t1.toast_col, t2.toast_col FROM t1 join t2 USING (toast_col);

toast_col just happens at the Join node's slot even if we have a
projection on t1 or t2 at the scan node (except the Parameterized path).

However if

SELECT t1.toast_col, t2.toast_col
FROM t1 join t2
USING (toast_col)
WHERE t1.toast_col > 'a';

the detoast *may* happen at the scan of level t1 since "t1.toast_col >
'a'" accesses the Var within a FuncCall ('>' operator), which will
cause a detoast. (However it should not happen if it is under a Sort
node, for details see Planner Design section 2).

At the implementation side, I added "Bitmapset *reference_attrs;" to
Scan node which show if the Var should be accessed with the
pre-detoast way in expression execution engine. the value is
maintained at the create_plan/set_plan_refs stage.

Two similar fields are added in Join node.

Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;

In the both case, I tracked the level of walker/mutator, if the level
greater than 1 when we access a Var, the 'var->varattno - 1' is added
to the bitmapset. Some special node should be ignored, see
increase_level_for_pre_detoast for details.

2. We also need to make sure the detoast datum will not increase the
work_mem usage for the nodes like Sort/Hash etc, all of such nodes
can be found with search 'CP_SMALL_TLIST' flags.

If the a node under a Sort-Hash-like nodes, we have some extra
checking to see if a Var is a *directly input* of such nodes. If yes,
we can't detoast it in advance, or else, we know the Var has been
discarded before goes to these nodes, we still can use the shared
detoast feature.

The simplest cases to show this is:

For example:

2.1
Query-1
explain (verbose) select * from t1 where b > 'a';
-- b can be detoast in advance.

Query-2
explain (verbose) select * from t1 where b > 'a' order by c;
-- b can't be detoast since it will makes the Sort use more work_mem.

Query-3
explain (verbose) select a, c from t1 where b > 'a' order by c;
-- b can be pre-detoasted, since it is discarded before goes to Sort
node. In this case it doesn't do anything good, but for some complex
case like Query-4, it does.

Query-4
explain (costs off, verbose)
select t3.*
from t1, t2, t3
where t2.c > '999999999999999'
and t2.c = t1.c
and t3.b = t1.b;

QUERY PLAN
--------------------------------------------------------------
Hash Join
Output: t3.a, t3.b, t3.c
Hash Cond: (t3.b = t1.b)
-> Seq Scan on public.t3
Output: t3.a, t3.b, t3.c
-> Hash
Output: t1.b
-> Nested Loop
Output: t1.b <-- Note here 3
-> Seq Scan on public.t2
Output: t2.a, t2.b, t2.c
Filter: (t2.c > '9999...'::text) <--- Note Here 1
-> Index Scan using t1_c_idx on public.t1
Output: t1.a, t1.b, t1.c
Index Cond: (t1.c = t2.c) <--- Note Here 2
(15 rows)

In this case, detoast datum for t2.c can be shared and it benefits for
t2.c = t1.c and no harm for the Hash node.

Execution side
--------------

Once we decide a Var should be pre-detoasted for a given plan node, a
special step named as EEOP_{INNER/OUTER/SCAN}_VAR_TOAST will be created
during ExecInitExpr stage. The special steps are introduced to avoid its
impacts on the non-related EEOP_{INNER/OUTER/SCAN}_VAR code path.

slot->tts_values is used to store the detoast datum so that any other
expressions can access it pretty easily.

Because of the above design, the detoast datum should have a same
lifespan as any other slot->tts_values[*], so the default
ecxt_per_tuple_memory is not OK for this. At last I used slot->tts_mcxt
to hold the memory, and maintaining these lifecycles in execTuples.c. To
know which datum in slot->tts_values is pre-detoasted, Bitmapset *
slot->detoast_attrs is introduced.

During the execution of these steps, below code like this is used:

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
if (!slot->tts_isnull[attnum] && VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
(struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
if (oldDatum != slot->tts_values[attnum])
slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}
}

Testing
-------
- shared_detoast_slow.sql is used to test the planner related codes changes.

'set jit to off' will enable more INFO logs about which Var is
pre-detoast in which node level.

- the cases the patch doesn't help much.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;

Q3:
select a from w where a > 0;

Q4:
select b from w where b > 0;

pgbench -n -f 1.sql postgres -T 10 -M prepared

run 5 times and calculated the average value.

| Qry No | Master | patched | perf | comment |
|--------+---------+---------+-------+-----------------------------------------|
| 3 | 309.438 | 308.411 | 0 | nearly zero impact on them |
| 4 | 431.735 | 420.833 | +2.6% | save the detoast effort for numeric_out |

- good case

setup:

create table b(big jsonb);
insert into b select
jsonb_object_agg( x::text,
random()::text || random()::text || random()::text )
from generate_series(1,600) f(x);
insert into b select (select big from b) from generate_series(1, 1000)i;

workload:
Q1:
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

Q2:
select 1 from b where length(big->>'1') > 0 and length(big->>'2') > 2;

| No | Master | patched |
|----+--------+---------|
| 1 | 1677 | 357 |
| 2 | 638 | 332 |

Some Known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
3. This feature builds a strong relationship between slot->tts_values
and slot->pre_detoast_attrs. for example slot->tts_values[1] holds a
detoast datum, so 1 is in the slot->pre_detoast_attrs bitmapset,
however if someone changes the value directly, like
"slot->tts_values[1] = 2;" but without touching the slot's
pre_detoast_attrs then troubles comes. The good thing is the detoast
can only happen on Scan/Join nodes. so any other slot is impossible
to have such issue. I run the following command to find out such
cases, looks none of them is a slot form Scan/Join node.

egrep -nri 'slot->tts_values\[[^\]]*\] = *

Any thought?

--
Best Regards
Andy Fan

Attachments:

v1-0001-shared-detoast-feature.patchtext/x-diffDownload
From 3aea898dbfe89aa140051aeb3d3a8a1dbaec0b8a Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Wed, 27 Dec 2023 18:43:56 +0800
Subject: [PATCH v1 1/1] shared detoast feature.

---
 src/backend/executor/execExpr.c              |  65 ++-
 src/backend/executor/execExprInterp.c        | 172 ++++++
 src/backend/executor/execTuples.c            | 130 +++++
 src/backend/executor/execUtils.c             |   5 +
 src/backend/executor/nodeHashjoin.c          |   2 +
 src/backend/executor/nodeMergejoin.c         |   2 +
 src/backend/executor/nodeNestloop.c          |   1 +
 src/backend/nodes/bitmapset.c                |  13 +
 src/backend/optimizer/plan/createplan.c      |  73 ++-
 src/backend/optimizer/plan/setrefs.c         | 536 +++++++++++++++----
 src/include/executor/execExpr.h              |   5 +
 src/include/executor/tuptable.h              |  60 +++
 src/include/nodes/bitmapset.h                |   1 +
 src/include/nodes/execnodes.h                |   5 +
 src/include/nodes/plannodes.h                |  50 ++
 src/test/regress/sql/shared_detoast_slow.sql |  70 +++
 16 files changed, 1084 insertions(+), 106 deletions(-)
 create mode 100644 src/test/regress/sql/shared_detoast_slow.sql

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 2c62b0c9c8..749bcc9023 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -935,22 +935,81 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+								}
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24c2b60c62..a3191e30e1 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -165,6 +169,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -180,6 +187,34 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] && VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		if (oldDatum != slot->tts_values[attnum])
+			slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -295,6 +330,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -330,6 +383,7 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustConst;
 			return;
 		}
+		/* ???? */
 		else if (step0 == EEOP_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustInnerVarVirt;
@@ -345,6 +399,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -412,6 +481,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -595,6 +667,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2126,6 +2217,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2264,6 +2391,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 2c2712ceac..68cc3a2f2e 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+											  TupleDesc tupleDesc,
+											  List *not_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -394,6 +401,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), so let free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -459,6 +473,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -569,6 +586,12 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -639,6 +662,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -773,6 +799,12 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_nvalid = 0 means tts_values will be not reliable, so clear the
+	 * information about pre-detoast-datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -862,6 +894,7 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 
+
 	if (TTS_SHOULDFREE(slot))
 	{
 		/* materialized slot shouldn't have a buffer to release */
@@ -906,6 +939,8 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1140,6 +1175,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
 	slot->tts_nvalid = 0;
+	slot->pre_detoasted_attrs = NULL;
 
 	if (tupleDesc != NULL)
 	{
@@ -1812,12 +1848,30 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 *
+		 * the ps_ResultTupleSlot may also have detoast on its parent node,
+		 * like as a inner or outer slot in join case, the pre_detoast on
+		 * these slot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2338,3 +2392,79 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+							TupleDesc tupleDesc,
+							List *not_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(plan_pre_detoast_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * plan_pre_detoast_attrs may have some attribute which is not toast attrs
+	 * at all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	final = bms_intersect(plan_pre_detoast_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, not_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+/* Input slot, result slot. scan slot? */
+void
+SetPredetoastAttrsForScan(ScanState *scanstate)
+{
+}
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	/* Input slot, result slot. scan slot? */
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 16704c0c2f..3817979b37 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,11 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		/*
+		 * XXX: can I make sure the ps_ResultTupleDesc is set all the time?
+		 */
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 25a2d78f15..e3801d3b20 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 3cdab77dfc..648ab6903d 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index ebd1406843..c869a516ec 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 704879f566..d81f8afc33 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -743,6 +743,19 @@ bms_membership(const Bitmapset *a)
  *		foo = bms_add_member(foo, x);
  */
 
+/*
+ * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6?
+ * but I just found alloc memory and free the memory is too bad
+ * for this current feature. So let see ...;
+ */
+void
+bms_zero(Bitmapset *a)
+{
+	if (a == NULL)
+		return;
+
+	memset(a->words, 0, a->nwords * sizeof(bitmapword));
+}
 
 /*
  * bms_add_member - add a specified member to set
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 34ca6d4ac2..4d0e1b8eb1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -42,6 +42,7 @@
 #include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
+extern bool jit_enabled;
 
 /*
  * Flag bits that can appear in the flags argument of create_plan_recurse().
@@ -314,7 +315,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_not_detoast_attrs_recurse(Plan *plan,
+											   List *recheck_list);
 
 /*
  * create_plan
@@ -346,6 +348,8 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	set_plan_not_detoast_attrs_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +382,68 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	set the toast_attrs according recheck_list.
+ *
+ * recheck_list = NIL means we need to do thing.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *recheck_list)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	if (recheck_list == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(recheck_list, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
+
+static void
+set_plan_not_detoast_attrs_recurse(Plan *plan, List *recheck_list)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, recheck_list);
+
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_pre_detoast_vars(plan, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4884,6 +4950,11 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
+	if (!jit_enabled)
+		elog(INFO, "num_batches = %d", best_path->num_batches);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4bb68ac90e..df8da328a0 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,27 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+typedef struct
+{
+	/* var is added into existing_attrs for the first time. */
+	Bitmapset  *existing_attrs;
+	/* following add to the final_ref_attrs. */
+	Bitmapset **final_ref_attrs;
+}			intermediate_var_ref_context;
+
+typedef struct
+{
+	int			level_added;
+	int			level;
+}			intermediate_level_context;
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +88,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +147,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +178,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +211,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +652,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +671,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +694,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +740,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +758,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +781,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,17 +804,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
 			/* Needs special treatment, see comments below */
+			/* XXX: shall I do anything? */
 			return set_subqueryscan_references(root,
 											   (SubqueryScan *) plan,
 											   rtoffset);
@@ -757,12 +833,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +852,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +872,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +891,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +910,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +925,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +970,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1031,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1086,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1136,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1159,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1216,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1282,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1292,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1458,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1514,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1717,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1727,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1795,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2216,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context * ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context * ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context * ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context * ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context * level_ctx,
+					 intermediate_var_ref_context * ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2130,13 +2324,16 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2364,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2388,18 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		Node	   *n = fix_param_node(context->root, (Param *) node);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2200,7 +2410,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		if (aggparam != NULL)
 		{
 			/* Make a copy of the Param for paranoia's sake */
-			return (Node *) copyObject(aggparam);
+			Node	   *n = (Node *) copyObject(aggparam);
+
+			decreased_level_for_pre_detoast(&context->level_ctx);
+			return n;
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2423,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2432,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	  (AlternativeSubPlan *) node,
+																	  context->num_exec),
+											  context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2513,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2563,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2578,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2612,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2622,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3268,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3282,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3319,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3337,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3353,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3374,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3413,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3427,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3491,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3647,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 048573c2bc..f1da1f20d4 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -77,6 +77,11 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/* compute non-system Var value with shared-detoast-datum logic */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 4210d6d838..258ff8906e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,11 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST step.
+	 */
+	Bitmapset  *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,12 +431,38 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	if (bms_is_empty(slot->pre_detoasted_attrs))
+		return;
+
+	attnum = -1;
+	/* free the memory used by pre-detoasted datum and reset the flags. */
+	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	/*
+	 * bms_free each time cost too much, so just zero these bits and keep its
+	 * memory, just like what we did for TupleTableSlot. but.. see the
+	 * comments about the bms_zero.
+	 */
+	bms_zero(slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -449,6 +481,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -486,6 +522,30 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	/* Assert this assumption the below code depends on. */
+	Assert(dstslot->tts_nvalid == 0 ||
+		   dstslot->tts_nvalid == srcslot->tts_nvalid);
+
+	if (dstslot->tts_nvalid == srcslot->tts_nvalid
+		&& !bms_is_empty(srcslot->pre_detoasted_attrs))
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 161243b2d0..402ff02027 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -113,6 +113,7 @@ extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
+extern void bms_zero(Bitmapset *a);
 
 /* support for iterating through the integer elements of a set: */
 extern int	bms_next_member(const Bitmapset *a, int prevbit);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d7f17dee0..c3f7d19ba2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1474,6 +1474,7 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2003,6 +2004,8 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2764,4 +2767,6 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
+extern void SetPredetoastAttrsForScan(ScanState *scanstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d40af8e59f..de4f3f190a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash want the tuples as small as
+	 * possible. Its a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,13 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +803,14 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +891,14 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * keep the small tlist information in plan for the shared-detoast datum
+	 * logic.  If left_small_tlist is true, then all the datum in outerPlan
+	 * should not apply that logic, used for maintaining
+	 * forbid_pre_detoast_vars fields in Plan.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1618,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/test/regress/sql/shared_detoast_slow.sql b/src/test/regress/sql/shared_detoast_slow.sql
new file mode 100644
index 0000000000..beecaa6bc5
--- /dev/null
+++ b/src/test/regress/sql/shared_detoast_slow.sql
@@ -0,0 +1,70 @@
+create table t1(a text, b text, c text);
+create table t2(a text, b text, c text);
+create table t3(a text, b text, c text);
+
+insert into t1 select i, i, i from generate_series(1, 1000000)i;
+insert into t2 select i, i, i from generate_series(1, 1000000)i;
+insert into t3 select i, i, i from generate_series(1, 1000000)i;
+
+create index on t1(c);
+
+analyze t1;
+analyze t2;
+analyze t3;
+
+-- Turn off jit first, reasons:
+-- 1. JIT is not adapted for this feature, it may cause crash on jit.
+-- 2. more logging for this feature is enabled when jit=off
+set jit to off;
+
+explain (verbose) select * from t1 where b > 'a';
+
+
+-- NullTest has nothing with tts_values, so its access to toast value
+-- should be ignored.
+explain (verbose) select * from t1 where b is NULL and c is not null;
+
+-- b can't be shared-detoasted since it would make the work_mem bigger.
+explain (verbose) select * from t1 where b > 'a' order by c;
+
+-- b CAN be shared-detoasted since it would NOT make the work_mem bigger.
+-- but compared with the old behavior, it cause the lifespan of the
+-- 'detoast datum'
+-- longer, in the old behavior, it is reset becase of ExecQualAndReset.
+explain (verbose) select a, c from t1 where b > 'a' order by c;
+
+-- The detoast only happen at the join stage.
+explain (verbose) select * from t1 join t2 using(b);
+
+--
+explain (verbose) select * from t1 join t2 using(b) where t1.c > '3';
+
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b;
+
+
+-- t2.b the innerHash can't be pre detoast due to Hash.
+-- t1.b the outerPlan can't be pre detoast due to num_batch.
+explain (verbose) select * from t1 join t2 using(b);
+
+-- Increase the work_mem so that the num_batch = 1, then
+-- the t1.b in outerPlan can be pre-detoast.
+
+set work_mem to '1GB';
+explain (verbose) select * from t1 join t2 using(b);
+reset work_mem;
+
+-- Show even if a datum is under a hash node, but it is
+-- not DIRECTLY input of the Hash, it's still able to be
+-- pre detoasted.
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b
+and t1.b > '0';
-- 
2.34.1

#2Andy Fan
zhihuifan1213@163.com
In reply to: Andy Fan (#1)
1 attachment(s)
Re: Shared detoast Datum proposal

Andy Fan <zhihuifan1213@163.com> writes:

Some Known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.

JIT is adapted for this feature in v2. Any feedback is welcome.

--
Best Regards
Andy Fan

Attachments:

v2-0001-shared-detoast-feature.patchtext/x-diffDownload
From ee44d4721147589dbba8366382a18adeee05419b Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Wed, 27 Dec 2023 18:43:56 +0800
Subject: [PATCH v2 1/1] shared detoast feature.

---
 src/backend/executor/execExpr.c              |  65 ++-
 src/backend/executor/execExprInterp.c        | 181 +++++++
 src/backend/executor/execTuples.c            | 130 +++++
 src/backend/executor/execUtils.c             |   5 +
 src/backend/executor/nodeHashjoin.c          |   2 +
 src/backend/executor/nodeMergejoin.c         |   2 +
 src/backend/executor/nodeNestloop.c          |   1 +
 src/backend/jit/llvm/llvmjit_expr.c          |  22 +
 src/backend/jit/llvm/llvmjit_types.c         |   1 +
 src/backend/nodes/bitmapset.c                |  13 +
 src/backend/optimizer/plan/createplan.c      |  73 ++-
 src/backend/optimizer/plan/setrefs.c         | 536 +++++++++++++++----
 src/include/executor/execExpr.h              |   6 +
 src/include/executor/tuptable.h              |  60 +++
 src/include/nodes/bitmapset.h                |   1 +
 src/include/nodes/execnodes.h                |   5 +
 src/include/nodes/plannodes.h                |  50 ++
 src/test/regress/sql/shared_detoast_slow.sql |  70 +++
 18 files changed, 1117 insertions(+), 106 deletions(-)
 create mode 100644 src/test/regress/sql/shared_detoast_slow.sql

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 2c62b0c9c8..749bcc9023 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -935,22 +935,81 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+								}
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24c2b60c62..b5a464bf80 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -165,6 +169,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -180,6 +187,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		if (oldDatum != slot->tts_values[attnum])
+			slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -295,6 +339,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -330,6 +392,7 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustConst;
 			return;
 		}
+		/* ???? */
 		else if (step0 == EEOP_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustInnerVarVirt;
@@ -345,6 +408,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -412,6 +490,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -595,6 +676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2126,6 +2226,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2264,6 +2400,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 2c2712ceac..68cc3a2f2e 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+											  TupleDesc tupleDesc,
+											  List *not_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -394,6 +401,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), so let free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -459,6 +473,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -569,6 +586,12 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -639,6 +662,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -773,6 +799,12 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_nvalid = 0 means tts_values will be not reliable, so clear the
+	 * information about pre-detoast-datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -862,6 +894,7 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 
+
 	if (TTS_SHOULDFREE(slot))
 	{
 		/* materialized slot shouldn't have a buffer to release */
@@ -906,6 +939,8 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1140,6 +1175,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
 	slot->tts_nvalid = 0;
+	slot->pre_detoasted_attrs = NULL;
 
 	if (tupleDesc != NULL)
 	{
@@ -1812,12 +1848,30 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 *
+		 * the ps_ResultTupleSlot may also have detoast on its parent node,
+		 * like as a inner or outer slot in join case, the pre_detoast on
+		 * these slot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2338,3 +2392,79 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+							TupleDesc tupleDesc,
+							List *not_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(plan_pre_detoast_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * plan_pre_detoast_attrs may have some attribute which is not toast attrs
+	 * at all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	final = bms_intersect(plan_pre_detoast_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, not_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+/* Input slot, result slot. scan slot? */
+void
+SetPredetoastAttrsForScan(ScanState *scanstate)
+{
+}
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	/* Input slot, result slot. scan slot? */
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 16704c0c2f..3817979b37 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,11 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		/*
+		 * XXX: can I make sure the ps_ResultTupleDesc is set all the time?
+		 */
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 25a2d78f15..e3801d3b20 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 3cdab77dfc..648ab6903d 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index ebd1406843..c869a516ec 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index a3a0876bff..2fcc8dad36 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
 					if (opcode == EEOP_INNER_VAR)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
 					else if (opcode == EEOP_OUTER_VAR)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 791902ff1f..7aee794100 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -177,4 +177,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 704879f566..d81f8afc33 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -743,6 +743,19 @@ bms_membership(const Bitmapset *a)
  *		foo = bms_add_member(foo, x);
  */
 
+/*
+ * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6?
+ * but I just found alloc memory and free the memory is too bad
+ * for this current feature. So let see ...;
+ */
+void
+bms_zero(Bitmapset *a)
+{
+	if (a == NULL)
+		return;
+
+	memset(a->words, 0, a->nwords * sizeof(bitmapword));
+}
 
 /*
  * bms_add_member - add a specified member to set
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 34ca6d4ac2..4d0e1b8eb1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -42,6 +42,7 @@
 #include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
+extern bool jit_enabled;
 
 /*
  * Flag bits that can appear in the flags argument of create_plan_recurse().
@@ -314,7 +315,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_not_detoast_attrs_recurse(Plan *plan,
+											   List *recheck_list);
 
 /*
  * create_plan
@@ -346,6 +348,8 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	set_plan_not_detoast_attrs_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +382,68 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	set the toast_attrs according recheck_list.
+ *
+ * recheck_list = NIL means we need to do thing.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *recheck_list)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	if (recheck_list == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(recheck_list, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
+
+static void
+set_plan_not_detoast_attrs_recurse(Plan *plan, List *recheck_list)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, recheck_list);
+
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_pre_detoast_vars(plan, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4884,6 +4950,11 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
+	if (!jit_enabled)
+		elog(INFO, "num_batches = %d", best_path->num_batches);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 4bb68ac90e..df8da328a0 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,27 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+typedef struct
+{
+	/* var is added into existing_attrs for the first time. */
+	Bitmapset  *existing_attrs;
+	/* following add to the final_ref_attrs. */
+	Bitmapset **final_ref_attrs;
+}			intermediate_var_ref_context;
+
+typedef struct
+{
+	int			level_added;
+	int			level;
+}			intermediate_level_context;
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +88,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +147,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +178,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +211,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +652,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +671,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +694,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +740,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +758,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +781,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,17 +804,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
 			/* Needs special treatment, see comments below */
+			/* XXX: shall I do anything? */
 			return set_subqueryscan_references(root,
 											   (SubqueryScan *) plan,
 											   rtoffset);
@@ -757,12 +833,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +852,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +872,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +891,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +910,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +925,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +970,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1031,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1086,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1136,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1159,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1216,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1282,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1292,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1458,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1514,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1717,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1727,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1795,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2216,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context * ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context * ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context * ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context * ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context * level_ctx,
+					 intermediate_var_ref_context * ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2130,13 +2324,16 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2364,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2388,18 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		Node	   *n = fix_param_node(context->root, (Param *) node);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2200,7 +2410,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		if (aggparam != NULL)
 		{
 			/* Make a copy of the Param for paranoia's sake */
-			return (Node *) copyObject(aggparam);
+			Node	   *n = (Node *) copyObject(aggparam);
+
+			decreased_level_for_pre_detoast(&context->level_ctx);
+			return n;
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2423,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2432,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	  (AlternativeSubPlan *) node,
+																	  context->num_exec),
+											  context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2513,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2563,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2578,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2612,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2622,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3268,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3282,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3319,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3337,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3353,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3374,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3413,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3427,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3491,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3647,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 048573c2bc..39c33e51b1 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -77,6 +77,11 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/* compute non-system Var value with shared-detoast-datum logic */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -826,5 +831,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 4210d6d838..258ff8906e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,11 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST step.
+	 */
+	Bitmapset  *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -425,12 +431,38 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	if (bms_is_empty(slot->pre_detoasted_attrs))
+		return;
+
+	attnum = -1;
+	/* free the memory used by pre-detoasted datum and reset the flags. */
+	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	/*
+	 * bms_free each time cost too much, so just zero these bits and keep its
+	 * memory, just like what we did for TupleTableSlot. but.. see the
+	 * comments about the bms_zero.
+	 */
+	bms_zero(slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -449,6 +481,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -486,6 +522,30 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	/* Assert this assumption the below code depends on. */
+	Assert(dstslot->tts_nvalid == 0 ||
+		   dstslot->tts_nvalid == srcslot->tts_nvalid);
+
+	if (dstslot->tts_nvalid == srcslot->tts_nvalid
+		&& !bms_is_empty(srcslot->pre_detoasted_attrs))
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 161243b2d0..402ff02027 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -113,6 +113,7 @@ extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
+extern void bms_zero(Bitmapset *a);
 
 /* support for iterating through the integer elements of a set: */
 extern int	bms_next_member(const Bitmapset *a, int prevbit);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d7f17dee0..c3f7d19ba2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1474,6 +1474,7 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2003,6 +2004,8 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2764,4 +2767,6 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
+extern void SetPredetoastAttrsForScan(ScanState *scanstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d40af8e59f..de4f3f190a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash want the tuples as small as
+	 * possible. Its a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,13 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +803,14 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +891,14 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * keep the small tlist information in plan for the shared-detoast datum
+	 * logic.  If left_small_tlist is true, then all the datum in outerPlan
+	 * should not apply that logic, used for maintaining
+	 * forbid_pre_detoast_vars fields in Plan.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1618,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/test/regress/sql/shared_detoast_slow.sql b/src/test/regress/sql/shared_detoast_slow.sql
new file mode 100644
index 0000000000..beecaa6bc5
--- /dev/null
+++ b/src/test/regress/sql/shared_detoast_slow.sql
@@ -0,0 +1,70 @@
+create table t1(a text, b text, c text);
+create table t2(a text, b text, c text);
+create table t3(a text, b text, c text);
+
+insert into t1 select i, i, i from generate_series(1, 1000000)i;
+insert into t2 select i, i, i from generate_series(1, 1000000)i;
+insert into t3 select i, i, i from generate_series(1, 1000000)i;
+
+create index on t1(c);
+
+analyze t1;
+analyze t2;
+analyze t3;
+
+-- Turn off jit first, reasons:
+-- 1. JIT is not adapted for this feature, it may cause crash on jit.
+-- 2. more logging for this feature is enabled when jit=off
+set jit to off;
+
+explain (verbose) select * from t1 where b > 'a';
+
+
+-- NullTest has nothing with tts_values, so its access to toast value
+-- should be ignored.
+explain (verbose) select * from t1 where b is NULL and c is not null;
+
+-- b can't be shared-detoasted since it would make the work_mem bigger.
+explain (verbose) select * from t1 where b > 'a' order by c;
+
+-- b CAN be shared-detoasted since it would NOT make the work_mem bigger.
+-- but compared with the old behavior, it cause the lifespan of the
+-- 'detoast datum'
+-- longer, in the old behavior, it is reset becase of ExecQualAndReset.
+explain (verbose) select a, c from t1 where b > 'a' order by c;
+
+-- The detoast only happen at the join stage.
+explain (verbose) select * from t1 join t2 using(b);
+
+--
+explain (verbose) select * from t1 join t2 using(b) where t1.c > '3';
+
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b;
+
+
+-- t2.b the innerHash can't be pre detoast due to Hash.
+-- t1.b the outerPlan can't be pre detoast due to num_batch.
+explain (verbose) select * from t1 join t2 using(b);
+
+-- Increase the work_mem so that the num_batch = 1, then
+-- the t1.b in outerPlan can be pre-detoast.
+
+set work_mem to '1GB';
+explain (verbose) select * from t1 join t2 using(b);
+reset work_mem;
+
+-- Show even if a datum is under a hash node, but it is
+-- not DIRECTLY input of the Hash, it's still able to be
+-- pre detoasted.
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b
+and t1.b > '0';
-- 
2.34.1

#3vignesh C
vignesh21@gmail.com
In reply to: Andy Fan (#2)
Re: Shared detoast Datum proposal

On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote:

Andy Fan <zhihuifan1213@163.com> writes:

Some Known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.

JIT is adapted for this feature in v2. Any feedback is welcome.

One of the tests was aborted at CFBOT [1]https://cirrus-ci.com/task/4765094966460416 with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]
[09:47:01.748] [Thread debugging using libthread_db enabled]
[09:47:01.748] Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
[09:47:09.392] Core was generated by `postgres: postgres regression
[local] SELECT '.
[09:47:09.392] Program terminated with signal SIGSEGV, Segmentation fault.
[09:47:09.392] #0 0x00007fa4eed4a5a1 in ?? ()
[09:47:11.123]
[09:47:11.123] Thread 1 (Thread 0x7fa4f8050a40 (LWP 28182)):
[09:47:11.123] #0 0x00007fa4eed4a5a1 in ?? ()
[09:47:11.123] No symbol table info available.
...
...
[09:47:11.123] #4 0x00007fa4ebc7a186 in LLVMOrcGetSymbolAddress () at
/build/llvm-toolchain-11-HMpQvg/llvm-toolchain-11-11.0.1/llvm/lib/ExecutionEngine/Orc/OrcCBindings.cpp:124
[09:47:11.123] No locals.
[09:47:11.123] #5 0x00007fa4eed6fc7a in llvm_get_function
(context=0x564b1813a8a0, funcname=0x7fa4eed4a570 "AWAVATSH\201",
<incomplete sequence \354\210>) at
../src/backend/jit/llvm/llvmjit.c:460
[09:47:11.123] addr = 94880527996960
[09:47:11.123] __func__ = "llvm_get_function"
[09:47:11.123] #6 0x00007fa4eed902e1 in ExecRunCompiledExpr
(state=0x0, econtext=0x564b18269d20, isNull=0x7ffc11054d5f) at
../src/backend/jit/llvm/llvmjit_expr.c:2577
[09:47:11.123] cstate = <optimized out>
[09:47:11.123] func = 0x564b18269d20
[09:47:11.123] #7 0x0000564b1698e614 in ExecEvalExprSwitchContext
(isNull=0x7ffc11054d5f, econtext=0x564b18269d20, state=0x564b182ad820)
at ../src/include/executor/executor.h:355
[09:47:11.123] retDatum = <optimized out>
[09:47:11.123] oldContext = 0x564b182680d0
[09:47:11.123] retDatum = <optimized out>
[09:47:11.123] oldContext = <optimized out>
[09:47:11.123] #8 ExecProject (projInfo=0x564b182ad818) at
../src/include/executor/executor.h:389
[09:47:11.123] econtext = 0x564b18269d20
[09:47:11.123] state = 0x564b182ad820
[09:47:11.123] slot = 0x564b182ad788
[09:47:11.123] isnull = false
[09:47:11.123] econtext = <optimized out>
[09:47:11.123] state = <optimized out>
[09:47:11.123] slot = <optimized out>
[09:47:11.123] isnull = <optimized out>
[09:47:11.123] #9 ExecMergeJoin (pstate=<optimized out>) at
../src/backend/executor/nodeMergejoin.c:836
[09:47:11.123] node = <optimized out>
[09:47:11.123] joinqual = 0x0
[09:47:11.123] otherqual = 0x0
[09:47:11.123] qualResult = <optimized out>
[09:47:11.123] compareResult = <optimized out>
[09:47:11.123] innerPlan = <optimized out>
[09:47:11.123] innerTupleSlot = <optimized out>
[09:47:11.123] outerPlan = <optimized out>
[09:47:11.123] outerTupleSlot = <optimized out>
[09:47:11.123] econtext = 0x564b18269d20
[09:47:11.123] doFillOuter = false
[09:47:11.123] doFillInner = false
[09:47:11.123] __func__ = "ExecMergeJoin"
[09:47:11.123] #10 0x0000564b169275b9 in ExecProcNodeFirst
(node=0x564b18269db0) at ../src/backend/executor/execProcnode.c:464
[09:47:11.123] No locals.
[09:47:11.123] #11 0x0000564b169a2675 in ExecProcNode
(node=0x564b18269db0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #12 ExecRecursiveUnion (pstate=0x564b182684a0) at
../src/backend/executor/nodeRecursiveunion.c:115
[09:47:11.123] node = 0x564b182684a0
[09:47:11.123] outerPlan = 0x564b18268d00
[09:47:11.123] innerPlan = 0x564b18269db0
[09:47:11.123] plan = 0x564b182ddc78
[09:47:11.123] slot = <optimized out>
[09:47:11.123] isnew = false
[09:47:11.123] #13 0x0000564b1695a421 in ExecProcNode
(node=0x564b182684a0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #14 CteScanNext (node=0x564b183a6830) at
../src/backend/executor/nodeCtescan.c:103
[09:47:11.123] cteslot = <optimized out>
[09:47:11.123] estate = <optimized out>
[09:47:11.123] dir = ForwardScanDirection
[09:47:11.123] forward = true
[09:47:11.123] tuplestorestate = 0x564b183a5cd0
[09:47:11.123] eof_tuplestore = <optimized out>
[09:47:11.123] slot = 0x564b183a6be0
[09:47:11.123] #15 0x0000564b1692e22b in ExecScanFetch
(node=node@entry=0x564b183a6830,
accessMtd=accessMtd@entry=0x564b1695a183 <CteScanNext>,
recheckMtd=recheckMtd@entry=0x564b16959db3 <CteScanRecheck>) at
../src/backend/executor/execScan.c:132
[09:47:11.123] estate = <optimized out>
[09:47:11.123] #16 0x0000564b1692e332 in ExecScan
(node=0x564b183a6830, accessMtd=accessMtd@entry=0x564b1695a183
<CteScanNext>, recheckMtd=recheckMtd@entry=0x564b16959db3
<CteScanRecheck>) at ../src/backend/executor/execScan.c:181
[09:47:11.123] econtext = 0x564b183a6b50
[09:47:11.123] qual = 0x0
[09:47:11.123] projInfo = 0x0
[09:47:11.123] #17 0x0000564b1695a5ea in ExecCteScan
(pstate=<optimized out>) at ../src/backend/executor/nodeCtescan.c:164
[09:47:11.123] node = <optimized out>
[09:47:11.123] #18 0x0000564b169a81da in ExecProcNode
(node=0x564b183a6830) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #19 ExecSort (pstate=0x564b183a6620) at
../src/backend/executor/nodeSort.c:149
[09:47:11.123] plannode = 0x564b182dfb78
[09:47:11.123] outerNode = 0x564b183a6830
[09:47:11.123] tupDesc = <optimized out>
[09:47:11.123] tuplesortopts = <optimized out>
[09:47:11.123] node = 0x564b183a6620
[09:47:11.123] estate = 0x564b182681d0
[09:47:11.123] dir = ForwardScanDirection
[09:47:11.123] tuplesortstate = 0x564b18207ff0
[09:47:11.123] slot = <optimized out>
[09:47:11.123] #20 0x0000564b169275b9 in ExecProcNodeFirst
(node=0x564b183a6620) at ../src/backend/executor/execProcnode.c:464
[09:47:11.123] No locals.
[09:47:11.123] #21 0x0000564b16913d01 in ExecProcNode
(node=0x564b183a6620) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #22 ExecutePlan (estate=estate@entry=0x564b182681d0,
planstate=0x564b183a6620, use_parallel_mode=<optimized out>,
operation=operation@entry=CMD_SELECT,
sendTuples=sendTuples@entry=true, numberTuples=numberTuples@entry=0,
direction=ForwardScanDirection, dest=0x564b182e2728,
execute_once=true) at ../src/backend/executor/execMain.c:1670
[09:47:11.123] slot = <optimized out>
[09:47:11.123] current_tuple_count = 0
[09:47:11.123] #23 0x0000564b16914024 in standard_ExecutorRun
(queryDesc=0x564b181ba200, direction=ForwardScanDirection, count=0,
execute_once=<optimized out>) at
../src/backend/executor/execMain.c:365
[09:47:11.123] estate = 0x564b182681d0
[09:47:11.123] operation = CMD_SELECT
[09:47:11.123] dest = 0x564b182e2728
[09:47:11.123] sendTuples = true
[09:47:11.123] oldcontext = 0x564b181ba100
[09:47:11.123] __func__ = "standard_ExecutorRun"
[09:47:11.123] #24 0x0000564b1691418f in ExecutorRun
(queryDesc=queryDesc@entry=0x564b181ba200,
direction=direction@entry=ForwardScanDirection, count=count@entry=0,
execute_once=<optimized out>) at
../src/backend/executor/execMain.c:309
[09:47:11.123] No locals.
[09:47:11.123] #25 0x0000564b16d208af in PortalRunSelect
(portal=portal@entry=0x564b1817ae10, forward=forward@entry=true,
count=0, count@entry=9223372036854775807,
dest=dest@entry=0x564b182e2728) at ../src/backend/tcop/pquery.c:924
[09:47:11.123] queryDesc = 0x564b181ba200
[09:47:11.123] direction = <optimized out>
[09:47:11.123] nprocessed = <optimized out>
[09:47:11.123] __func__ = "PortalRunSelect"
[09:47:11.123] #26 0x0000564b16d2405b in PortalRun
(portal=portal@entry=0x564b1817ae10,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x564b182e2728, altdest=altdest@entry=0x564b182e2728,
qc=0x7ffc110551e0) at ../src/backend/tcop/pquery.c:768
[09:47:11.123] _save_exception_stack = 0x7ffc11055290
[09:47:11.123] _save_context_stack = 0x0
[09:47:11.123] _local_sigjmp_buf = {{__jmpbuf = {1,
3431825231787999889, 94880528213800, 94880526221072, 94880526741008,
94880526221000, -3433879442176195951, -8991885832768699759},
__mask_was_saved = 0, __saved_mask = {__val = {140720594047343, 688,
94880526749216, 94880511205302, 1, 140720594047343, 94880526999808, 8,
94880526213088, 112, 179, 94880526221072, 94880526221048,
94880526221000, 94880508502828, 2}}}}
[09:47:11.123] _do_rethrow = <optimized out>
[09:47:11.123] result = <optimized out>
[09:47:11.123] nprocessed = <optimized out>
[09:47:11.123] saveTopTransactionResourceOwner = 0x564b18139ad8
[09:47:11.123] saveTopTransactionContext = 0x564b181229f0
[09:47:11.123] saveActivePortal = 0x0
[09:47:11.123] saveResourceOwner = 0x564b18139ad8
[09:47:11.123] savePortalContext = 0x0
[09:47:11.123] saveMemoryContext = 0x564b181229f0
[09:47:11.123] __func__ = "PortalRun"
[09:47:11.123] #27 0x0000564b16d1d098 in exec_simple_query
(query_string=query_string@entry=0x564b180fa0e0 "with recursive
search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion
all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f =
sg.t\n) search depth first by f, t set seq\nselect * from search"...)
at ../src/backend/tcop/postgres.c:1273
[09:47:11.123] cmdtaglen = 6
[09:47:11.123] snapshot_set = <optimized out>
[09:47:11.123] per_parsetree_context = 0x0
[09:47:11.123] plantree_list = 0x564b182e26d8
[09:47:11.123] parsetree = 0x564b180fbec8
[09:47:11.123] commandTag = <optimized out>
[09:47:11.123] qc = {commandTag = CMDTAG_UNKNOWN, nprocessed = 0}
[09:47:11.123] querytree_list = <optimized out>
[09:47:11.123] portal = 0x564b1817ae10
[09:47:11.123] receiver = 0x564b182e2728
[09:47:11.123] format = 0
[09:47:11.123] cmdtagname = <optimized out>
[09:47:11.123] parsetree_item__state = {l = <optimized out>, i
= <optimized out>}
[09:47:11.123] dest = DestRemote
[09:47:11.123] oldcontext = 0x564b181229f0
[09:47:11.123] parsetree_list = 0x564b180fbef8
[09:47:11.123] parsetree_item = 0x564b180fbf10
[09:47:11.123] save_log_statement_stats = false
[09:47:11.123] was_logged = false
[09:47:11.123] use_implicit_block = false
[09:47:11.123] msec_str =
"\004\000\000\000\000\000\000\000\346[\376\026KV\000\000pR\005\021\374\177\000\000\335\000\000\000\000\000\000"
[09:47:11.123] __func__ = "exec_simple_query"
[09:47:11.123] #28 0x0000564b16d1fe33 in PostgresMain
(dbname=<optimized out>, username=<optimized out>) at
../src/backend/tcop/postgres.c:4653
[09:47:11.123] query_string = 0x564b180fa0e0 "with recursive
search_graph(f, t, label) as (\n\tselect * from graph0 g\n\tunion
all\n\tselect g.*\n\tfrom graph0 g, search_graph sg\n\twhere g.f =
sg.t\n) search depth first by f, t set seq\nselect * from search"...
[09:47:11.123] firstchar = <optimized out>
[09:47:11.123] input_message = {data = 0x564b180fa0e0 "with
recursive search_graph(f, t, label) as (\n\tselect * from graph0
g\n\tunion all\n\tselect g.*\n\tfrom graph0 g, search_graph
sg\n\twhere g.f = sg.t\n) search depth first by f, t set seq\nselect *
from search"..., len = 221, maxlen = 1024, cursor = 221}
[09:47:11.123] local_sigjmp_buf = {{__jmpbuf =
{94880520543032, -8991887800862096751, 0, 4, 140720594048004, 1,
-3433879442247499119, -8991885813233991023}, __mask_was_saved = 1,
__saved_mask = {__val = {4194304, 1, 140346553036196, 94880526187920,
15616, 15680, 94880508418872, 0, 94880526187920, 15616,
94880520537224, 4, 140720594048004, 1, 94880508502235, 1}}}}
[09:47:11.123] send_ready_for_query = false
[09:47:11.123] idle_in_transaction_timeout_enabled = false
[09:47:11.123] idle_session_timeout_enabled = false
[09:47:11.123] __func__ = "PostgresMain"
[09:47:11.123] #29 0x0000564b16bcc4e4 in BackendRun
(port=port@entry=0x564b18126f50) at
../src/backend/postmaster/postmaster.c:4464

[1]: https://cirrus-ci.com/task/4765094966460416

Regards,
Vignesh

#4Andy Fan
zhihuifan1213@163.com
In reply to: vignesh C (#3)
1 attachment(s)
Re: Shared detoast Datum proposal

Hi,

vignesh C <vignesh21@gmail.com> writes:

On Mon, 1 Jan 2024 at 19:26, Andy Fan <zhihuifan1213@163.com> wrote:

Andy Fan <zhihuifan1213@163.com> writes:

Some Known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.

JIT is adapted for this feature in v2. Any feedback is welcome.

One of the tests was aborted at CFBOT [1] with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]

There was a bug in JIT part, here is the fix. Thanks for taking care of
this!

--
Best Regards
Andy Fan

Attachments:

v3-0001-shared-detoast-feature.patchtext/x-diffDownload
From 06a212ad6b6254cbd38dbacc409165300ff7c677 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Wed, 27 Dec 2023 18:43:56 +0800
Subject: [PATCH v3 1/1] shared detoast feature.

---
 src/backend/executor/execExpr.c              |  65 ++-
 src/backend/executor/execExprInterp.c        | 181 +++++++
 src/backend/executor/execTuples.c            | 130 +++++
 src/backend/executor/execUtils.c             |   5 +
 src/backend/executor/nodeHashjoin.c          |   2 +
 src/backend/executor/nodeMergejoin.c         |   2 +
 src/backend/executor/nodeNestloop.c          |   1 +
 src/backend/jit/llvm/llvmjit_expr.c          |  26 +-
 src/backend/jit/llvm/llvmjit_types.c         |   1 +
 src/backend/nodes/bitmapset.c                |  13 +
 src/backend/optimizer/plan/createplan.c      |  73 ++-
 src/backend/optimizer/plan/setrefs.c         | 536 +++++++++++++++----
 src/include/executor/execExpr.h              |   6 +
 src/include/executor/tuptable.h              |  60 +++
 src/include/nodes/bitmapset.h                |   1 +
 src/include/nodes/execnodes.h                |   5 +
 src/include/nodes/plannodes.h                |  50 ++
 src/test/regress/sql/shared_detoast_slow.sql |  70 +++
 18 files changed, 1119 insertions(+), 108 deletions(-)
 create mode 100644 src/test/regress/sql/shared_detoast_slow.sql

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 91df2009be..4aeec8419f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,81 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+								}
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3c17cc6b1e..bd769fdeb6 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -165,6 +169,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -180,6 +187,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		if (oldDatum != slot->tts_values[attnum])
+			slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -295,6 +339,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -330,6 +392,7 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustConst;
 			return;
 		}
+		/* ???? */
 		else if (step0 == EEOP_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustInnerVarVirt;
@@ -345,6 +408,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -412,6 +490,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -595,6 +676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2126,6 +2226,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2264,6 +2400,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..08c96a42b1 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+											  TupleDesc tupleDesc,
+											  List *not_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -392,6 +399,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), so let free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -457,6 +471,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -567,6 +584,12 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -637,6 +660,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -771,6 +797,12 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_nvalid = 0 means tts_values will be not reliable, so clear the
+	 * information about pre-detoast-datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -860,6 +892,7 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 
+
 	if (TTS_SHOULDFREE(slot))
 	{
 		/* materialized slot shouldn't have a buffer to release */
@@ -904,6 +937,8 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1138,6 +1173,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
 	slot->tts_nvalid = 0;
+	slot->pre_detoasted_attrs = NULL;
 
 	if (tupleDesc != NULL)
 	{
@@ -1810,12 +1846,30 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 *
+		 * the ps_ResultTupleSlot may also have detoast on its parent node,
+		 * like as a inner or outer slot in join case, the pre_detoast on
+		 * these slot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2390,79 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+							TupleDesc tupleDesc,
+							List *not_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(plan_pre_detoast_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * plan_pre_detoast_attrs may have some attribute which is not toast attrs
+	 * at all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	final = bms_intersect(plan_pre_detoast_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, not_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+/* Input slot, result slot. scan slot? */
+void
+SetPredetoastAttrsForScan(ScanState *scanstate)
+{
+}
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	/* Input slot, result slot. scan slot? */
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..43de95f5b0 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,11 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		/*
+		 * XXX: can I make sure the ps_ResultTupleDesc is set all the time?
+		 */
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 33161d812f..56c10c9e4d 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 5212f529c8..11a11028b8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -177,4 +177,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index f4b61085be..e0eb150e5b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -774,6 +774,19 @@ bms_membership(const Bitmapset *a)
  *		foo = bms_add_member(foo, x);
  */
 
+/*
+ * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6?
+ * but I just found alloc memory and free the memory is too bad
+ * for this current feature. So let see ...;
+ */
+void
+bms_zero(Bitmapset *a)
+{
+	if (a == NULL)
+		return;
+
+	memset(a->words, 0, a->nwords * sizeof(bitmapword));
+}
 
 /*
  * bms_add_member - add a specified member to set
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ca619eab94..3f3aea73ce 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -42,6 +42,7 @@
 #include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
+extern bool jit_enabled;
 
 /*
  * Flag bits that can appear in the flags argument of create_plan_recurse().
@@ -314,7 +315,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_not_detoast_attrs_recurse(Plan *plan,
+											   List *recheck_list);
 
 /*
  * create_plan
@@ -346,6 +348,8 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	set_plan_not_detoast_attrs_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +382,68 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	set the toast_attrs according recheck_list.
+ *
+ * recheck_list = NIL means we need to do thing.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *recheck_list)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	if (recheck_list == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(recheck_list, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
+
+static void
+set_plan_not_detoast_attrs_recurse(Plan *plan, List *recheck_list)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, recheck_list);
+
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_pre_detoast_vars(plan, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4884,6 +4950,11 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
+	if (!jit_enabled)
+		elog(INFO, "num_batches = %d", best_path->num_batches);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..375e176788 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,27 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+typedef struct
+{
+	/* var is added into existing_attrs for the first time. */
+	Bitmapset  *existing_attrs;
+	/* following add to the final_ref_attrs. */
+	Bitmapset **final_ref_attrs;
+}			intermediate_var_ref_context;
+
+typedef struct
+{
+	int			level_added;
+	int			level;
+}			intermediate_level_context;
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +88,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +147,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +178,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +211,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +652,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +671,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +694,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +740,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +758,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +781,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,17 +804,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
 			/* Needs special treatment, see comments below */
+			/* XXX: shall I do anything? */
 			return set_subqueryscan_references(root,
 											   (SubqueryScan *) plan,
 											   rtoffset);
@@ -757,12 +833,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +852,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +872,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +891,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +910,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +925,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +970,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1031,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1086,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1136,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1159,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1216,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1282,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1292,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1458,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1514,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1717,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1727,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1795,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2216,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context * ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context * ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context * ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context * ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context * level_ctx,
+					 intermediate_var_ref_context * ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2130,13 +2324,16 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2364,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2388,18 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		Node	   *n = fix_param_node(context->root, (Param *) node);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2200,7 +2410,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		if (aggparam != NULL)
 		{
 			/* Make a copy of the Param for paranoia's sake */
-			return (Node *) copyObject(aggparam);
+			Node	   *n = (Node *) copyObject(aggparam);
+
+			decreased_level_for_pre_detoast(&context->level_ctx);
+			return n;
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2423,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2432,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	  (AlternativeSubPlan *) node,
+																	  context->num_exec),
+											  context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2513,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2563,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2578,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2612,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2622,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3268,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3282,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3319,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3337,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3353,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3374,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3413,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3427,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3491,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3647,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a20c539a25..9a31e083dd 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -77,6 +77,11 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/* compute non-system Var value with shared-detoast-datum logic */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -826,5 +831,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..5b42eb65f8 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,11 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST step.
+	 */
+	Bitmapset  *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -426,12 +432,38 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	if (bms_is_empty(slot->pre_detoasted_attrs))
+		return;
+
+	attnum = -1;
+	/* free the memory used by pre-detoasted datum and reset the flags. */
+	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	/*
+	 * bms_free each time cost too much, so just zero these bits and keep its
+	 * memory, just like what we did for TupleTableSlot. but.. see the
+	 * comments about the bms_zero.
+	 */
+	bms_zero(slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -450,6 +482,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -494,6 +530,30 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	/* Assert this assumption the below code depends on. */
+	Assert(dstslot->tts_nvalid == 0 ||
+		   dstslot->tts_nvalid == srcslot->tts_nvalid);
+
+	if (dstslot->tts_nvalid == srcslot->tts_nvalid
+		&& !bms_is_empty(srcslot->pre_detoasted_attrs))
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index efc8890ce6..e91f334254 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -113,6 +113,7 @@ extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
+extern void bms_zero(Bitmapset *a);
 
 /* support for iterating through the integer elements of a set: */
 extern int	bms_next_member(const Bitmapset *a, int prevbit);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 561fdd98f1..1a01d480ce 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1474,6 +1474,7 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2003,6 +2004,8 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2764,4 +2767,6 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
+extern void SetPredetoastAttrsForScan(ScanState *scanstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..2f2fec2740 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash want the tuples as small as
+	 * possible. Its a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,13 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +803,14 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +891,14 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * keep the small tlist information in plan for the shared-detoast datum
+	 * logic.  If left_small_tlist is true, then all the datum in outerPlan
+	 * should not apply that logic, used for maintaining
+	 * forbid_pre_detoast_vars fields in Plan.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1618,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/test/regress/sql/shared_detoast_slow.sql b/src/test/regress/sql/shared_detoast_slow.sql
new file mode 100644
index 0000000000..beecaa6bc5
--- /dev/null
+++ b/src/test/regress/sql/shared_detoast_slow.sql
@@ -0,0 +1,70 @@
+create table t1(a text, b text, c text);
+create table t2(a text, b text, c text);
+create table t3(a text, b text, c text);
+
+insert into t1 select i, i, i from generate_series(1, 1000000)i;
+insert into t2 select i, i, i from generate_series(1, 1000000)i;
+insert into t3 select i, i, i from generate_series(1, 1000000)i;
+
+create index on t1(c);
+
+analyze t1;
+analyze t2;
+analyze t3;
+
+-- Turn off jit first, reasons:
+-- 1. JIT is not adapted for this feature, it may cause crash on jit.
+-- 2. more logging for this feature is enabled when jit=off
+set jit to off;
+
+explain (verbose) select * from t1 where b > 'a';
+
+
+-- NullTest has nothing with tts_values, so its access to toast value
+-- should be ignored.
+explain (verbose) select * from t1 where b is NULL and c is not null;
+
+-- b can't be shared-detoasted since it would make the work_mem bigger.
+explain (verbose) select * from t1 where b > 'a' order by c;
+
+-- b CAN be shared-detoasted since it would NOT make the work_mem bigger.
+-- but compared with the old behavior, it cause the lifespan of the
+-- 'detoast datum'
+-- longer, in the old behavior, it is reset becase of ExecQualAndReset.
+explain (verbose) select a, c from t1 where b > 'a' order by c;
+
+-- The detoast only happen at the join stage.
+explain (verbose) select * from t1 join t2 using(b);
+
+--
+explain (verbose) select * from t1 join t2 using(b) where t1.c > '3';
+
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b;
+
+
+-- t2.b the innerHash can't be pre detoast due to Hash.
+-- t1.b the outerPlan can't be pre detoast due to num_batch.
+explain (verbose) select * from t1 join t2 using(b);
+
+-- Increase the work_mem so that the num_batch = 1, then
+-- the t1.b in outerPlan can be pre-detoast.
+
+set work_mem to '1GB';
+explain (verbose) select * from t1 join t2 using(b);
+reset work_mem;
+
+-- Show even if a datum is under a hash node, but it is
+-- not DIRECTLY input of the Hash, it's still able to be
+-- pre detoasted.
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b
+and t1.b > '0';
-- 
2.34.1

#5Andy Fan
zhihuifan1213@163.com
In reply to: Andy Fan (#4)
1 attachment(s)
Re: Shared detoast Datum proposal

Andy Fan <zhihuifan1213@163.com> writes:

One of the tests was aborted at CFBOT [1] with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]

There was a bug in JIT part, here is the fix. Thanks for taking care of
this!

Fixed a GCC warning in cirrus-ci, hope everything is fine now.

--
Best Regards
Andy Fan

Attachments:

v4-0001-shared-detoast-feature.patchtext/x-diffDownload
From 6dc858fae29486bea9125e7a3fb43a7081e62097 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Wed, 27 Dec 2023 18:43:56 +0800
Subject: [PATCH v4 1/1] shared detoast feature.

---
 src/backend/executor/execExpr.c              |  65 ++-
 src/backend/executor/execExprInterp.c        | 181 +++++++
 src/backend/executor/execTuples.c            | 130 +++++
 src/backend/executor/execUtils.c             |   5 +
 src/backend/executor/nodeHashjoin.c          |   2 +
 src/backend/executor/nodeMergejoin.c         |   2 +
 src/backend/executor/nodeNestloop.c          |   1 +
 src/backend/jit/llvm/llvmjit_expr.c          |  26 +-
 src/backend/jit/llvm/llvmjit_types.c         |   1 +
 src/backend/nodes/bitmapset.c                |  13 +
 src/backend/optimizer/plan/createplan.c      |  73 ++-
 src/backend/optimizer/plan/setrefs.c         | 529 +++++++++++++++----
 src/include/executor/execExpr.h              |   6 +
 src/include/executor/tuptable.h              |  60 +++
 src/include/nodes/bitmapset.h                |   1 +
 src/include/nodes/execnodes.h                |   5 +
 src/include/nodes/plannodes.h                |  50 ++
 src/test/regress/sql/shared_detoast_slow.sql |  70 +++
 18 files changed, 1114 insertions(+), 106 deletions(-)
 create mode 100644 src/test/regress/sql/shared_detoast_slow.sql

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 91df2009be..4aeec8419f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,81 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								/* debug purpose. */
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+								}
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								if (!jit_enabled)
+								{
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+								}
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3c17cc6b1e..bd769fdeb6 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -165,6 +169,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -180,6 +187,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		if (oldDatum != slot->tts_values[attnum])
+			slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -295,6 +339,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -330,6 +392,7 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustConst;
 			return;
 		}
+		/* ???? */
 		else if (step0 == EEOP_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustInnerVarVirt;
@@ -345,6 +408,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -412,6 +490,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -595,6 +676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2126,6 +2226,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2264,6 +2400,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..08c96a42b1 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+											  TupleDesc tupleDesc,
+											  List *not_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -392,6 +399,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), so let free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -457,6 +471,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -567,6 +584,12 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -637,6 +660,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -771,6 +797,12 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_nvalid = 0 means tts_values will be not reliable, so clear the
+	 * information about pre-detoast-datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -860,6 +892,7 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 
+
 	if (TTS_SHOULDFREE(slot))
 	{
 		/* materialized slot shouldn't have a buffer to release */
@@ -904,6 +937,8 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1138,6 +1173,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
 	slot->tts_nvalid = 0;
+	slot->pre_detoasted_attrs = NULL;
 
 	if (tupleDesc != NULL)
 	{
@@ -1810,12 +1846,30 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 *
+		 * the ps_ResultTupleSlot may also have detoast on its parent node,
+		 * like as a inner or outer slot in join case, the pre_detoast on
+		 * these slot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2390,79 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *plan_pre_detoast_attrs,
+							TupleDesc tupleDesc,
+							List *not_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(plan_pre_detoast_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * plan_pre_detoast_attrs may have some attribute which is not toast attrs
+	 * at all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	final = bms_intersect(plan_pre_detoast_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, not_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+/* Input slot, result slot. scan slot? */
+void
+SetPredetoastAttrsForScan(ScanState *scanstate)
+{
+}
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	/* Input slot, result slot. scan slot? */
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..43de95f5b0 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,11 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		/*
+		 * XXX: can I make sure the ps_ResultTupleDesc is set all the time?
+		 */
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 33161d812f..56c10c9e4d 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 5212f529c8..11a11028b8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -177,4 +177,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index f4b61085be..e0eb150e5b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -774,6 +774,19 @@ bms_membership(const Bitmapset *a)
  *		foo = bms_add_member(foo, x);
  */
 
+/*
+ * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6?
+ * but I just found alloc memory and free the memory is too bad
+ * for this current feature. So let see ...;
+ */
+void
+bms_zero(Bitmapset *a)
+{
+	if (a == NULL)
+		return;
+
+	memset(a->words, 0, a->nwords * sizeof(bitmapword));
+}
 
 /*
  * bms_add_member - add a specified member to set
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ca619eab94..3f3aea73ce 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -42,6 +42,7 @@
 #include "partitioning/partprune.h"
 #include "utils/lsyscache.h"
 
+extern bool jit_enabled;
 
 /*
  * Flag bits that can appear in the flags argument of create_plan_recurse().
@@ -314,7 +315,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_not_detoast_attrs_recurse(Plan *plan,
+											   List *recheck_list);
 
 /*
  * create_plan
@@ -346,6 +348,8 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	set_plan_not_detoast_attrs_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +382,68 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	set the toast_attrs according recheck_list.
+ *
+ * recheck_list = NIL means we need to do thing.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *recheck_list)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	if (recheck_list == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(recheck_list, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
+
+static void
+set_plan_not_detoast_attrs_recurse(Plan *plan, List *recheck_list)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, recheck_list);
+
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_pre_detoast_vars(plan, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, subplan_exprs);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		set_plan_not_detoast_attrs_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_not_detoast_attrs_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4884,6 +4950,11 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
+	if (!jit_enabled)
+		elog(INFO, "num_batches = %d", best_path->num_batches);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..b81ddf72c3 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,27 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+typedef struct
+{
+	/* var is added into existing_attrs for the first time. */
+	Bitmapset  *existing_attrs;
+	/* following add to the final_ref_attrs. */
+	Bitmapset **final_ref_attrs;
+}			intermediate_var_ref_context;
+
+typedef struct
+{
+	int			level_added;
+	int			level;
+}			intermediate_level_context;
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +88,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +147,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +178,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +211,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +652,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +671,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +694,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +740,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +758,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +781,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,17 +804,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
 			/* Needs special treatment, see comments below */
+			/* XXX: shall I do anything? */
 			return set_subqueryscan_references(root,
 											   (SubqueryScan *) plan,
 											   rtoffset);
@@ -757,12 +833,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +852,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +872,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +891,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +910,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +925,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +970,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1031,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1086,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1136,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1159,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1216,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1282,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1292,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1458,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1514,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1717,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1727,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1795,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2216,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context * ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context * ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context * ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context * ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context * level_ctx,
+					 intermediate_var_ref_context * ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2130,13 +2324,16 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2364,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2388,16 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2199,8 +2407,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2420,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2429,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2510,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2560,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2575,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2609,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2619,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3265,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3279,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3316,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3334,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3350,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3371,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3410,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3424,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3488,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3644,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a20c539a25..9a31e083dd 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -77,6 +77,11 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/* compute non-system Var value with shared-detoast-datum logic */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -826,5 +831,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..5b42eb65f8 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,11 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST step.
+	 */
+	Bitmapset  *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -426,12 +432,38 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	if (bms_is_empty(slot->pre_detoasted_attrs))
+		return;
+
+	attnum = -1;
+	/* free the memory used by pre-detoasted datum and reset the flags. */
+	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	/*
+	 * bms_free each time cost too much, so just zero these bits and keep its
+	 * memory, just like what we did for TupleTableSlot. but.. see the
+	 * comments about the bms_zero.
+	 */
+	bms_zero(slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -450,6 +482,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -494,6 +530,30 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	/* Assert this assumption the below code depends on. */
+	Assert(dstslot->tts_nvalid == 0 ||
+		   dstslot->tts_nvalid == srcslot->tts_nvalid);
+
+	if (dstslot->tts_nvalid == srcslot->tts_nvalid
+		&& !bms_is_empty(srcslot->pre_detoasted_attrs))
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index efc8890ce6..e91f334254 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -113,6 +113,7 @@ extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
+extern void bms_zero(Bitmapset *a);
 
 /* support for iterating through the integer elements of a set: */
 extern int	bms_next_member(const Bitmapset *a, int prevbit);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 561fdd98f1..1a01d480ce 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1474,6 +1474,7 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2003,6 +2004,8 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2764,4 +2767,6 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
+extern void SetPredetoastAttrsForScan(ScanState *scanstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..2f2fec2740 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash want the tuples as small as
+	 * possible. Its a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,13 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +803,14 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +891,14 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * keep the small tlist information in plan for the shared-detoast datum
+	 * logic.  If left_small_tlist is true, then all the datum in outerPlan
+	 * should not apply that logic, used for maintaining
+	 * forbid_pre_detoast_vars fields in Plan.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1618,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/test/regress/sql/shared_detoast_slow.sql b/src/test/regress/sql/shared_detoast_slow.sql
new file mode 100644
index 0000000000..beecaa6bc5
--- /dev/null
+++ b/src/test/regress/sql/shared_detoast_slow.sql
@@ -0,0 +1,70 @@
+create table t1(a text, b text, c text);
+create table t2(a text, b text, c text);
+create table t3(a text, b text, c text);
+
+insert into t1 select i, i, i from generate_series(1, 1000000)i;
+insert into t2 select i, i, i from generate_series(1, 1000000)i;
+insert into t3 select i, i, i from generate_series(1, 1000000)i;
+
+create index on t1(c);
+
+analyze t1;
+analyze t2;
+analyze t3;
+
+-- Turn off jit first, reasons:
+-- 1. JIT is not adapted for this feature, it may cause crash on jit.
+-- 2. more logging for this feature is enabled when jit=off
+set jit to off;
+
+explain (verbose) select * from t1 where b > 'a';
+
+
+-- NullTest has nothing with tts_values, so its access to toast value
+-- should be ignored.
+explain (verbose) select * from t1 where b is NULL and c is not null;
+
+-- b can't be shared-detoasted since it would make the work_mem bigger.
+explain (verbose) select * from t1 where b > 'a' order by c;
+
+-- b CAN be shared-detoasted since it would NOT make the work_mem bigger.
+-- but compared with the old behavior, it cause the lifespan of the
+-- 'detoast datum'
+-- longer, in the old behavior, it is reset becase of ExecQualAndReset.
+explain (verbose) select a, c from t1 where b > 'a' order by c;
+
+-- The detoast only happen at the join stage.
+explain (verbose) select * from t1 join t2 using(b);
+
+--
+explain (verbose) select * from t1 join t2 using(b) where t1.c > '3';
+
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b;
+
+
+-- t2.b the innerHash can't be pre detoast due to Hash.
+-- t1.b the outerPlan can't be pre detoast due to num_batch.
+explain (verbose) select * from t1 join t2 using(b);
+
+-- Increase the work_mem so that the num_batch = 1, then
+-- the t1.b in outerPlan can be pre-detoast.
+
+set work_mem to '1GB';
+explain (verbose) select * from t1 join t2 using(b);
+reset work_mem;
+
+-- Show even if a datum is under a hash node, but it is
+-- not DIRECTLY input of the Hash, it's still able to be
+-- pre detoasted.
+explain (verbose)
+select t3.*
+from t1, t2, t3
+where t2.c > '999999999999999'
+and t2.c = t1.c
+and t3.b = t1.b
+and t1.b > '0';
-- 
2.34.1

#6Peter Smith
smithpb2250@gmail.com
In reply to: Andy Fan (#5)
Re: Shared detoast Datum proposal

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1]https://commitfest.postgresql.org/46/4759/, but it seems
there were CFbot test failures last time it was run [2]https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759. Please have a
look and post an updated version if necessary.

======
[1]: https://commitfest.postgresql.org/46/4759/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

Kind Regards,
Peter Smith.

#7Andy Fan
zhihuifan1213@163.com
In reply to: Peter Smith (#6)
Re: Shared detoast Datum proposal

Hi,

Peter Smith <smithpb2250@gmail.com> writes:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached, it should fix the above issue. This version also introduce
a data struct called bitset, which has a similar APIs like bitmapset but
have the ability to reset all bits without recycle its allocated memory,
this is important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> shared_detoast_value_v2)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Tue Jan 23 13:38:34 2024 +0800

shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Mon Jan 22 15:48:33 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for variable-length of bits, Bitset is
designed for fixed-length of bits, the fixed length must be specified at
the bitset_init stage and keep unchanged at the whole lifespan. Because
of this, some operations on Bitset is simpler than Bitmapset.

The bitset_clear unsets all the bits but kept the allocated memory, this
capacity is impossible for bit Bitmapset for some solid reasons and this
is the main reason to add this data struct.

Also for performance aspect, the functions for Bitset removed some
unlikely checks, instead with some Asserts.

[1]: /messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
[2]: /messages/by-id/875xzqxbv5.fsf@163.com

I didn't write a good commit message for commit 2, the people who is
interested with this can see the first message in this thread for
explaination.

I think anyone whose customer uses lots of jsonb probably can get
benefits from this. the precondition is the toast value should be
accessed 1+ times, including the jsonb_out function. I think this would
be not rare to happen.

--
Best Regards
Andy Fan

#8Michael Zhilin
m.zhilin@postgrespro.ru
In reply to: Andy Fan (#7)
Re: Shared detoast Datum proposal

Hi Andy,

It looks like v5 is missing in your mail. Could you please check and
resend it?

Thanks,
 Michael.

On 1/23/24 08:44, Andy Fan wrote:

Hi,

Peter Smith<smithpb2250@gmail.com> writes:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1]https://commitfest.postgresql.org/46/4759/
[2]https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached, it should fix the above issue. This version also introduce
a data struct called bitset, which has a similar APIs like bitmapset but
have the ability to reset all bits without recycle its allocated memory,
this is important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> shared_detoast_value_v2)
Author: yizhi.fzh<yizhi.fzh@alibaba-inc.com>
Date: Tue Jan 23 13:38:34 2024 +0800

shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh<yizhi.fzh@alibaba-inc.com>
Date: Mon Jan 22 15:48:33 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for variable-length of bits, Bitset is
designed for fixed-length of bits, the fixed length must be specified at
the bitset_init stage and keep unchanged at the whole lifespan. Because
of this, some operations on Bitset is simpler than Bitmapset.

The bitset_clear unsets all the bits but kept the allocated memory, this
capacity is impossible for bit Bitmapset for some solid reasons and this
is the main reason to add this data struct.

Also for performance aspect, the functions for Bitset removed some
unlikely checks, instead with some Asserts.

[1]/messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
[2]/messages/by-id/875xzqxbv5.fsf@163.com

I didn't write a good commit message for commit 2, the people who is
interested with this can see the first message in this thread for
explaination.

I think anyone whose customer uses lots of jsonb probably can get
benefits from this. the precondition is the toast value should be
accessed 1+ times, including the jsonb_out function. I think this would
be not rare to happen.

--
Michael Zhilin
Postgres Professional
+7(925)3366270
https://www.postgrespro.ru

#9Andy Fan
zhihuifan1213@163.com
In reply to: Michael Zhilin (#8)
2 attachment(s)
Re: Shared detoast Datum proposal

Michael Zhilin <m.zhilin@postgrespro.ru> writes:

Hi Andy,

It looks like v5 is missing in your mail. Could you please check and resend it?

ha, yes.. v5 is really attached this time.

commit eee0b2058f912d0d56282711c5d88bc0b1b75c2f (HEAD -> shared_detoast_value_v3)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Tue Jan 23 13:38:34 2024 +0800

shared detoast feature.

details at /messages/by-id/87il4jrk1l.fsf@163.com

commit eeca405f5ae87e7d4e5496de989ac7b5173bcaa9
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Mon Jan 22 15:48:33 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for variable-length of bits, Bitset is
designed for fixed-length of bits, the fixed length must be specified at
the bitset_init stage and keep unchanged at the whole lifespan. Because
of this, some operations on Bitset is simpler than Bitmapset.

The bitset_clear unsets all the bits but kept the allocated memory, this
capacity is impossible for bit Bitmapset for some solid reasons and this
is the main reason to add this data struct.

Also for performance aspect, the functions for Bitset removed some
unlikely checks, instead with some Asserts.

[1]: /messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
[2]: /messages/by-id/875xzqxbv5.fsf@163.com

As for the commit "Introduce a Bitset data struct.", the test coverage
is 100% now. So it would be great that people can review this first.

--
Best Regards
Andy Fan

Attachments:

v5-0001-Introduce-a-Bitset-data-struct.patchtext/x-diffDownload
From eeca405f5ae87e7d4e5496de989ac7b5173bcaa9 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 22 Jan 2024 15:48:33 +0800
Subject: [PATCH v5 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for variable-length of bits, Bitset is
designed for fixed-length of bits, the fixed length must be specified at
the bitset_init stage and keep unchanged at the whole lifespan. Because
of this, some operations on Bitset is simpler than Bitmapset.

The bitset_clear unsets all the bits but kept the allocated memory, this
capacity is impossible for bit Bitmapset for some solid reasons and this
is the main reason to add this data struct.

Also for performance aspect, the functions for Bitset removed some
unlikely checks, instead with some Asserts.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c                 | 201 +++++++++++++++++-
 src/backend/nodes/outfuncs.c                  |  51 +++++
 src/include/nodes/bitmapset.h                 |  28 +++
 src/include/nodes/nodes.h                     |   4 +
 src/test/modules/test_misc/Makefile           |  11 +
 src/test/modules/test_misc/README             |   4 +-
 .../test_misc/expected/test_bitset.out        |   7 +
 src/test/modules/test_misc/meson.build        |  17 ++
 .../modules/test_misc/sql/test_bitset.sql     |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c        | 132 ++++++++++++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list              |   1 +
 13 files changed, 456 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..eb38834e21 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1348,9 +1343,23 @@ bms_next_member(const Bitmapset *a, int prevbit)
 		/* in subsequent words, consider all bits */
 		mask = (~(bitmapword) 0);
 	}
+
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1467,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 					  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset. the set will be round up to nwords;
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = size + BITS_PER_BITMAPWORD - 1 / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset *) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] &= ~((bitmapword) 1 << bitnum);
+}
+
+int
+bitset_is_member(int x, Bitset *a)
+{
+	int			wordnum,
+				bitnum;
+
+	/* used in expression engine */
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	if (a == NULL)
+		return false;
+
+	if (wordnum >= a->nwords)
+		return false;
+
+	return (a->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+int
+bitset_next_member(const Bitset *a, int prevbit)
+{
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
+
+/*
+ * bitset_to_bitmap - build a legal bitmapset from bitset.
+ */
+Bitmapset *
+bitset_to_bitmap(Bitset *a)
+{
+	int			n;
+
+	bool		found = false;	/* any non-empty bits */
+	Bitmapset  *result;
+	int			i;
+
+	if (a == NULL)
+		return NULL;
+
+	n = a->nwords - 1;
+	do
+	{
+		if (a->words[n] > 0)
+		{
+			found = true;
+			break;
+		}
+	} while (--n >= 0);
+
+	if (!found)
+		return NULL;
+
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(n + 1));
+	result->type = T_Bitmapset;
+	result->nwords = n + 1;
+
+	Assert(result->nwords <= a->nwords);
+
+	i = 0;
+	do
+	{
+		result->words[i] = a->words[i];
+	} while (++i < result->nwords);
+
+	return result;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 296ba84518..2607cd2bfd 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -331,6 +331,43 @@ outBitmapset(StringInfo str, const Bitmapset *bms)
 	appendStringInfoChar(str, ')');
 }
 
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+static void
+outBitsetInternal(struct StringInfoData *str,
+				  const struct Bitset *bs,
+				  bool asBitmap)
+{
+	int			x;
+
+	appendStringInfoChar(str, '(');
+	if (asBitmap)
+		appendStringInfoChar(str, 'b');
+	else
+		appendStringInfo(str, "bs");
+	x = -1;
+	while ((x = bitset_next_member(bs, x)) >= 0)
+		appendStringInfo(str, " %d", x);
+	appendStringInfoChar(str, ')');
+}
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+void
+outBitset(struct StringInfoData *str,
+		  const struct Bitset *bs)
+{
+	outBitsetInternal(str, bs, false);
+}
+
+
 /*
  * Print the value of a Datum given its type.
  */
@@ -904,3 +941,17 @@ bmsToString(const Bitmapset *bms)
 	outBitmapset(&str, bms);
 	return str.data;
 }
+
+/*
+ * bitsetToString -
+ *	   similar to bmsToString, but for Bitset
+ */
+char *
+bitsetToString(const struct Bitset *bs, bool asBitmap)
+{
+	StringInfoData str;
+
+	initStringInfo(&str);
+	outBitsetInternal(&str, bs, asBitmap);
+	return str.data;
+}
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 906e8dcc15..95ff37c6e9 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -55,6 +55,24 @@ typedef struct Bitmapset
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
 } Bitmapset;
 
+/*
+ * While Bitmapset is designed for variable-length of bits, Bitset is
+ * designed for fixed-length of bits, the fixed length must be specified at
+ * the bitset_init stage and keep unchanged at the whole lifespan. Because
+ * of this, some operations on Bitset is simpler than Bitmapset.
+ *
+ * The bitset_clear unsets all the bits but kept the allocated memory, this
+ * capacity is impossible for bit Bitmapset for some solid reasons.
+ *
+ * Also for performance aspect, the functions for Bitset removed some
+ * unlikely checks, instead with some Asserts.
+ */
+
+typedef struct Bitset
+{
+	int			nwords;			/* number of words in array */
+	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
+} Bitset;
 
 /* result of bms_subset_compare */
 typedef enum
@@ -124,4 +142,14 @@ extern uint32 bms_hash_value(const Bitmapset *a);
 extern uint32 bitmap_hash(const void *key, Size keysize);
 extern int	bitmap_match(const void *key1, const void *key2, Size keysize);
 
+extern Bitset *bitset_init(size_t size);
+extern void bitset_clear(Bitset *a);
+extern void bitset_free(Bitset *a);
+extern bool bitset_is_empty(Bitset *a);
+extern Bitset *bitset_copy(Bitset *a);
+extern void bitset_add_member(Bitset *a, int x);
+extern void bitset_del_member(Bitset *a, int x);
+extern int	bitset_is_member(int bit, Bitset *a);
+extern int	bitset_next_member(const Bitset *a, int prevbit);
+extern Bitmapset *bitset_to_bitmap(Bitset *a);
 #endif							/* BITMAPSET_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2969dd831b..4d13107990 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -186,16 +186,20 @@ castNodeImpl(NodeTag type, void *ptr)
  * nodes/{outfuncs.c,print.c}
  */
 struct Bitmapset;				/* not to include bitmapset.h here */
+struct Bitset;					/* not to include bitmapset.h here */
 struct StringInfoData;			/* not to include stringinfo.h here */
 
 extern void outNode(struct StringInfoData *str, const void *obj);
 extern void outToken(struct StringInfoData *str, const char *s);
 extern void outBitmapset(struct StringInfoData *str,
 						 const struct Bitmapset *bms);
+extern void outBitset(struct StringInfoData *str, const struct Bitset *bs);
+
 extern void outDatum(struct StringInfoData *str, uintptr_t value,
 					 int typlen, bool typbyval);
 extern char *nodeToString(const void *obj);
 extern char *bmsToString(const struct Bitmapset *bms);
+extern char *bitsetToString(const struct Bitset *bs, bool asBitmap);
 
 /*
  * nodes/{readfuncs.c,read.c}
diff --git a/src/test/modules/test_misc/Makefile b/src/test/modules/test_misc/Makefile
index 39c6c2014a..af96604096 100644
--- a/src/test/modules/test_misc/Makefile
+++ b/src/test/modules/test_misc/Makefile
@@ -2,6 +2,17 @@
 
 TAP_TESTS = 1
 
+MODULE_big = test_misc
+OBJS = \
+	$(WIN32RES) \
+	test_misc.o
+PGFILEDESC = "test_misc"
+
+EXTENSION = test_misc
+DATA = test_misc--1.0.sql
+
+REGRESS = test_bitset
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/test_misc/README b/src/test/modules/test_misc/README
index 4876733fa2..ec426c4ad5 100644
--- a/src/test/modules/test_misc/README
+++ b/src/test/modules/test_misc/README
@@ -1,4 +1,2 @@
-This directory doesn't actually contain any extension module.
-
-What it is is a home for otherwise-unclassified TAP tests that exercise core
+What it is is a home for otherwise-unclassified tests that exercise core
 server features.  We might equally well have called it, say, src/test/misc.
diff --git a/src/test/modules/test_misc/expected/test_bitset.out b/src/test/modules/test_misc/expected/test_bitset.out
new file mode 100644
index 0000000000..3d0302d30d
--- /dev/null
+++ b/src/test/modules/test_misc/expected/test_bitset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_misc;
+SELECT test_bitset();
+ test_bitset 
+-------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 964d95db26..a23f3e3f47 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -1,5 +1,22 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+test_misc_sources = files(
+    'test_misc.c',
+)
+
+if host_system == 'windows'
+  test_misc_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_misc',
+    '--FILEDESC', 'test_misc - ',])
+endif
+
+test_misc = shared_module('test_misc',
+  test_misc_sources,
+  kwargs: pg_test_mod_args,
+)
+
+test_install_libs += test_misc
+
 tests += {
   'name': 'test_misc',
   'sd': meson.current_source_dir(),
diff --git a/src/test/modules/test_misc/sql/test_bitset.sql b/src/test/modules/test_misc/sql/test_bitset.sql
new file mode 100644
index 0000000000..0f73bbf532
--- /dev/null
+++ b/src/test/modules/test_misc/sql/test_bitset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_misc;
+
+SELECT test_bitset();
diff --git a/src/test/modules/test_misc/test_misc--1.0.sql b/src/test/modules/test_misc/test_misc--1.0.sql
new file mode 100644
index 0000000000..79afaa6263
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc--1.0.sql
@@ -0,0 +1,5 @@
+\echo Use "CREATE EXTENSION test_misc" to load this file. \quit
+
+CREATE FUNCTION test_bitset()
+       RETURNS pg_catalog.void
+       AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_misc/test_misc.c b/src/test/modules/test_misc/test_misc.c
new file mode 100644
index 0000000000..8cc8d6fc1e
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.c
@@ -0,0 +1,132 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_misc.c
+ *
+ * Copyright (c) 2022-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_dsa/test_misc.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "nodes/bitmapset.h"
+#include "nodes/nodes.h"
+
+#define BIT_ADD 0
+#define BIT_DEL 1
+
+static void compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op);
+
+
+PG_MODULE_MAGIC;
+
+/* Test basic DSA functionality */
+PG_FUNCTION_INFO_V1(test_bitset);
+
+Datum
+test_bitset(PG_FUNCTION_ARGS)
+{
+	Bitset	   *bs;
+	Bitset	   *bs2;
+	char	   *str1,
+			   *str2,
+			   *empty_str;
+	Bitmapset  *bms = NULL;
+	int			i;
+
+	empty_str = bmsToString(NULL);
+
+	/* size = 0 */
+	bs = bitset_init(0);
+	Assert(bs == NULL);
+	bitset_clear(bs);
+	Assert(bitset_is_empty(bs));
+	/* bitset_add_member(bs, 0); // crash. */
+	/* bitset_del_member(bs, 0); // crash. */
+	Assert(!bitset_is_member(0, bs));
+	Assert(bitset_next_member(bs, -1) == -2);
+	bs2 = bitset_copy(bs);
+	Assert(bs2 == NULL);
+	bitset_free(bs);
+	bitset_free(bs2);
+
+	/* size == 68, nword == 2 */
+	bs = bitset_init(68);
+
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_ADD);
+	}
+
+	for (i = 67; i >= 0; i = i - 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_DEL);
+	}
+
+	bitset_clear(bs);
+	str1 = bitsetToString(bs, true);
+	Assert(strcmp(str1, empty_str) == 0);
+
+	bms = bitset_to_bitmap(bs);
+	str2 = bmsToString(bms);
+	Assert(strcmp(str1, str2) == 0);
+
+	bms = bitset_to_bitmap(NULL);
+	Assert(strcmp(bmsToString(bms), empty_str) == 0);
+
+	bitset_free(bs);
+
+	PG_RETURN_VOID();
+}
+
+
+static void
+compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op)
+{
+	char	   *str1,
+			   *str2,
+			   *str3,
+			   *str4;
+	Bitmapset  *bms3;
+	Bitset	   *bs4;
+
+	if (op == BIT_ADD)
+	{
+		*bms = bms_add_member(*bms, member);
+		bitset_add_member(bs, member);
+		Assert(bms_is_member(member, *bms));
+		Assert(bitset_is_member(member, bs));
+	}
+	else if (op == BIT_DEL)
+	{
+		*bms = bms_del_member(*bms, member);
+		bitset_del_member(bs, member);
+		Assert(!bms_is_member(member, *bms));
+		Assert(!bitset_is_member(member, bs));
+	}
+	else
+		Assert(false);
+
+	/* compare the rest existing bit */
+	str1 = bmsToString(*bms);
+	str2 = bitsetToString(bs, true);
+	Assert(strcmp(str1, str2) == 0);
+
+	/* test bitset_to_bitmap */
+	bms3 = bitset_to_bitmap(bs);
+	str3 = bmsToString(bms3);
+	Assert(strcmp(str3, str2) == 0);
+
+	/* test bitset_copy */
+	bs4 = bitset_copy(bs);
+	str4 = bitsetToString(bs4, true);
+	Assert(strcmp(str3, str4) == 0);
+
+	pfree(str1);
+	pfree(str2);
+	pfree(str3);
+	pfree(str4);
+}
diff --git a/src/test/modules/test_misc/test_misc.control b/src/test/modules/test_misc/test_misc.control
new file mode 100644
index 0000000000..48fd08758f
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.control
@@ -0,0 +1,4 @@
+comment = 'Test misc'
+default_version = '1.0'
+module_pathname = '$libdir/test_misc'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7e866e3c3d..50b06fd878 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -267,6 +267,7 @@ BitmapOr
 BitmapOrPath
 BitmapOrState
 Bitmapset
+Bitset
 Block
 BlockId
 BlockIdData
-- 
2.34.1

v5-0002-shared-detoast-feature.patchtext/x-diffDownload
From eee0b2058f912d0d56282711c5d88bc0b1b75c2f Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 23 Jan 2024 13:38:34 +0800
Subject: [PATCH v5 2/2] shared detoast feature.

details at https://postgr.es/m/87il4jrk1l.fsf%40163.com
---
 src/backend/executor/execExpr.c         |  35 +-
 src/backend/executor/execExprInterp.c   | 180 ++++++++
 src/backend/executor/execTuples.c       | 135 ++++++
 src/backend/executor/execUtils.c        |   2 +
 src/backend/executor/nodeHashjoin.c     |   2 +
 src/backend/executor/nodeMergejoin.c    |   2 +
 src/backend/executor/nodeNestloop.c     |   1 +
 src/backend/jit/llvm/llvmjit_expr.c     |  26 +-
 src/backend/jit/llvm/llvmjit_types.c    |   1 +
 src/backend/optimizer/plan/createplan.c |  94 ++++-
 src/backend/optimizer/plan/setrefs.c    | 528 +++++++++++++++++++-----
 src/include/executor/execExpr.h         |   6 +
 src/include/executor/tuptable.h         |  60 +++
 src/include/nodes/execnodes.h           |   4 +
 src/include/nodes/plannodes.h           |  47 +++
 src/tools/pgindent/typedefs.list        |   2 +
 16 files changed, 1019 insertions(+), 106 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 91df2009be..776d4623d1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,51 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3c17cc6b1e..f3cd365a09 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -165,6 +169,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -180,6 +187,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		if (oldDatum != slot->tts_values[attnum])
+			bitset_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -295,6 +339,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -345,6 +407,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -412,6 +489,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -595,6 +675,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2126,6 +2225,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2264,6 +2399,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..e5af9962b6 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+											  TupleDesc tupleDesc,
+											  List *forbid_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bitset_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -392,6 +399,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated since tts_nvalid is set to 0, so let free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -457,6 +471,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -567,6 +584,12 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated as non valid (tts_nvalid = 0), free the
+	 * pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -637,6 +660,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -771,6 +797,12 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_nvalid = 0 means tts_values will be not reliable, so clear the
+	 * information about pre-detoast-datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -904,6 +936,9 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1150,7 +1185,10 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 			 + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
 
 		PinTupleDesc(tupleDesc);
+		slot->pre_detoasted_attrs = bitset_init(tupleDesc->natts);
 	}
+	else
+		slot->pre_detoasted_attrs = NULL;
 
 	/*
 	 * And allow slot type specific initialization.
@@ -1304,6 +1342,10 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		pfree(slot->tts_values);
 	if (slot->tts_isnull)
 		pfree(slot->tts_isnull);
+	if (slot->pre_detoasted_attrs)
+		bitset_free(slot->pre_detoasted_attrs);
+
+	slot->pre_detoasted_attrs = bitset_init(tupdesc->natts);
 
 	/*
 	 * Install the new descriptor; if it's refcounted, bump its refcount.
@@ -1810,12 +1852,26 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2392,82 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+/*
+ * cal_final_pre_detoast_attrs
+ *		Calculate the final attributes which pre-detoast be helpful.
+ *
+ * reference_attrs: the attributes which will be detoast at this plan level.
+ * due to the implementation issue, some non-toast attribute may be included
+ * which should be filtered out with tupleDesc.
+ *
+ * forbid_pre_detoast_vars: the vars which should not be pre-detoast as the
+ * small_tlist reason.
+ */
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+							TupleDesc tupleDesc,
+							List *forbid_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(reference_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * reference_attrs may have some attribute which is not toast attrs at
+	 * all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	final = bms_intersect(reference_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, forbid_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..a8646ded02 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,8 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 33161d812f..56c10c9e4d 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 5212f529c8..11a11028b8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -177,4 +177,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ca619eab94..69f5e72c43 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -314,7 +314,8 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan,
+													 List *recheck_list);
 
 /*
  * create_plan
@@ -346,6 +347,12 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	/*
+	 * After the plan tree is built completed, we start to walk for which
+	 * expressions should not used the shared-detoast feature.
+	 */
+	set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +385,89 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	set the toast_attrs according recheck_list.
+ *
+ * recheck_list = NIL means we need to do thing.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *recheck_list)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	/*
+	 * fast path, if we don't have a recheck_list, the var in targetlist is
+	 * impossible member of it. and this case might be a pretty common case.
+	 */
+	if (recheck_list == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(recheck_list, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
+
+/*
+ * set_plan_forbid_pre_detoast_vars_recurse
+ *		Walking the Plan tree to gather the vars which should be as small
+ * as possible, hence all of them should not using shared detoast datum
+ * feature.
+ */
+static void
+set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *recheck_list)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, recheck_list);
+
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * for the sort-like nodes, we want the subplan's output(only) as
+		 * small as possible, but the subplan's other expressions like Qual
+		 * doesn't have this restriction since they are not output to the
+		 * upper nodes. so we set the recheck_list to the subplan->targetlist.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, subplan_exprs);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *subplan_exprs = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * If the left_small_tlist wants a as small as possible tlist, set it
+		 * in a way like sort.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, subplan_exprs);
+		/* The righttree is a Hash node, it can be set with its own rule */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		/* just push down the forbid_pre_detoast_vars from its children. */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4884,6 +4974,8 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..d11177d688 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,27 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+typedef struct
+{
+	/* var is added into existing_attrs for the first time. */
+	Bitmapset  *existing_attrs;
+	/* following add to the final_ref_attrs. */
+	Bitmapset **final_ref_attrs;
+} intermediate_var_ref_context;
+
+typedef struct
+{
+	int			level_added;
+	int			level;
+} intermediate_level_context;
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +88,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +147,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +178,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +211,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +652,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +671,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +694,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +740,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +758,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +781,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,13 +804,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
@@ -757,12 +832,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +851,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +871,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +890,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +909,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +924,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +969,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1030,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1085,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1135,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1158,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1215,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1281,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1291,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1457,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1513,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1716,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1726,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1794,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2215,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context *ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context *ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context *ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context *level_ctx,
+					 intermediate_var_ref_context *ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2130,13 +2323,16 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2363,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2387,16 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2199,8 +2406,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2419,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2428,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2509,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2559,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2574,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2608,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2618,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3264,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3278,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3315,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3333,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3349,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3370,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3409,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3423,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3487,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3643,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a20c539a25..9a31e083dd 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -77,6 +77,11 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/* compute non-system Var value with shared-detoast-datum logic */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -826,5 +831,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..f660ebec28 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,11 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST step.
+	 */
+	Bitset	   *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -426,12 +432,38 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	if (bitset_is_empty(slot->pre_detoasted_attrs))
+		return;
+
+	attnum = -1;
+	/* free the memory used by pre-detoasted datum and reset the flags. */
+	while ((attnum = bitset_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	/*
+	 * bms_free each time cost too much, so just zero these bits and keep its
+	 * memory, just like what we did for TupleTableSlot. but.. see the
+	 * comments about the bms_zero.
+	 */
+	bitset_clear(slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -450,6 +482,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -494,6 +530,30 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	/* Assert this assumption the below code depends on. */
+	Assert(dstslot->tts_nvalid == 0 ||
+		   dstslot->tts_nvalid == srcslot->tts_nvalid);
+
+	if (dstslot->tts_nvalid == srcslot->tts_nvalid &&
+		!bitset_is_empty(srcslot->pre_detoasted_attrs))
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bitset_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bitset_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 561fdd98f1..72febbd37d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1474,6 +1474,7 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2003,6 +2004,8 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2764,4 +2767,5 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..79c442385a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash wants them as small as possible.
+	 * Its a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,13 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +803,14 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +891,11 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * Whether the left plan tree should use a SMALL_TLIST.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1615,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 50b06fd878..078d6e0728 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4043,6 +4043,8 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+intermediate_var_ref_context
+intermediate_level_context
 manifest_data
 manifest_writer
 rfile
-- 
2.34.1

#10Andy Fan
zhihuifan1213@163.com
In reply to: Andy Fan (#9)
2 attachment(s)
Re: Shared detoast Datum proposal

Hi,

I didn't another round of self-review. Comments, variable names, the
order of function definition are improved so that it can be read as
smooth as possible. so v6 attached.

--
Best Regards
Andy Fan

Attachments:

v6-0001-Introduce-a-Bitset-data-struct.patchtext/x-diffDownload
From f2e7772228e8a18027b9c29f10caba9c6570d934 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v6 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for variable-length of bits, Bitset is
designed for fixed-length of bits, the fixed length must be specified at
the bitset_init stage and keep unchanged at the whole lifespan. Because
of this, some operations on Bitset is simpler than Bitmapset.

The bitset_clear unsets all the bits but kept the allocated memory, this
capacity is impossible for bit Bitmapset for some solid reasons and this
is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c                 | 200 +++++++++++++++++-
 src/backend/nodes/outfuncs.c                  |  51 +++++
 src/include/nodes/bitmapset.h                 |  28 +++
 src/include/nodes/nodes.h                     |   4 +
 src/test/modules/test_misc/Makefile           |  11 +
 src/test/modules/test_misc/README             |   4 +-
 .../test_misc/expected/test_bitset.out        |   7 +
 src/test/modules/test_misc/meson.build        |  17 ++
 .../modules/test_misc/sql/test_bitset.sql     |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c        | 118 +++++++++++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list              |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 					  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset. the set will be round up to nwords;
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] &= ~((bitmapword) 1 << bitnum);
+}
+
+int
+bitset_is_member(int x, Bitset *a)
+{
+	int			wordnum,
+				bitnum;
+
+	/* used in expression engine */
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	if (a == NULL)
+		return false;
+
+	if (wordnum >= a->nwords)
+		return false;
+
+	return (a->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+int
+bitset_next_member(const Bitset *a, int prevbit)
+{
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
+
+/*
+ * bitset_to_bitmap - build a legal bitmapset from bitset.
+ */
+Bitmapset *
+bitset_to_bitmap(Bitset *a)
+{
+	int			n;
+
+	bool		found = false;	/* any non-empty bits */
+	Bitmapset  *result;
+	int			i;
+
+	if (a == NULL)
+		return NULL;
+
+	n = a->nwords - 1;
+	do
+	{
+		if (a->words[n] > 0)
+		{
+			found = true;
+			break;
+		}
+	} while (--n >= 0);
+
+	if (!found)
+		return NULL;
+
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(n + 1));
+	result->type = T_Bitmapset;
+	result->nwords = n + 1;
+
+	Assert(result->nwords <= a->nwords);
+
+	i = 0;
+	do
+	{
+		result->words[i] = a->words[i];
+	} while (++i < result->nwords);
+
+	return result;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 25171864db..e144b62418 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -331,6 +331,43 @@ outBitmapset(StringInfo str, const Bitmapset *bms)
 	appendStringInfoChar(str, ')');
 }
 
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+static void
+outBitsetInternal(struct StringInfoData *str,
+				  const struct Bitset *bs,
+				  bool asBitmap)
+{
+	int			x;
+
+	appendStringInfoChar(str, '(');
+	if (asBitmap)
+		appendStringInfoChar(str, 'b');
+	else
+		appendStringInfo(str, "bs");
+	x = -1;
+	while ((x = bitset_next_member(bs, x)) >= 0)
+		appendStringInfo(str, " %d", x);
+	appendStringInfoChar(str, ')');
+}
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+void
+outBitset(struct StringInfoData *str,
+		  const struct Bitset *bs)
+{
+	outBitsetInternal(str, bs, false);
+}
+
+
 /*
  * Print the value of a Datum given its type.
  */
@@ -911,3 +948,17 @@ bmsToString(const Bitmapset *bms)
 	outBitmapset(&str, bms);
 	return str.data;
 }
+
+/*
+ * bitsetToString -
+ *	   similar to bmsToString, but for Bitset
+ */
+char *
+bitsetToString(const struct Bitset *bs, bool asBitmap)
+{
+	StringInfoData str;
+
+	initStringInfo(&str);
+	outBitsetInternal(&str, bs, asBitmap);
+	return str.data;
+}
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 906e8dcc15..95ff37c6e9 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -55,6 +55,24 @@ typedef struct Bitmapset
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
 } Bitmapset;
 
+/*
+ * While Bitmapset is designed for variable-length of bits, Bitset is
+ * designed for fixed-length of bits, the fixed length must be specified at
+ * the bitset_init stage and keep unchanged at the whole lifespan. Because
+ * of this, some operations on Bitset is simpler than Bitmapset.
+ *
+ * The bitset_clear unsets all the bits but kept the allocated memory, this
+ * capacity is impossible for bit Bitmapset for some solid reasons.
+ *
+ * Also for performance aspect, the functions for Bitset removed some
+ * unlikely checks, instead with some Asserts.
+ */
+
+typedef struct Bitset
+{
+	int			nwords;			/* number of words in array */
+	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
+} Bitset;
 
 /* result of bms_subset_compare */
 typedef enum
@@ -124,4 +142,14 @@ extern uint32 bms_hash_value(const Bitmapset *a);
 extern uint32 bitmap_hash(const void *key, Size keysize);
 extern int	bitmap_match(const void *key1, const void *key2, Size keysize);
 
+extern Bitset *bitset_init(size_t size);
+extern void bitset_clear(Bitset *a);
+extern void bitset_free(Bitset *a);
+extern bool bitset_is_empty(Bitset *a);
+extern Bitset *bitset_copy(Bitset *a);
+extern void bitset_add_member(Bitset *a, int x);
+extern void bitset_del_member(Bitset *a, int x);
+extern int	bitset_is_member(int bit, Bitset *a);
+extern int	bitset_next_member(const Bitset *a, int prevbit);
+extern Bitmapset *bitset_to_bitmap(Bitset *a);
 #endif							/* BITMAPSET_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2969dd831b..4d13107990 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -186,16 +186,20 @@ castNodeImpl(NodeTag type, void *ptr)
  * nodes/{outfuncs.c,print.c}
  */
 struct Bitmapset;				/* not to include bitmapset.h here */
+struct Bitset;					/* not to include bitmapset.h here */
 struct StringInfoData;			/* not to include stringinfo.h here */
 
 extern void outNode(struct StringInfoData *str, const void *obj);
 extern void outToken(struct StringInfoData *str, const char *s);
 extern void outBitmapset(struct StringInfoData *str,
 						 const struct Bitmapset *bms);
+extern void outBitset(struct StringInfoData *str, const struct Bitset *bs);
+
 extern void outDatum(struct StringInfoData *str, uintptr_t value,
 					 int typlen, bool typbyval);
 extern char *nodeToString(const void *obj);
 extern char *bmsToString(const struct Bitmapset *bms);
+extern char *bitsetToString(const struct Bitset *bs, bool asBitmap);
 
 /*
  * nodes/{readfuncs.c,read.c}
diff --git a/src/test/modules/test_misc/Makefile b/src/test/modules/test_misc/Makefile
index 39c6c2014a..af96604096 100644
--- a/src/test/modules/test_misc/Makefile
+++ b/src/test/modules/test_misc/Makefile
@@ -2,6 +2,17 @@
 
 TAP_TESTS = 1
 
+MODULE_big = test_misc
+OBJS = \
+	$(WIN32RES) \
+	test_misc.o
+PGFILEDESC = "test_misc"
+
+EXTENSION = test_misc
+DATA = test_misc--1.0.sql
+
+REGRESS = test_bitset
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/test_misc/README b/src/test/modules/test_misc/README
index 4876733fa2..ec426c4ad5 100644
--- a/src/test/modules/test_misc/README
+++ b/src/test/modules/test_misc/README
@@ -1,4 +1,2 @@
-This directory doesn't actually contain any extension module.
-
-What it is is a home for otherwise-unclassified TAP tests that exercise core
+What it is is a home for otherwise-unclassified tests that exercise core
 server features.  We might equally well have called it, say, src/test/misc.
diff --git a/src/test/modules/test_misc/expected/test_bitset.out b/src/test/modules/test_misc/expected/test_bitset.out
new file mode 100644
index 0000000000..3d0302d30d
--- /dev/null
+++ b/src/test/modules/test_misc/expected/test_bitset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_misc;
+SELECT test_bitset();
+ test_bitset 
+-------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 964d95db26..a23f3e3f47 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -1,5 +1,22 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+test_misc_sources = files(
+    'test_misc.c',
+)
+
+if host_system == 'windows'
+  test_misc_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_misc',
+    '--FILEDESC', 'test_misc - ',])
+endif
+
+test_misc = shared_module('test_misc',
+  test_misc_sources,
+  kwargs: pg_test_mod_args,
+)
+
+test_install_libs += test_misc
+
 tests += {
   'name': 'test_misc',
   'sd': meson.current_source_dir(),
diff --git a/src/test/modules/test_misc/sql/test_bitset.sql b/src/test/modules/test_misc/sql/test_bitset.sql
new file mode 100644
index 0000000000..0f73bbf532
--- /dev/null
+++ b/src/test/modules/test_misc/sql/test_bitset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_misc;
+
+SELECT test_bitset();
diff --git a/src/test/modules/test_misc/test_misc--1.0.sql b/src/test/modules/test_misc/test_misc--1.0.sql
new file mode 100644
index 0000000000..79afaa6263
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc--1.0.sql
@@ -0,0 +1,5 @@
+\echo Use "CREATE EXTENSION test_misc" to load this file. \quit
+
+CREATE FUNCTION test_bitset()
+       RETURNS pg_catalog.void
+       AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_misc/test_misc.c b/src/test/modules/test_misc/test_misc.c
new file mode 100644
index 0000000000..70d0255ada
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.c
@@ -0,0 +1,118 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_misc.c
+ *
+ * Copyright (c) 2022-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_dsa/test_misc.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "fmgr.h"
+#include "nodes/bitmapset.h"
+#include "nodes/nodes.h"
+#define BIT_ADD 0
+#define BIT_DEL 1
+#ifdef USE_ASSERT_CHECKING
+static void compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op);
+#endif
+PG_MODULE_MAGIC;
+/* Test basic DSA functionality */
+PG_FUNCTION_INFO_V1(test_bitset);
+Datum
+test_bitset(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ASSERT_CHECKING
+	Bitset	   *bs;
+	Bitset	   *bs2;
+	char	   *str1,
+			   *str2,
+			   *empty_str;
+	Bitmapset  *bms = NULL;
+	int			i;
+
+	empty_str = bmsToString(NULL);
+	/* size = 0 */
+	bs = bitset_init(0);
+	Assert(bs == NULL);
+	bitset_clear(bs);
+	Assert(bitset_is_empty(bs));
+	/* bitset_add_member(bs, 0); // crash. */
+	/* bitset_del_member(bs, 0); // crash. */
+	Assert(!bitset_is_member(0, bs));
+	Assert(bitset_next_member(bs, -1) == -2);
+	bs2 = bitset_copy(bs);
+	Assert(bs2 == NULL);
+	bitset_free(bs);
+	bitset_free(bs2);
+	/* size == 68, nword == 2 */
+	bs = bitset_init(68);
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_ADD);
+	}
+	Assert(!bitset_is_empty(bs));
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_DEL);
+	}
+	Assert(bitset_is_empty(bs));
+	bitset_clear(bs);
+	str1 = bitsetToString(bs, true);
+	Assert(strcmp(str1, empty_str) == 0);
+	bms = bitset_to_bitmap(bs);
+	str2 = bmsToString(bms);
+	Assert(strcmp(str1, str2) == 0);
+	bms = bitset_to_bitmap(NULL);
+	Assert(strcmp(bmsToString(bms), empty_str) == 0);
+	bitset_free(bs);
+#endif
+	PG_RETURN_VOID();
+}
+#ifdef USE_ASSERT_CHECKING
+static void
+compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op)
+{
+	char	   *str1,
+			   *str2,
+			   *str3,
+			   *str4;
+	Bitmapset  *bms3;
+	Bitset	   *bs4;
+
+	if (op == BIT_ADD)
+	{
+		*bms = bms_add_member(*bms, member);
+		bitset_add_member(bs, member);
+		Assert(bms_is_member(member, *bms));
+		Assert(bitset_is_member(member, bs));
+	}
+	else if (op == BIT_DEL)
+	{
+		*bms = bms_del_member(*bms, member);
+		bitset_del_member(bs, member);
+		Assert(!bms_is_member(member, *bms));
+		Assert(!bitset_is_member(member, bs));
+	}
+	else
+		Assert(false);
+	/* compare the rest existing bit */
+	str1 = bmsToString(*bms);
+	str2 = bitsetToString(bs, true);
+	Assert(strcmp(str1, str2) == 0);
+	/* test bitset_to_bitmap */
+	bms3 = bitset_to_bitmap(bs);
+	str3 = bmsToString(bms3);
+	Assert(strcmp(str3, str2) == 0);
+	/* test bitset_copy */
+	bs4 = bitset_copy(bs);
+	str4 = bitsetToString(bs4, true);
+	Assert(strcmp(str3, str4) == 0);
+	pfree(str1);
+	pfree(str2);
+	pfree(str3);
+	pfree(str4);
+}
+#endif
diff --git a/src/test/modules/test_misc/test_misc.control b/src/test/modules/test_misc/test_misc.control
new file mode 100644
index 0000000000..48fd08758f
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.control
@@ -0,0 +1,4 @@
+comment = 'Test misc'
+default_version = '1.0'
+module_pathname = '$libdir/test_misc'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d808aad8b0..dcba329ee4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -268,6 +268,7 @@ BitmapOr
 BitmapOrPath
 BitmapOrState
 Bitmapset
+Bitset
 Block
 BlockId
 BlockIdData
-- 
2.34.1

v6-0002-Shared-detoast-feature.patchtext/x-diffDownload
From 1b169cf256efe94cea48ef20fa0bd2bffa9e2368 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 20 Feb 2024 14:16:10 +0800
Subject: [PATCH v6 2/2] Shared detoast feature.

details at https://postgr.es/m/87il4jrk1l.fsf%40163.com
---
 src/backend/executor/execExpr.c         |  35 +-
 src/backend/executor/execExprInterp.c   | 180 ++++++++
 src/backend/executor/execTuples.c       | 134 ++++++
 src/backend/executor/execUtils.c        |   2 +
 src/backend/executor/nodeHashjoin.c     |   2 +
 src/backend/executor/nodeMergejoin.c    |   2 +
 src/backend/executor/nodeNestloop.c     |   1 +
 src/backend/jit/llvm/llvmjit_expr.c     |  26 +-
 src/backend/jit/llvm/llvmjit_types.c    |   1 +
 src/backend/optimizer/plan/createplan.c | 103 ++++-
 src/backend/optimizer/plan/setrefs.c    | 550 +++++++++++++++++++-----
 src/include/executor/execExpr.h         |  12 +
 src/include/executor/tuptable.h         |  60 +++
 src/include/nodes/execnodes.h           |  14 +
 src/include/nodes/plannodes.h           |  53 +++
 src/tools/pgindent/typedefs.list        |   2 +
 16 files changed, 1071 insertions(+), 106 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 3181b1136a..779fcfaab1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,51 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3f20f1dd31..6472fa463a 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -158,6 +159,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -166,6 +170,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -181,6 +188,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		if (oldDatum != slot->tts_values[attnum])
+			bitset_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -296,6 +340,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -346,6 +408,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -413,6 +490,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -597,6 +677,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2137,6 +2236,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2275,6 +2410,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..830e1d4ea6 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+											  TupleDesc tupleDesc,
+											  List *forbid_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bitset_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -392,6 +399,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated invalidated since tts_nvalid is set to 0, so
+	 * let's free the pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -457,6 +471,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -567,6 +584,9 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -637,6 +657,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -771,6 +794,9 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -904,6 +930,9 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1150,7 +1179,10 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 			 + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
 
 		PinTupleDesc(tupleDesc);
+		slot->pre_detoasted_attrs = bitset_init(tupleDesc->natts);
 	}
+	else
+		slot->pre_detoasted_attrs = NULL;
 
 	/*
 	 * And allow slot type specific initialization.
@@ -1288,6 +1320,8 @@ void
 ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 					  TupleDesc tupdesc)	/* new tuple descriptor */
 {
+	MemoryContext old;
+
 	Assert(!TTS_FIXED(slot));
 
 	/* For safety, make sure slot is empty before changing it */
@@ -1304,6 +1338,8 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		pfree(slot->tts_values);
 	if (slot->tts_isnull)
 		pfree(slot->tts_isnull);
+	if (slot->pre_detoasted_attrs)
+		bitset_free(slot->pre_detoasted_attrs);
 
 	/*
 	 * Install the new descriptor; if it's refcounted, bump its refcount.
@@ -1319,6 +1355,10 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(Datum));
 	slot->tts_isnull = (bool *)
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(bool));
+
+	old = MemoryContextSwitchTo(slot->tts_mcxt);
+	slot->pre_detoasted_attrs = bitset_init(tupdesc->natts);
+	MemoryContextSwitchTo(old);
 }
 
 /* --------------------------------
@@ -1810,12 +1850,26 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2390,83 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+/*
+ * cal_final_pre_detoast_attrs
+ *		Calculate the final attributes which pre-detoast be helpful.
+ *
+ * reference_attrs: the attributes which will be detoast at this plan level.
+ * due to the implementation issue, some non-toast attribute may be included
+ * which should be filtered out with tupleDesc.
+ *
+ * forbid_pre_detoast_vars: the vars which should not be pre-detoast as the
+ * small_tlist reason.
+ */
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+							TupleDesc tupleDesc,
+							List *forbid_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(reference_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * reference_attrs may have some attribute which is not toast attrs at
+	 * all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	/* Filter out the non-toastable attributes. */
+	final = bms_intersect(reference_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, forbid_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..a8646ded02 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,8 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0c448422e2..74563c3454 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 47c9daf402..1dcf0c2fd8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -178,4 +178,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 610f4a56d6..7ac933436c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -314,7 +314,9 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan,
+													 List *small_tlist);
+static void set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist);
 
 /*
  * create_plan
@@ -346,6 +348,12 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	/*
+	 * After the plan tree is built completed, we start to walk for which
+	 * expressions should not used the shared-detoast feature.
+	 */
+	set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +386,97 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_forbid_pre_detoast_vars_recurse
+ *		Walking the Plan tree to gather the vars which should be as small
+ * as possible, hence all of them should be used for pre detoast datum logic.
+ */
+static void
+set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, small_tlist);
+
+	/* Recurse to its subplan.. */
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * For the sort-like nodes, we want the output of its subplan as small
+		 * as possible, but the subplan's other expressions like Qual doesn't
+		 * have this restriction since they are not output to the upper nodes.
+		 * so we set the small_tlist to the subplan->targetlist.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * If the left_small_tlist wants a as small as possible tlist, set it
+		 * in a way like sort for the left node.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+
+		/*
+		 * The righttree is a Hash node, it can be set with its own rule, so
+		 * the small_tlist provided is not important, we just need to recuse
+		 * to its subplan.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		/*
+		 * Recurse to its children, just push down the forbid_pre_detoast_vars
+		 * to its children.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	Set the Plan.forbid_pre_detoast_vars according the small_tlist information.
+ *
+ * small_tlist = NIL means nothing is forbidden, or else if a Var belongs to the
+ * small_tlist, then it must not be pre-detoasted.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	/*
+	 * fast path, if we don't have a small_tlist, the var in targetlist is
+	 * impossible member of it. and this case might be a pretty common case.
+	 */
+	if (small_tlist == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(small_tlist, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4893,6 +4992,8 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..1847c8eada 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,47 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+/*
+ * Decide which attrs are detoasted in a expressions level, this is judged
+ * at the fix_scan/join_expr stage. The recursed level is tracked when we
+ * walk to a Var, if the level is greater than 1, then it means the
+ * var needs an detoast in this expression list, there are some exceptions
+ * here, see increase_level_for_pre_detoast for details.
+ */
+typedef struct
+{
+	/* if the level is added during a certain walk. */
+	bool		level_added;
+	/* the current level during the walk. */
+	int			level;
+} intermediate_level_context;
+
+/*
+ * Context to hold the detoast attribute for a expression.
+ *
+ * XXX: this design was intent to avoid the pre-detoast-logic if the var
+ * only need to be detoasted *once*, but for now, the context is only
+ * maintain at the expression level. So the 'first/2+' times logic
+ * doesn't work really. Recording the times of a Var is accessed in the
+ * plan tree level is complex, so before we decide it is a must, I am not
+ * willing to do too many changes here.
+ */
+typedef struct
+{
+	/* var is accessed for the first time. */
+	Bitmapset  *existing_attrs;
+	/* var is accessed for the 2+ times. */
+	Bitmapset **final_ref_attrs;
+} intermediate_var_ref_context;
+
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +108,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +167,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +198,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +231,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +672,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +691,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +714,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +760,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +778,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +801,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,13 +824,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
@@ -757,12 +852,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +871,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +891,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +910,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +929,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +944,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +989,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1050,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1105,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1155,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1178,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1235,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1301,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1311,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1477,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1533,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1736,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1746,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1814,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2235,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context *ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context *ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context *ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context *level_ctx,
+					 intermediate_var_ref_context *ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2125,18 +2338,23 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
  * 'num_exec': estimated number of executions of expression
+ * 'scan_reference_attrs': gather which vars are potential to run the detoast
+ * 	on this expr, NULL means the caller doesn't have interests on this.
  *
  * The expression tree is either copied-and-modified, or modified in-place
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2385,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2409,16 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2199,8 +2428,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2441,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2450,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2531,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2581,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2596,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2630,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2640,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3286,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3300,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3337,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3355,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3371,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3392,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3431,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3445,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3509,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3665,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a28ddcdd77..9304786bb2 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,17 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/*
+	 * compute non-system Var value with shared-detoast-datum logic, use some
+	 * dedicated steps rather than add extra logic to existing steps is for
+	 * performance aspect, within this way, we just decide if the extra logic
+	 * is needed at ExecInitExpr stage once rather than every time of
+	 * ExecInterpExpr.
+	 */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -830,5 +841,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..2e04bb028e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,18 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes whose values are the detoasted version in tts_values[*],
+	 * if so these memory needs some extra clearup. These memory can't be put
+	 * into ecxt_per_tuple_memory since many of them needs a longer life span,
+	 * for example the Datum in outer join. These memory is put into
+	 * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
+	 * invalidated.
+	 *
+	 * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
+	 */
+	Bitset	   *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -426,12 +439,36 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * ExecFreePreDetoastDatum - free the memory which is allocated in pre-detoast-datum.
+ */
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	attnum = -1;
+	while ((attnum = bitset_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	/*
+	 * unset the bits but keep the memory for later use, this is importance
+	 * for at the performance aspect.
+	 */
+	bitset_clear(slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -450,6 +487,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -494,6 +535,25 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	if (dstslot->tts_nvalid > 0 && srcslot->tts_nvalid > 0)
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bitset_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bitset_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..30fdb37d1c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1481,6 +1481,12 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the Scan nodes.
+	 */
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2010,6 +2016,13 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the join nodes.
+	 */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2771,4 +2784,5 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..ea5033aaa0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash wants them as small as possible.
+	 * It's a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,16 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +806,17 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +897,11 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * Whether the left plan tree should use a SMALL_TLIST.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1621,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dcba329ee4..308f19d61a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4047,6 +4047,8 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+intermediate_var_ref_context
+intermediate_level_context
 manifest_data
 manifest_writer
 rfile
-- 
2.34.1

#11Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#10)
Re: Shared detoast Datum proposal

Hi,

I took a quick look on this thread/patch today, so let me share a couple
initial thoughts. I may not have a particularly coherent/consistent
opinion on the patch or what would be a better way to do this yet, but
perhaps it'll start a discussion ...

The goal of the patch (as I understand it) is essentially to cache
detoasted values, so that the value does not need to be detoasted
repeatedly in different parts of the plan. I think that's perfectly
sensible and worthwhile goal - detoasting is not cheap, and complex
plans may easily spend a lot of time on it.

That being said, the approach seems somewhat invasive, and touching
parts I wouldn't have expected to need a change to implement this. For
example, I certainly would not have guessed the patch to need changes in
createplan.c or setrefs.c.

Perhaps it really needs to do these things, but neither the thread nor
the comments are very enlightening as for why it's needed :-( In many
cases I can guess, but I'm not sure my guess is correct. And comments in
code generally describe what's happening locally / next line, not the
bigger picture & why it's happening.

IIUC we walk the plan to decide which Vars should be detoasted (and
cached) once, and which cases should not do that because it'd inflate
the amount of data we need to keep in a Sort, Hash etc. Not sure if
there's a better way to do this - it depends on what happens in the
upper parts of the plan, so we can't decide while building the paths.

But maybe we could decide this while transforming the paths into a plan?
(I realize the JIT thread nearby needs to do something like that in
create_plan, and in that one I suggested maybe walking the plan would be
a better approach, so I may be contradicting myself a little bit.).

In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe
it's somewhere in this thread, but that's not great for reviewers.

Similar for the setrefs.c changes. It seems a bit suspicious to piggy
back the new code into fix_scan_expr/fix_scan_list and similar code.
Those functions have a pretty clearly defined purpose, not sure we want
to also extend them to also deal with this new thing. (FWIW I'd 100%%
did it this way if I hacked on a PoC of this, to make it work. But I'm
not sure it's the right solution.)

I don't know what to thing about the Bitset - maybe it's necessary, but
how would I know? I don't have any way to measure the benefits, because
the 0002 patch uses it right away. I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.

On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?

The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.

FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly. Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration. But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.

For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#11)
2 attachment(s)
Re: Shared detoast Datum proposal

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi Tomas,

I took a quick look on this thread/patch today, so let me share a couple
initial thoughts. I may not have a particularly coherent/consistent
opinion on the patch or what would be a better way to do this yet, but
perhaps it'll start a discussion ...

Thank you for this!

The goal of the patch (as I understand it) is essentially to cache
detoasted values, so that the value does not need to be detoasted
repeatedly in different parts of the plan. I think that's perfectly
sensible and worthwhile goal - detoasting is not cheap, and complex
plans may easily spend a lot of time on it.

exactly.

That being said, the approach seems somewhat invasive, and touching
parts I wouldn't have expected to need a change to implement this. For
example, I certainly would not have guessed the patch to need changes in
createplan.c or setrefs.c.

Perhaps it really needs to do these things, but neither the thread nor
the comments are very enlightening as for why it's needed :-( In many
cases I can guess, but I'm not sure my guess is correct. And comments in
code generally describe what's happening locally / next line, not the
bigger picture & why it's happening.

there were explaination at [1]/messages/by-id/87il4jrk1l.fsf@163.com, but it probably is too high level.
Writing a proper comments is challenging for me, but I am pretty happy
to try more. At the end of this writing, I explained the data workflow,
I am feeling that would be useful for reviewers.

IIUC we walk the plan to decide which Vars should be detoasted (and
cached) once, and which cases should not do that because it'd inflate
the amount of data we need to keep in a Sort, Hash etc.

Exactly.

Not sure if
there's a better way to do this - it depends on what happens in the
upper parts of the plan, so we can't decide while building the paths.

I'd say I did this intentionally. Deciding such things in paths will be
more expensive than create_plan stage IMO.

But maybe we could decide this while transforming the paths into a plan?
(I realize the JIT thread nearby needs to do something like that in
create_plan, and in that one I suggested maybe walking the plan would be
a better approach, so I may be contradicting myself a little bit.).

I think that's pretty similar what I'm doing now. Just that I did it
*just after* the create_plan. This is because the create_plan doesn't
transform the path to plan in the top->down manner all the time, the
known exception is create_mergejoin_plan. so I have to walk just after
the create_plan is
done.

In the create_mergejoin_plan, the Sort node is created *after* the
subplan for the Sort is created.

/* Recursively process the path tree, demanding the correct tlist result */
plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);

+ /*
+  * After the plan tree is built completed, we start to walk for which
+  * expressions should not used the shared-detoast feature.
+  */
+ set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);

In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe
it's somewhere in this thread, but that's not great for reviewers.

a lession learnt, thanks.

a revisted version of comments from the lastest patch.

/*
* set_plan_forbid_pre_detoast_vars_recurse
* Walking the Plan tree in the top-down manner to gather the vars which
* should be as small as possible and record them in Plan.forbid_pre_detoast_vars
*
* plan: the plan node to walk right now.
* small_tlist: a list of nodes which its subplan should provide them as
* small as possible.
*/
static void
set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)

Similar for the setrefs.c changes. It seems a bit suspicious to piggy
back the new code into fix_scan_expr/fix_scan_list and similar code.
Those functions have a pretty clearly defined purpose, not sure we want
to also extend them to also deal with this new thing. (FWIW I'd 100%%
did it this way if I hacked on a PoC of this, to make it work. But I'm
not sure it's the right solution.)

The main reason of doing so is because I want to share the same walk
effort as fix_scan_expr. otherwise I have to walk the plan for
every expression again. I thought this as a best practice in the past
and thought we can treat the pre_detoast_attrs as a valuable side
effects:(

I don't know what to thing about the Bitset - maybe it's necessary, but
how would I know? I don't have any way to measure the benefits, because
the 0002 patch uses it right away.

a revisted version of comments from the latest patch. graph 2 explains
this decision.

/*
* The attributes whose values are the detoasted version in tts_values[*],
* if so these memory needs some extra clean-up. These memory can't be put
* into ecxt_per_tuple_memory since many of them needs a longer life span,
* for example the Datum in outer join. These memory is put into
* TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
* invalidated.
*
* Bitset rather than Bitmapset is chosen here because when all the members
* of Bitmapset are deleted, the allocated memory will be deallocated
* automatically, which is too expensive in this case since we need to
* deleted all the members in each ExecClearTuple and repopulate it again
* when fill the detoast datum to tts_values[*]. This situation will be run
* again and again in an execution cycle.
*
* These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
*/
Bitset *pre_detoasted_attrs;

I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.

Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
indicate it is too expensive. and after talk with David at [2]/messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com, I
introduced bitset and use it here. the test case I used comes from [1]/messages/by-id/87il4jrk1l.fsf@163.com.
IRCC, there were 5% performance difference because of this.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;

To reproduce the difference, we can replace the bitset_clear() with

bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);

in ExecFreePreDetoastDatum. then it works same as Bitmapset.

On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?

The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.

about the memory usage, first it is kept as the same lifesplan as the
tts_values[*] which can be released pretty quickly, only if the certain
values of the tuples is not needed. it is true that we keep the detoast
version longer than before, but that's something we have to pay I
think.

Leaks may happen since tts_mcxt is reset at the end of *executor*. So if
we forget to release the memory when the tts_values[*] is invalidated
somehow, the memory will be leaked until the end of executor. I think
that will be enough to cause an issue. Currently besides I release such
memory at the ExecClearTuple, I also relase such memory whenever we set
tts_nvalid to 0, the theory used here is:

/*
* tts_values is treated invalidated since tts_nvalid is set to 0, so
* let's free the pre-detoast datum.
*/
ExecFreePreDetoastDatum(slot);

I will do more test on the memory leak stuff, since there are so many
operation aginst slot like ExecCopySlot etc, I don't know how to test it
fully. the method in my mind now is use TPCH with 10GB data size, and
monitor the query runtime memory usage.

FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly.

Could you provide some keyword or function names for the expanded datum
here, I probably miss this.

Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration.

hmm, acutally the "shared detoast value" just live in the
TupleTableSlot->tts_values[*], rather than the whole query duration. The
simple case is:

SELECT * FROM t WHERE a_text LIKE 'abc%';

when we scan to the next tuple, the detoast value for the previous tuple
will be relased.

But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.

maybe. currently I just use detoast_attr to get the desired version. I'm
pleasure if we have more effective way.

if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
bitset_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}

For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...

Hmm, I am not sure I understand you correctly at this part. In the
current patch, to avoid the run-time (ExecExprInterp) check if we
should detoast and save the datum, I defined 3 extra steps so that
the *extra check itself* is not needed for unnecessary attributes.
for example an datum for int or a detoast datum should not be saved back
to tts_values[*] due to the small_tlist reason. However these steps can
be generated is based on the output of createplan/setrefs changes. take
the INNER_VAR for example:

In ExecInitExprRec:

switch (variable->varno)
{
case INNER_VAR:
if (is_join_plan(plan) &&
bms_is_member(attnum,
((JoinState *) state->parent)->inner_pre_detoast_attrs))
{
scratch.opcode = EEOP_INNER_VAR_TOAST;
}
else
{
scratch.opcode = EEOP_INNER_VAR;
}
}

The data workflow is:

1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c)
decides which Vars should *not* be pre_detoasted because of small_tlist
reason and record it in Plan.forbid_pre_detoast_vars.

2. fix_scan_expr (in the setrefs.c) tracks which Vars should be
detoasted for the specific plan node and record them in it. Currently
only Scan and Join nodes support this feature.

typedef struct Scan
{
...
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *reference_attrs;
} Scan;

typedef struct Join
{
..
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;

3. during the InitPlan stage, we maintain the
PlanState.xxx_pre_detoast_attrs and generated different StepOp for them.

4. At the ExecExprInterp stage, only the new StepOp do the extra check
to see if the detoast should happen. Other steps doesn't need this
check at all.

If we avoid the createplan/setref.c changes, probabaly some unrelated
StepOp needs the extra check as well?

When I worked with the UniqueKey feature, I maintained a
UniqueKey.README to summaried all the dicussed topics in threads, the
README is designed to save the effort for more reviewer, I think I
should apply the same logic for this feature.

Thank you very much for your feedback!

v7 attached, just some comments and Assert changes.

[1]: /messages/by-id/87il4jrk1l.fsf@163.com
[2]: /messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com
/messages/by-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ+ZYA@mail.gmail.com

--
Best Regards
Andy Fan

Attachments:

v7-0001-Introduce-a-Bitset-data-struct.patchtext/x-diffDownload
From f2e7772228e8a18027b9c29f10caba9c6570d934 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v7 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for variable-length of bits, Bitset is
designed for fixed-length of bits, the fixed length must be specified at
the bitset_init stage and keep unchanged at the whole lifespan. Because
of this, some operations on Bitset is simpler than Bitmapset.

The bitset_clear unsets all the bits but kept the allocated memory, this
capacity is impossible for bit Bitmapset for some solid reasons and this
is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c                 | 200 +++++++++++++++++-
 src/backend/nodes/outfuncs.c                  |  51 +++++
 src/include/nodes/bitmapset.h                 |  28 +++
 src/include/nodes/nodes.h                     |   4 +
 src/test/modules/test_misc/Makefile           |  11 +
 src/test/modules/test_misc/README             |   4 +-
 .../test_misc/expected/test_bitset.out        |   7 +
 src/test/modules/test_misc/meson.build        |  17 ++
 .../modules/test_misc/sql/test_bitset.sql     |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c        | 118 +++++++++++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list              |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 					  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset. the set will be round up to nwords;
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] &= ~((bitmapword) 1 << bitnum);
+}
+
+int
+bitset_is_member(int x, Bitset *a)
+{
+	int			wordnum,
+				bitnum;
+
+	/* used in expression engine */
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	if (a == NULL)
+		return false;
+
+	if (wordnum >= a->nwords)
+		return false;
+
+	return (a->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+int
+bitset_next_member(const Bitset *a, int prevbit)
+{
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
+
+/*
+ * bitset_to_bitmap - build a legal bitmapset from bitset.
+ */
+Bitmapset *
+bitset_to_bitmap(Bitset *a)
+{
+	int			n;
+
+	bool		found = false;	/* any non-empty bits */
+	Bitmapset  *result;
+	int			i;
+
+	if (a == NULL)
+		return NULL;
+
+	n = a->nwords - 1;
+	do
+	{
+		if (a->words[n] > 0)
+		{
+			found = true;
+			break;
+		}
+	} while (--n >= 0);
+
+	if (!found)
+		return NULL;
+
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(n + 1));
+	result->type = T_Bitmapset;
+	result->nwords = n + 1;
+
+	Assert(result->nwords <= a->nwords);
+
+	i = 0;
+	do
+	{
+		result->words[i] = a->words[i];
+	} while (++i < result->nwords);
+
+	return result;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 25171864db..e144b62418 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -331,6 +331,43 @@ outBitmapset(StringInfo str, const Bitmapset *bms)
 	appendStringInfoChar(str, ')');
 }
 
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+static void
+outBitsetInternal(struct StringInfoData *str,
+				  const struct Bitset *bs,
+				  bool asBitmap)
+{
+	int			x;
+
+	appendStringInfoChar(str, '(');
+	if (asBitmap)
+		appendStringInfoChar(str, 'b');
+	else
+		appendStringInfo(str, "bs");
+	x = -1;
+	while ((x = bitset_next_member(bs, x)) >= 0)
+		appendStringInfo(str, " %d", x);
+	appendStringInfoChar(str, ')');
+}
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+void
+outBitset(struct StringInfoData *str,
+		  const struct Bitset *bs)
+{
+	outBitsetInternal(str, bs, false);
+}
+
+
 /*
  * Print the value of a Datum given its type.
  */
@@ -911,3 +948,17 @@ bmsToString(const Bitmapset *bms)
 	outBitmapset(&str, bms);
 	return str.data;
 }
+
+/*
+ * bitsetToString -
+ *	   similar to bmsToString, but for Bitset
+ */
+char *
+bitsetToString(const struct Bitset *bs, bool asBitmap)
+{
+	StringInfoData str;
+
+	initStringInfo(&str);
+	outBitsetInternal(&str, bs, asBitmap);
+	return str.data;
+}
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 906e8dcc15..95ff37c6e9 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -55,6 +55,24 @@ typedef struct Bitmapset
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
 } Bitmapset;
 
+/*
+ * While Bitmapset is designed for variable-length of bits, Bitset is
+ * designed for fixed-length of bits, the fixed length must be specified at
+ * the bitset_init stage and keep unchanged at the whole lifespan. Because
+ * of this, some operations on Bitset is simpler than Bitmapset.
+ *
+ * The bitset_clear unsets all the bits but kept the allocated memory, this
+ * capacity is impossible for bit Bitmapset for some solid reasons.
+ *
+ * Also for performance aspect, the functions for Bitset removed some
+ * unlikely checks, instead with some Asserts.
+ */
+
+typedef struct Bitset
+{
+	int			nwords;			/* number of words in array */
+	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
+} Bitset;
 
 /* result of bms_subset_compare */
 typedef enum
@@ -124,4 +142,14 @@ extern uint32 bms_hash_value(const Bitmapset *a);
 extern uint32 bitmap_hash(const void *key, Size keysize);
 extern int	bitmap_match(const void *key1, const void *key2, Size keysize);
 
+extern Bitset *bitset_init(size_t size);
+extern void bitset_clear(Bitset *a);
+extern void bitset_free(Bitset *a);
+extern bool bitset_is_empty(Bitset *a);
+extern Bitset *bitset_copy(Bitset *a);
+extern void bitset_add_member(Bitset *a, int x);
+extern void bitset_del_member(Bitset *a, int x);
+extern int	bitset_is_member(int bit, Bitset *a);
+extern int	bitset_next_member(const Bitset *a, int prevbit);
+extern Bitmapset *bitset_to_bitmap(Bitset *a);
 #endif							/* BITMAPSET_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2969dd831b..4d13107990 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -186,16 +186,20 @@ castNodeImpl(NodeTag type, void *ptr)
  * nodes/{outfuncs.c,print.c}
  */
 struct Bitmapset;				/* not to include bitmapset.h here */
+struct Bitset;					/* not to include bitmapset.h here */
 struct StringInfoData;			/* not to include stringinfo.h here */
 
 extern void outNode(struct StringInfoData *str, const void *obj);
 extern void outToken(struct StringInfoData *str, const char *s);
 extern void outBitmapset(struct StringInfoData *str,
 						 const struct Bitmapset *bms);
+extern void outBitset(struct StringInfoData *str, const struct Bitset *bs);
+
 extern void outDatum(struct StringInfoData *str, uintptr_t value,
 					 int typlen, bool typbyval);
 extern char *nodeToString(const void *obj);
 extern char *bmsToString(const struct Bitmapset *bms);
+extern char *bitsetToString(const struct Bitset *bs, bool asBitmap);
 
 /*
  * nodes/{readfuncs.c,read.c}
diff --git a/src/test/modules/test_misc/Makefile b/src/test/modules/test_misc/Makefile
index 39c6c2014a..af96604096 100644
--- a/src/test/modules/test_misc/Makefile
+++ b/src/test/modules/test_misc/Makefile
@@ -2,6 +2,17 @@
 
 TAP_TESTS = 1
 
+MODULE_big = test_misc
+OBJS = \
+	$(WIN32RES) \
+	test_misc.o
+PGFILEDESC = "test_misc"
+
+EXTENSION = test_misc
+DATA = test_misc--1.0.sql
+
+REGRESS = test_bitset
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/test_misc/README b/src/test/modules/test_misc/README
index 4876733fa2..ec426c4ad5 100644
--- a/src/test/modules/test_misc/README
+++ b/src/test/modules/test_misc/README
@@ -1,4 +1,2 @@
-This directory doesn't actually contain any extension module.
-
-What it is is a home for otherwise-unclassified TAP tests that exercise core
+What it is is a home for otherwise-unclassified tests that exercise core
 server features.  We might equally well have called it, say, src/test/misc.
diff --git a/src/test/modules/test_misc/expected/test_bitset.out b/src/test/modules/test_misc/expected/test_bitset.out
new file mode 100644
index 0000000000..3d0302d30d
--- /dev/null
+++ b/src/test/modules/test_misc/expected/test_bitset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_misc;
+SELECT test_bitset();
+ test_bitset 
+-------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 964d95db26..a23f3e3f47 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -1,5 +1,22 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+test_misc_sources = files(
+    'test_misc.c',
+)
+
+if host_system == 'windows'
+  test_misc_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_misc',
+    '--FILEDESC', 'test_misc - ',])
+endif
+
+test_misc = shared_module('test_misc',
+  test_misc_sources,
+  kwargs: pg_test_mod_args,
+)
+
+test_install_libs += test_misc
+
 tests += {
   'name': 'test_misc',
   'sd': meson.current_source_dir(),
diff --git a/src/test/modules/test_misc/sql/test_bitset.sql b/src/test/modules/test_misc/sql/test_bitset.sql
new file mode 100644
index 0000000000..0f73bbf532
--- /dev/null
+++ b/src/test/modules/test_misc/sql/test_bitset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_misc;
+
+SELECT test_bitset();
diff --git a/src/test/modules/test_misc/test_misc--1.0.sql b/src/test/modules/test_misc/test_misc--1.0.sql
new file mode 100644
index 0000000000..79afaa6263
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc--1.0.sql
@@ -0,0 +1,5 @@
+\echo Use "CREATE EXTENSION test_misc" to load this file. \quit
+
+CREATE FUNCTION test_bitset()
+       RETURNS pg_catalog.void
+       AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_misc/test_misc.c b/src/test/modules/test_misc/test_misc.c
new file mode 100644
index 0000000000..70d0255ada
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.c
@@ -0,0 +1,118 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_misc.c
+ *
+ * Copyright (c) 2022-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_dsa/test_misc.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "fmgr.h"
+#include "nodes/bitmapset.h"
+#include "nodes/nodes.h"
+#define BIT_ADD 0
+#define BIT_DEL 1
+#ifdef USE_ASSERT_CHECKING
+static void compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op);
+#endif
+PG_MODULE_MAGIC;
+/* Test basic DSA functionality */
+PG_FUNCTION_INFO_V1(test_bitset);
+Datum
+test_bitset(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ASSERT_CHECKING
+	Bitset	   *bs;
+	Bitset	   *bs2;
+	char	   *str1,
+			   *str2,
+			   *empty_str;
+	Bitmapset  *bms = NULL;
+	int			i;
+
+	empty_str = bmsToString(NULL);
+	/* size = 0 */
+	bs = bitset_init(0);
+	Assert(bs == NULL);
+	bitset_clear(bs);
+	Assert(bitset_is_empty(bs));
+	/* bitset_add_member(bs, 0); // crash. */
+	/* bitset_del_member(bs, 0); // crash. */
+	Assert(!bitset_is_member(0, bs));
+	Assert(bitset_next_member(bs, -1) == -2);
+	bs2 = bitset_copy(bs);
+	Assert(bs2 == NULL);
+	bitset_free(bs);
+	bitset_free(bs2);
+	/* size == 68, nword == 2 */
+	bs = bitset_init(68);
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_ADD);
+	}
+	Assert(!bitset_is_empty(bs));
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_DEL);
+	}
+	Assert(bitset_is_empty(bs));
+	bitset_clear(bs);
+	str1 = bitsetToString(bs, true);
+	Assert(strcmp(str1, empty_str) == 0);
+	bms = bitset_to_bitmap(bs);
+	str2 = bmsToString(bms);
+	Assert(strcmp(str1, str2) == 0);
+	bms = bitset_to_bitmap(NULL);
+	Assert(strcmp(bmsToString(bms), empty_str) == 0);
+	bitset_free(bs);
+#endif
+	PG_RETURN_VOID();
+}
+#ifdef USE_ASSERT_CHECKING
+static void
+compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op)
+{
+	char	   *str1,
+			   *str2,
+			   *str3,
+			   *str4;
+	Bitmapset  *bms3;
+	Bitset	   *bs4;
+
+	if (op == BIT_ADD)
+	{
+		*bms = bms_add_member(*bms, member);
+		bitset_add_member(bs, member);
+		Assert(bms_is_member(member, *bms));
+		Assert(bitset_is_member(member, bs));
+	}
+	else if (op == BIT_DEL)
+	{
+		*bms = bms_del_member(*bms, member);
+		bitset_del_member(bs, member);
+		Assert(!bms_is_member(member, *bms));
+		Assert(!bitset_is_member(member, bs));
+	}
+	else
+		Assert(false);
+	/* compare the rest existing bit */
+	str1 = bmsToString(*bms);
+	str2 = bitsetToString(bs, true);
+	Assert(strcmp(str1, str2) == 0);
+	/* test bitset_to_bitmap */
+	bms3 = bitset_to_bitmap(bs);
+	str3 = bmsToString(bms3);
+	Assert(strcmp(str3, str2) == 0);
+	/* test bitset_copy */
+	bs4 = bitset_copy(bs);
+	str4 = bitsetToString(bs4, true);
+	Assert(strcmp(str3, str4) == 0);
+	pfree(str1);
+	pfree(str2);
+	pfree(str3);
+	pfree(str4);
+}
+#endif
diff --git a/src/test/modules/test_misc/test_misc.control b/src/test/modules/test_misc/test_misc.control
new file mode 100644
index 0000000000..48fd08758f
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.control
@@ -0,0 +1,4 @@
+comment = 'Test misc'
+default_version = '1.0'
+module_pathname = '$libdir/test_misc'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d808aad8b0..dcba329ee4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -268,6 +268,7 @@ BitmapOr
 BitmapOrPath
 BitmapOrState
 Bitmapset
+Bitset
 Block
 BlockId
 BlockIdData
-- 
2.34.1

v7-0002-Shared-detoast-feature.patchtext/x-diffDownload
From d37b53cdaeb41989ff53e81da33cf15f5f08a450 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 20 Feb 2024 14:16:10 +0800
Subject: [PATCH v7 2/2] Shared detoast feature.

details at https://postgr.es/m/87il4jrk1l.fsf%40163.com
---
 src/backend/executor/execExpr.c         |  35 +-
 src/backend/executor/execExprInterp.c   | 180 ++++++++
 src/backend/executor/execTuples.c       | 134 ++++++
 src/backend/executor/execUtils.c        |   2 +
 src/backend/executor/nodeHashjoin.c     |   2 +
 src/backend/executor/nodeMergejoin.c    |   2 +
 src/backend/executor/nodeNestloop.c     |   1 +
 src/backend/jit/llvm/llvmjit_expr.c     |  26 +-
 src/backend/jit/llvm/llvmjit_types.c    |   1 +
 src/backend/optimizer/plan/createplan.c | 107 ++++-
 src/backend/optimizer/plan/setrefs.c    | 551 +++++++++++++++++++-----
 src/include/executor/execExpr.h         |  12 +
 src/include/executor/tuptable.h         |  67 +++
 src/include/nodes/execnodes.h           |  14 +
 src/include/nodes/plannodes.h           |  53 +++
 src/tools/pgindent/typedefs.list        |   2 +
 16 files changed, 1083 insertions(+), 106 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 3181b1136a..779fcfaab1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,51 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3f20f1dd31..878235ae76 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -158,6 +159,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -166,6 +170,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -181,6 +188,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		Assert(oldDatum != slot->tts_values[attnum]);
+		bitset_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -296,6 +340,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -346,6 +408,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -413,6 +490,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -597,6 +677,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2137,6 +2236,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2275,6 +2410,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..830e1d4ea6 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+											  TupleDesc tupleDesc,
+											  List *forbid_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bitset_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -392,6 +399,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated invalidated since tts_nvalid is set to 0, so
+	 * let's free the pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -457,6 +471,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -567,6 +584,9 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -637,6 +657,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -771,6 +794,9 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -904,6 +930,9 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1150,7 +1179,10 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 			 + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
 
 		PinTupleDesc(tupleDesc);
+		slot->pre_detoasted_attrs = bitset_init(tupleDesc->natts);
 	}
+	else
+		slot->pre_detoasted_attrs = NULL;
 
 	/*
 	 * And allow slot type specific initialization.
@@ -1288,6 +1320,8 @@ void
 ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 					  TupleDesc tupdesc)	/* new tuple descriptor */
 {
+	MemoryContext old;
+
 	Assert(!TTS_FIXED(slot));
 
 	/* For safety, make sure slot is empty before changing it */
@@ -1304,6 +1338,8 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		pfree(slot->tts_values);
 	if (slot->tts_isnull)
 		pfree(slot->tts_isnull);
+	if (slot->pre_detoasted_attrs)
+		bitset_free(slot->pre_detoasted_attrs);
 
 	/*
 	 * Install the new descriptor; if it's refcounted, bump its refcount.
@@ -1319,6 +1355,10 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(Datum));
 	slot->tts_isnull = (bool *)
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(bool));
+
+	old = MemoryContextSwitchTo(slot->tts_mcxt);
+	slot->pre_detoasted_attrs = bitset_init(tupdesc->natts);
+	MemoryContextSwitchTo(old);
 }
 
 /* --------------------------------
@@ -1810,12 +1850,26 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2390,83 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+/*
+ * cal_final_pre_detoast_attrs
+ *		Calculate the final attributes which pre-detoast be helpful.
+ *
+ * reference_attrs: the attributes which will be detoast at this plan level.
+ * due to the implementation issue, some non-toast attribute may be included
+ * which should be filtered out with tupleDesc.
+ *
+ * forbid_pre_detoast_vars: the vars which should not be pre-detoast as the
+ * small_tlist reason.
+ */
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+							TupleDesc tupleDesc,
+							List *forbid_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(reference_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * reference_attrs may have some attribute which is not toast attrs at
+	 * all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	/* Filter out the non-toastable attributes. */
+	final = bms_intersect(reference_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, forbid_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..a8646ded02 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,8 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0c448422e2..74563c3454 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 47c9daf402..1dcf0c2fd8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -178,4 +178,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 610f4a56d6..8acb48240e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -314,7 +314,9 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan,
+													 List *small_tlist);
+static void set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist);
 
 /*
  * create_plan
@@ -346,6 +348,12 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	/*
+	 * After the plan tree is built completed, we start to walk for which
+	 * expressions should not used the shared-detoast feature.
+	 */
+	set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +386,101 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_forbid_pre_detoast_vars_recurse
+ *	 Walking the Plan tree in the top-down manner to gather the vars which
+ * should be as small as possible and record them in Plan.forbid_pre_detoast_vars
+ *
+ * plan: the plan node to walk right now.
+ * small_tlist: a list of nodes which its subplan should provide them as
+ * small as possible.
+ */
+static void
+set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, small_tlist);
+
+	/* Recurse to its subplan.. */
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * For the sort-like nodes, we want the output of its subplan as small
+		 * as possible, but the subplan's other expressions like Qual doesn't
+		 * have this restriction since they are not output to the upper nodes.
+		 * so we set the small_tlist to the subplan->targetlist.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * If the left_small_tlist wants a as small as possible tlist, set it
+		 * in a way like sort for the left node.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+
+		/*
+		 * The righttree is a Hash node, it can be set with its own rule, so
+		 * the small_tlist provided is not important, we just need to recuse
+		 * to its subplan.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		/*
+		 * Recurse to its children, just push down the forbid_pre_detoast_vars
+		 * to its children.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	Set the Plan.forbid_pre_detoast_vars according the small_tlist information.
+ *
+ * small_tlist = NIL means nothing is forbidden, or else if a Var belongs to the
+ * small_tlist, then it must not be pre-detoasted.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	/*
+	 * fast path, if we don't have a small_tlist, the var in targetlist is
+	 * impossible member of it. and this case might be a pretty common case.
+	 */
+	if (small_tlist == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(small_tlist, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4893,6 +4996,8 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..f9b30903c2 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,48 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+/*
+ * Decide which attrs are detoasted in a expressions level, this is judged
+ * at the fix_scan/join_expr stage. The recursed level is tracked when we
+ * walk to a Var, if the level is greater than 1, then it means the
+ * var needs an detoast in this expression list, there are some exceptions
+ * here, see increase_level_for_pre_detoast for details.
+ */
+typedef struct
+{
+	/* if the level is added during a certain walk. */
+	bool		level_added;
+	/* the current level during the walk. */
+	int			level;
+} intermediate_level_context;
+
+/*
+ * Context to hold the detoast attribute within a expression.
+ *
+ * XXX: this design was intent to avoid the pre-detoast-logic if the var
+ * only need to be detoasted *once*, but for now, this context is only
+ * maintained at the expression level rather than plan tree level, so it
+ * can't detect if a Var will be detoasted 2+ time at the plan level.
+ * Recording the times of a Var is detoasted in the plan tree level is
+ * complex, so before we decide it is a must, I am not willing to do too
+ * many changes here.
+ */
+typedef struct
+{
+	/* var is accessed for the first time. */
+	Bitmapset  *existing_attrs;
+	/* var is accessed for the 2+ times. */
+	Bitmapset **final_ref_attrs;
+} intermediate_var_ref_context;
+
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +109,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +168,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +199,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +232,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +673,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +692,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +715,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +761,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +779,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +802,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,13 +825,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
@@ -757,12 +853,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +872,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +892,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +911,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +930,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +945,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +990,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1051,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1106,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1156,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1179,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1236,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1302,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1312,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1478,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1534,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1737,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1747,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1815,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2236,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context *ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context *ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context *ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context *level_ctx,
+					 intermediate_var_ref_context *ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2125,18 +2339,23 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
  * 'num_exec': estimated number of executions of expression
+ * 'scan_reference_attrs': gather which vars are potential to run the detoast
+ * 	on this expr, NULL means the caller doesn't have interests on this.
  *
  * The expression tree is either copied-and-modified, or modified in-place
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2386,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2410,16 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2199,8 +2429,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2442,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2451,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2532,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2582,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2597,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2631,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2641,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3287,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3301,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3338,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3356,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3372,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3393,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3432,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3446,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3510,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3666,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a28ddcdd77..9304786bb2 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,17 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/*
+	 * compute non-system Var value with shared-detoast-datum logic, use some
+	 * dedicated steps rather than add extra logic to existing steps is for
+	 * performance aspect, within this way, we just decide if the extra logic
+	 * is needed at ExecInitExpr stage once rather than every time of
+	 * ExecInterpExpr.
+	 */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -830,5 +841,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..d87751e460 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,25 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes whose values are the detoasted version in tts_values[*],
+	 * if so these memory needs some extra clean-up. These memory can't be put
+	 * into ecxt_per_tuple_memory since many of them needs a longer life span,
+	 * for example the Datum in outer join. These memory is put into
+	 * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
+	 * invalidated.
+	 *
+	 * Bitset rather than Bitmapset is chosen here because when all the
+	 * members of Bitmapset are deleted, the allocated memory will be
+	 * deallocated automatically, which is too expensive in this case since we
+	 * need to deleted all the members in each ExecClearTuple and repopulate
+	 * it again when fill the detoast datum to tts_values[*]. This situation
+	 * will be run again and again in an execution cycle.
+	 *
+	 * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
+	 */
+	Bitset	   *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -426,12 +446,36 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * ExecFreePreDetoastDatum - free the memory which is allocated in pre-detoast-datum.
+ */
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	attnum = -1;
+	while ((attnum = bitset_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	/*
+	 * unset the bits but keep the memory for later use, this is importance
+	 * for at the performance aspect.
+	 */
+	bitset_clear(slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -450,6 +494,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -494,6 +542,25 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	if (dstslot->tts_nvalid > 0 && srcslot->tts_nvalid > 0)
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bitset_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bitset_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..30fdb37d1c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1481,6 +1481,12 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the Scan nodes.
+	 */
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2010,6 +2016,13 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the join nodes.
+	 */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2771,4 +2784,5 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..ea5033aaa0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash wants them as small as possible.
+	 * It's a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,16 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +806,17 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +897,11 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * Whether the left plan tree should use a SMALL_TLIST.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1621,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dcba329ee4..308f19d61a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4047,6 +4047,8 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+intermediate_var_ref_context
+intermediate_level_context
 manifest_data
 manifest_writer
 rfile
-- 
2.34.1

#13Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#12)
Re: Shared detoast Datum proposal

On 2/20/24 19:38, Andy Fan wrote:

...

I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.

Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
indicate it is too expensive. and after talk with David at [2], I
introduced bitset and use it here. the test case I used comes from [1].
IRCC, there were 5% performance difference because of this.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;

To reproduce the difference, we can replace the bitset_clear() with

bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);

in ExecFreePreDetoastDatum. then it works same as Bitmapset.

I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.

That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.

On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?

The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.

about the memory usage, first it is kept as the same lifesplan as the
tts_values[*] which can be released pretty quickly, only if the certain
values of the tuples is not needed. it is true that we keep the detoast
version longer than before, but that's something we have to pay I
think.

Leaks may happen since tts_mcxt is reset at the end of *executor*. So if
we forget to release the memory when the tts_values[*] is invalidated
somehow, the memory will be leaked until the end of executor. I think
that will be enough to cause an issue. Currently besides I release such
memory at the ExecClearTuple, I also relase such memory whenever we set
tts_nvalid to 0, the theory used here is:

/*
* tts_values is treated invalidated since tts_nvalid is set to 0, so
* let's free the pre-detoast datum.
*/
ExecFreePreDetoastDatum(slot);

I will do more test on the memory leak stuff, since there are so many
operation aginst slot like ExecCopySlot etc, I don't know how to test it
fully. the method in my mind now is use TPCH with 10GB data size, and
monitor the query runtime memory usage.

I think this is exactly the "high level design" description that should
be in a comment, somewhere.

FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly.

Could you provide some keyword or function names for the expanded datum
here, I probably miss this.

see src/include/utils/expandeddatum.h

Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration.

hmm, acutally the "shared detoast value" just live in the
TupleTableSlot->tts_values[*], rather than the whole query duration. The
simple case is:

SELECT * FROM t WHERE a_text LIKE 'abc%';

when we scan to the next tuple, the detoast value for the previous tuple
will be relased.

But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?

But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.

maybe. currently I just use detoast_attr to get the desired version. I'm
pleasure if we have more effective way.

if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
bitset_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}

Right. FWIW I'm not sure if expanded objects are similar enough to be a
useful inspiration.

Unrelated question - could this actually end up being too aggressive?
That is, could it detoast attributes that we end up not needing? For
example, what if the tuple gets eliminated for some reason (e.g. because
of a restriction on the table, or not having a match in a join)? Won't
we detoast the tuple only to throw it away?

For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...

Hmm, I am not sure I understand you correctly at this part. In the
current patch, to avoid the run-time (ExecExprInterp) check if we
should detoast and save the datum, I defined 3 extra steps so that
the *extra check itself* is not needed for unnecessary attributes.
for example an datum for int or a detoast datum should not be saved back
to tts_values[*] due to the small_tlist reason. However these steps can
be generated is based on the output of createplan/setrefs changes. take
the INNER_VAR for example:

In ExecInitExprRec:

switch (variable->varno)
{
case INNER_VAR:
if (is_join_plan(plan) &&
bms_is_member(attnum,
((JoinState *) state->parent)->inner_pre_detoast_attrs))
{
scratch.opcode = EEOP_INNER_VAR_TOAST;
}
else
{
scratch.opcode = EEOP_INNER_VAR;
}
}

The data workflow is:

1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c)
decides which Vars should *not* be pre_detoasted because of small_tlist
reason and record it in Plan.forbid_pre_detoast_vars.

2. fix_scan_expr (in the setrefs.c) tracks which Vars should be
detoasted for the specific plan node and record them in it. Currently
only Scan and Join nodes support this feature.

typedef struct Scan
{
...
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *reference_attrs;
} Scan;

typedef struct Join
{
..
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;

Is it actually necessary to add new fields to these nodes? Also, the
names are not particularly descriptive of the purpose - it'd be better
to have "detoast" in the name, instead of generic "reference".

3. during the InitPlan stage, we maintain the
PlanState.xxx_pre_detoast_attrs and generated different StepOp for them.

4. At the ExecExprInterp stage, only the new StepOp do the extra check
to see if the detoast should happen. Other steps doesn't need this
check at all.

If we avoid the createplan/setref.c changes, probabaly some unrelated
StepOp needs the extra check as well?

When I worked with the UniqueKey feature, I maintained a
UniqueKey.README to summaried all the dicussed topics in threads, the
README is designed to save the effort for more reviewer, I think I
should apply the same logic for this feature.

Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code, I often not notice a README exists next to it).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#13)
Re: Shared detoast Datum proposal

I see your reply when I started to write a more high level
document. Thanks for the step by step help!

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 2/20/24 19:38, Andy Fan wrote:

...

I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.

Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
indicate it is too expensive. and after talk with David at [2], I
introduced bitset and use it here. the test case I used comes from [1].
IRCC, there were 5% performance difference because of this.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;
select b from w where b > 0;

To reproduce the difference, we can replace the bitset_clear() with

bitset_free(slot->pre_detoasted_attrs);
slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);

in ExecFreePreDetoastDatum. then it works same as Bitmapset.

I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.

That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.

I start to think this is a better way rather than the opposite. The next
version will be:

0001: shared detoast datum feature with high level introduction.
0002: introduce bitset and use it shared-detoast-datum feature, with the
test case to show the improvement.

I will do more test on the memory leak stuff, since there are so many
operation aginst slot like ExecCopySlot etc, I don't know how to test it
fully. the method in my mind now is use TPCH with 10GB data size, and
monitor the query runtime memory usage.

I think this is exactly the "high level design" description that should
be in a comment, somewhere.

Got it.

Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration.

hmm, acutally the "shared detoast value" just live in the
TupleTableSlot->tts_values[*], rather than the whole query duration. The
simple case is:

SELECT * FROM t WHERE a_text LIKE 'abc%';

when we scan to the next tuple, the detoast value for the previous tuple
will be relased.

But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?

Yes and only one copy for all the executor nodes.

Unrelated question - could this actually end up being too aggressive?
That is, could it detoast attributes that we end up not needing? For
example, what if the tuple gets eliminated for some reason (e.g. because
of a restriction on the table, or not having a match in a join)? Won't
we detoast the tuple only to throw it away?

The detoast datum will have the exactly same lifespan with other
tts_values[*]. If the tuple get eliminated for any reason, those detoast
datum still exist until the slot is cleared for storing the next tuple.

typedef struct Join
{
..
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;

Is it actually necessary to add new fields to these nodes? Also, the
names are not particularly descriptive of the purpose - it'd be better
to have "detoast" in the name, instead of generic "reference".

Because of the way of the data transformation, I think we must add the
fields to keep such inforamtion. Then these information will be used
initilize the necessary information in PlanState. maybe I am having a
fixed mindset, I can't think out a way to avoid that right now.

I used 'reference' rather than detoast is because some implementaion
issues. In the createplan.c and setref.c, I can't check the atttyplen
effectively, so even a Var with int type is still hold here which may
have nothing with detoast.

When I worked with the UniqueKey feature, I maintained a
UniqueKey.README to summaried all the dicussed topics in threads, the
README is designed to save the effort for more reviewer, I think I
should apply the same logic for this feature.

Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code, I often not notice a README exists next to it).

Great, I'd try this from tomorrow.

--
Best Regards
Andy Fan

#15Andy Fan
zhihuifan1213@163.com
In reply to: Andy Fan (#14)
5 attachment(s)
Re: Shared detoast Datum proposal

Hi,

Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code, I often not notice a README exists next to it).

Great, I'd try this from tomorrow.

I have made it. Currently I choose README because this feature changed
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high level design to any of them looks inappropriate. So the high level
design is here and detailed design for each steps is in the comments
around the code. Hope this is helpful!

The problem:
-------------

In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree it
immediately or leave it for ResetExprContext. Arguments for which one to
use exists sometimes. More serious problem is detoasting is expensive,
especially for the data types like jsonb or array, which the value might
be very huge. In the blow example, the detoasting happens twice.

SELECT jb_col->'a', jb_col->'b' FROM t;

Within the shared-detoast-datum, we just need to detoast once for each
tuple, and discard it immediately when the tuple is not needed any
more. FWIW this issue may existing for small numeric, text as well
because of SHORT_TOAST feature where the toast's len using 1 byte rather
than 4 bytes.

Current Design
--------------

The high level design is let createplan.c and setref.c decide which
Vars can use this feature, and then the executor save the detoast
datum back slot->to tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons includes:

- The existing expression engine read datum from tts_values[*], no any
extra work need to be done.
- Reuse the lifespan of TupleTableSlot system to manage memory. It
is natural to think the detoast datum is a tts_value just that it is
in a detoast format. Since we have a clear lifespan for TupleTableSlot
already, like ExecClearTuple, ExecCopySlot et. We are easy to reuse
them for managing the datoast datum's memory.
- The existing projection method can copy the datoasted datum (int64)
automatically to the next node's slot, but keeping the ownership
unchanged, so only the slot where the detoast really happen take the
charge of it's lifespan.

So the final data change is adding the below field into TubleTableSlot.

typedef struct TupleTableSlot
{
..

/*
* The attributes whose values are the detoasted version in tts_values[*],
* if so these memory needs some extra clean-up. These memory can't be put
* into ecxt_per_tuple_memory since many of them needs a longer life
* span.
*
* These memory is put into TupleTableSlot.tts_mcxt and be clear
* whenever the tts_values[*] is invalidated.
*/
Bitmapset *pre_detoast_attrs;
};

Assuming which Var should use this feature has been decided in
createplan.c and setref.c already. The 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST as used. During the evaluating these
steps, the below code is used.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
(struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
slot->pre_detoasted_attrs= bms_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}
}

Since I don't want to the run-time extra check to see if is a detoast
should happen, so introducing 3 new steps.

When to free the detoast datum? It depends on when the slot's
tts_values[*] is invalidated, ExecClearTuple is the clear one, but any
TupleTableSlotOps which set the tts_nvalid = 0 tells us no one will use
the datum in tts_values[*] so it is time to release them based on
slot.pre_detoast_attrs as well.

Now comes to the createplan.c/setref.c part, which decides which Vars
should use the shared detoast feature. The guideline of this is:

1. It needs a detoast for a given expression.
2. It should not breaks the CP_SMALL_TLIST design. Since we saved the
detoast datum back to tts_values[*], which make tuple bigger. if we
do this blindly, it would be harmful to the ORDER / HASH style nodes.

A high level data flow is:

1. at the createplan.c, we walk the plan tree go gather the
CP_SMALL_TLIST because of SORT/HASH style nodes, information and save
it to Plan.forbid_pre_detoast_vars via the function
set_plan_forbid_pre_detoast_vars_recurse.

2. at the setrefs.c, fix_{scan|join}_expr will recurse to Var for each
expression, so it is a good time to track the attribute number and
see if the Var is directly or indirectly accessed. Usually the
indirectly access a Var means a detoast would happens, for
example an expression like a > 3. However some known expressions like
VAR is NULL; is ignored. The output is {Scan|Join}.xxx_reference_attrs;

As a result, the final result is added into the plan node of Scan and
Join.

typedef struct Scan
{
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *reference_attrs;
} Scan;

typedef struct Join
{
..
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;

Note that here I used '_reference_' rather than '_detoast_' is because
at this part, I still don't know if it is a toastable attrbiute, which
is known at the MakeTupleTableSlot stage.

3. At the InitPlan Stage, we calculate the final xxx_pre_detoast_attrs
in ScanState & JoinState, which will be passed into expression
engine in the ExecInitExprRec stage and EEOP_{INNER|OUTER|SCAN}
_VAR_TOAST steps are generated finally then everything is connected
with ExecSlotDetoastDatum!

Testing
-------

Case 1:

create table t (a numeric);
insert into t select i from generate_series(1, 100000)i;

cat 1.sql

select * from t where a > 0;

In this test, the current master run detoast twice for each datum. one
in numeric_gt, one in numeric_out. this feature makes the detoast once.

pgbench -f 1.sql -n postgres -T 10 -M prepared

master: 30.218 ms
patched(Bitmapset): 30.881ms

Then we can see the perf report as below:

- 7.34% 0.00% postgres postgres [.] ExecSlotDetoastDatum (inlined)
- ExecSlotDetoastDatum (inlined)
- 3.47% bms_add_member
- 3.06% bms_make_singleton (inlined)
- palloc0
1.30% AllocSetAlloc

- 5.99% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
- ExecFreePreDetoastDatum (inlined)
2.64% bms_next_member
1.17% bms_del_members
0.94% AllocSetFree

One of the reasons is because Bitmapset will deallocate its memory when
all the bits are deleted due to commit 00b41463c, then we have to
allocate memory at the next time when adding a member to it. This kind
of action is executed 100000 times in the above workload.

Then I introduce bitset data struct (commit 0002) which is pretty like
the Bitmapset, but it would not deallocate the memory when all the bits
is unset. and use it in this feature (commit 0003). Then the result
became to: 28.715ms

- 5.22% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
- ExecFreePreDetoastDatum (inlined)
- 2.82% bitset_next_member
1.69% bms_next_member_internal (inlined)
0.95% bitset_next_member
0.66% AllocSetFree

Here we can see the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree. if we put the detoast datum into
a dedicated memory context, then we can save the cost of
bitset_next_member since can discard all the memory in once and use
MemoryContextReset instead of AllocSetFree (commit 0004). then the
result became to 27.840ms.

So the final result for case 1:

master: 30.218 ms
patched(Bitmapset): 30.881ms
patched(bitset): 28.715ms
latency average(bitset + tts_value_mctx) = 27.840 ms

Big jsonbs test:

create table b(big jsonb);

insert into b select
jsonb_object_agg(x::text,
random()::text || random()::text || random()::text)
from generate_series(1,600) f(x);

insert into b select (select big from b) from generate_series(1, 1000)i;

explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

master: 702.224 ms
patched: 133.306 ms

Memory usage test:

I run the workload of tpch scale 10 on against both master and patched
versions, the memory usage looks stable.

In progress work:

I'm still running tpc-h scale 100 to see if anything interesting
finding, that is in progress. As for the scale 10:

master: 55s
patched: 56s

The reason is q9 plan changed a bit, the real reason needs some more
time. Since this patch doesn't impact on the best path generation, so it
should not reasonble for me.

--
Best Regards
Andy Fan

Attachments:

v8-0001-Shared-detoast-feature.patchtext/x-diffDownload
From cce81190cca7209cbb8aa314a27c52832fbe0a2d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 20 Feb 2024 14:16:10 +0800
Subject: [PATCH v8 1/5] Shared detoast feature.

details at https://postgr.es/m/87il4jrk1l.fsf%40163.com
---
 src/backend/executor/execExpr.c         |  35 +-
 src/backend/executor/execExprInterp.c   | 180 ++++++++
 src/backend/executor/execTuples.c       | 134 ++++++
 src/backend/executor/execUtils.c        |   2 +
 src/backend/executor/nodeHashjoin.c     |   2 +
 src/backend/executor/nodeMergejoin.c    |   2 +
 src/backend/executor/nodeNestloop.c     |   1 +
 src/backend/jit/llvm/llvmjit_expr.c     |  26 +-
 src/backend/jit/llvm/llvmjit_types.c    |   1 +
 src/backend/optimizer/plan/createplan.c | 107 ++++-
 src/backend/optimizer/plan/setrefs.c    | 551 +++++++++++++++++++-----
 src/include/executor/execExpr.h         |  12 +
 src/include/executor/tuptable.h         |  56 +++
 src/include/nodes/execnodes.h           |  14 +
 src/include/nodes/plannodes.h           |  53 +++
 src/tools/pgindent/typedefs.list        |   2 +
 16 files changed, 1072 insertions(+), 106 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 3181b1136a..779fcfaab1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,51 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3f20f1dd31..dc0db12ff4 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -158,6 +159,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -166,6 +170,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -181,6 +188,43 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		Datum		oldDatum;
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+
+		oldDatum = slot->tts_values[attnum];
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
+																(struct varlena *) oldDatum));
+		Assert(slot->tts_nvalid > attnum);
+		Assert(oldDatum != slot->tts_values[attnum]);
+		slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		MemoryContextSwitchTo(old);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -296,6 +340,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -346,6 +408,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -413,6 +490,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -597,6 +677,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2137,6 +2236,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2275,6 +2410,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..2d11e5e8b3 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+											  TupleDesc tupleDesc,
+											  List *forbid_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -176,6 +179,10 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
+		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+			/* it has been in slot->tts_mcxt already. */
+			continue;
+
 		val = slot->tts_values[natt];
 
 		if (att->attlen == -1 &&
@@ -392,6 +399,13 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated invalidated since tts_nvalid is set to 0, so
+	 * let's free the pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 }
 
 static void
@@ -457,6 +471,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -567,6 +584,9 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -637,6 +657,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -771,6 +794,9 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -904,6 +930,9 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1150,7 +1179,10 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 			 + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
 
 		PinTupleDesc(tupleDesc);
+		slot->pre_detoasted_attrs = NULL;
 	}
+	else
+		slot->pre_detoasted_attrs = NULL;
 
 	/*
 	 * And allow slot type specific initialization.
@@ -1288,6 +1320,8 @@ void
 ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 					  TupleDesc tupdesc)	/* new tuple descriptor */
 {
+	MemoryContext old;
+
 	Assert(!TTS_FIXED(slot));
 
 	/* For safety, make sure slot is empty before changing it */
@@ -1304,6 +1338,8 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		pfree(slot->tts_values);
 	if (slot->tts_isnull)
 		pfree(slot->tts_isnull);
+	if (slot->pre_detoasted_attrs)
+		bms_free(slot->pre_detoasted_attrs);
 
 	/*
 	 * Install the new descriptor; if it's refcounted, bump its refcount.
@@ -1319,6 +1355,10 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(Datum));
 	slot->tts_isnull = (bool *)
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(bool));
+
+	old = MemoryContextSwitchTo(slot->tts_mcxt);
+	slot->pre_detoasted_attrs = NULL;
+	MemoryContextSwitchTo(old);
 }
 
 /* --------------------------------
@@ -1810,12 +1850,26 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2390,83 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+/*
+ * cal_final_pre_detoast_attrs
+ *		Calculate the final attributes which pre-detoast be helpful.
+ *
+ * reference_attrs: the attributes which will be detoast at this plan level.
+ * due to the implementation issue, some non-toast attribute may be included
+ * which should be filtered out with tupleDesc.
+ *
+ * forbid_pre_detoast_vars: the vars which should not be pre-detoast as the
+ * small_tlist reason.
+ */
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+							TupleDesc tupleDesc,
+							List *forbid_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(reference_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * reference_attrs may have some attribute which is not toast attrs at
+	 * all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	/* Filter out the non-toastable attributes. */
+	final = bms_intersect(reference_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, forbid_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..a8646ded02 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,8 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0c448422e2..74563c3454 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 47c9daf402..1dcf0c2fd8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -178,4 +178,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 610f4a56d6..8acb48240e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -314,7 +314,9 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan,
+													 List *small_tlist);
+static void set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist);
 
 /*
  * create_plan
@@ -346,6 +348,12 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	/*
+	 * After the plan tree is built completed, we start to walk for which
+	 * expressions should not used the shared-detoast feature.
+	 */
+	set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +386,101 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_forbid_pre_detoast_vars_recurse
+ *	 Walking the Plan tree in the top-down manner to gather the vars which
+ * should be as small as possible and record them in Plan.forbid_pre_detoast_vars
+ *
+ * plan: the plan node to walk right now.
+ * small_tlist: a list of nodes which its subplan should provide them as
+ * small as possible.
+ */
+static void
+set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, small_tlist);
+
+	/* Recurse to its subplan.. */
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * For the sort-like nodes, we want the output of its subplan as small
+		 * as possible, but the subplan's other expressions like Qual doesn't
+		 * have this restriction since they are not output to the upper nodes.
+		 * so we set the small_tlist to the subplan->targetlist.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * If the left_small_tlist wants a as small as possible tlist, set it
+		 * in a way like sort for the left node.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+
+		/*
+		 * The righttree is a Hash node, it can be set with its own rule, so
+		 * the small_tlist provided is not important, we just need to recuse
+		 * to its subplan.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		/*
+		 * Recurse to its children, just push down the forbid_pre_detoast_vars
+		 * to its children.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	Set the Plan.forbid_pre_detoast_vars according the small_tlist information.
+ *
+ * small_tlist = NIL means nothing is forbidden, or else if a Var belongs to the
+ * small_tlist, then it must not be pre-detoasted.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	/*
+	 * fast path, if we don't have a small_tlist, the var in targetlist is
+	 * impossible member of it. and this case might be a pretty common case.
+	 */
+	if (small_tlist == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(small_tlist, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4893,6 +4996,8 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..f9b30903c2 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,48 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+/*
+ * Decide which attrs are detoasted in a expressions level, this is judged
+ * at the fix_scan/join_expr stage. The recursed level is tracked when we
+ * walk to a Var, if the level is greater than 1, then it means the
+ * var needs an detoast in this expression list, there are some exceptions
+ * here, see increase_level_for_pre_detoast for details.
+ */
+typedef struct
+{
+	/* if the level is added during a certain walk. */
+	bool		level_added;
+	/* the current level during the walk. */
+	int			level;
+} intermediate_level_context;
+
+/*
+ * Context to hold the detoast attribute within a expression.
+ *
+ * XXX: this design was intent to avoid the pre-detoast-logic if the var
+ * only need to be detoasted *once*, but for now, this context is only
+ * maintained at the expression level rather than plan tree level, so it
+ * can't detect if a Var will be detoasted 2+ time at the plan level.
+ * Recording the times of a Var is detoasted in the plan tree level is
+ * complex, so before we decide it is a must, I am not willing to do too
+ * many changes here.
+ */
+typedef struct
+{
+	/* var is accessed for the first time. */
+	Bitmapset  *existing_attrs;
+	/* var is accessed for the 2+ times. */
+	Bitmapset **final_ref_attrs;
+} intermediate_var_ref_context;
+
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +109,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +168,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +199,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +232,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -628,10 +673,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +692,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +715,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +761,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +779,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +802,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,13 +825,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
@@ -757,12 +853,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +872,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +892,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +911,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +930,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +945,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +990,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1051,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1106,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1156,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1179,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1236,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1302,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1312,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1478,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1534,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1737,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1747,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1815,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2236,95 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context *ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context *ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
+	{
+		/* let's not detoast first so that pg_column_compression works. */
+		ctx->level_added = false;
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context *ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context *level_ctx,
+					 intermediate_var_ref_context *ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2125,18 +2339,23 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
  * 'num_exec': estimated number of executions of expression
+ * 'scan_reference_attrs': gather which vars are potential to run the detoast
+ * 	on this expr, NULL means the caller doesn't have interests on this.
  *
  * The expression tree is either copied-and-modified, or modified in-place
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2386,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2410,16 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2199,8 +2429,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2442,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2451,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2532,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2582,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2597,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2631,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2641,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3287,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3301,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3338,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3356,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3372,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3393,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3432,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3446,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3510,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3666,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a28ddcdd77..9304786bb2 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,17 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/*
+	 * compute non-system Var value with shared-detoast-datum logic, use some
+	 * dedicated steps rather than add extra logic to existing steps is for
+	 * performance aspect, within this way, we just decide if the extra logic
+	 * is needed at ExecInitExpr stage once rather than every time of
+	 * ExecInterpExpr.
+	 */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -830,5 +841,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..3951ffc495 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -18,6 +18,7 @@
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
+#include "nodes/bitmapset.h"
 #include "storage/buf.h"
 
 /*----------
@@ -128,6 +129,18 @@ typedef struct TupleTableSlot
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
+
+	/*
+	 * The attributes whose values are the detoasted version in tts_values[*],
+	 * if so these memory needs some extra clean-up. These memory can't be put
+	 * into ecxt_per_tuple_memory since many of them needs a longer life span,
+	 * for example the Datum in outer join. These memory is put into
+	 * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
+	 * invalidated.
+	 *
+	 * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
+	 */
+	Bitmapset	   *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -426,12 +439,32 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * ExecFreePreDetoastDatum - free the memory which is allocated in pre-detoast-datum.
+ */
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	int			attnum;
+
+	attnum = -1;
+	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	{
+		pfree((void *) slot->tts_values[attnum]);
+	}
+
+	slot->pre_detoasted_attrs = bms_del_members(slot->pre_detoasted_attrs, slot->pre_detoasted_attrs);
+}
+
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
@@ -450,6 +483,10 @@ ExecClearTuple(TupleTableSlot *slot)
 static inline void
 ExecMaterializeSlot(TupleTableSlot *slot)
 {
+	/*
+	 * XXX: pre_detoasted_attrs doesn't dependent on any external storage, so
+	 * nothing should be done here.
+	 */
 	slot->tts_ops->materialize(slot);
 }
 
@@ -494,6 +531,25 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 	dstslot->tts_ops->copyslot(dstslot, srcslot);
 
+	if (dstslot->tts_nvalid > 0 && srcslot->tts_nvalid > 0)
+	{
+		int			attnum = -1;
+		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
+
+		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+
+		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		{
+			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
+			Size		len;
+
+			Assert(!VARATT_IS_EXTENDED(datum));
+			len = VARSIZE(datum);
+			dstslot->tts_values[attnum] = (Datum) palloc(len);
+			memcpy((void *) dstslot->tts_values[attnum], datum, len);
+		}
+		MemoryContextSwitchTo(old);
+	}
 	return dstslot;
 }
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..30fdb37d1c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1481,6 +1481,12 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the Scan nodes.
+	 */
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2010,6 +2016,13 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the join nodes.
+	 */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2771,4 +2784,5 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..ea5033aaa0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash wants them as small as possible.
+	 * It's a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,16 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +806,17 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +897,11 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * Whether the left plan tree should use a SMALL_TLIST.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1621,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf..ac55727e05 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4049,6 +4049,8 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+intermediate_var_ref_context
+intermediate_level_context
 manifest_data
 manifest_writer
 rfile
-- 
2.34.1

v8-0002-Introduce-a-Bitset-data-struct.patchtext/x-diffDownload
From bd3dc494097811542fd0e9bd73d876f534543e54 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v8 2/5] Introduce a Bitset data struct.

While Bitmapset is designed for variable-length of bits, Bitset is
designed for fixed-length of bits, the fixed length must be specified at
the bitset_init stage and keep unchanged at the whole lifespan. Because
of this, some operations on Bitset is simpler than Bitmapset.

The bitset_clear unsets all the bits but kept the allocated memory, this
capacity is impossible for bit Bitmapset for some solid reasons and this
is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c                 | 200 +++++++++++++++++-
 src/backend/nodes/outfuncs.c                  |  51 +++++
 src/include/nodes/bitmapset.h                 |  28 +++
 src/include/nodes/nodes.h                     |   4 +
 src/test/modules/test_misc/Makefile           |  11 +
 src/test/modules/test_misc/README             |   4 +-
 .../test_misc/expected/test_bitset.out        |   7 +
 src/test/modules/test_misc/meson.build        |  17 ++
 .../modules/test_misc/sql/test_bitset.sql     |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c        | 118 +++++++++++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list              |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 					  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset. the set will be round up to nwords;
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] &= ~((bitmapword) 1 << bitnum);
+}
+
+int
+bitset_is_member(int x, Bitset *a)
+{
+	int			wordnum,
+				bitnum;
+
+	/* used in expression engine */
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	if (a == NULL)
+		return false;
+
+	if (wordnum >= a->nwords)
+		return false;
+
+	return (a->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+int
+bitset_next_member(const Bitset *a, int prevbit)
+{
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
+
+/*
+ * bitset_to_bitmap - build a legal bitmapset from bitset.
+ */
+Bitmapset *
+bitset_to_bitmap(Bitset *a)
+{
+	int			n;
+
+	bool		found = false;	/* any non-empty bits */
+	Bitmapset  *result;
+	int			i;
+
+	if (a == NULL)
+		return NULL;
+
+	n = a->nwords - 1;
+	do
+	{
+		if (a->words[n] > 0)
+		{
+			found = true;
+			break;
+		}
+	} while (--n >= 0);
+
+	if (!found)
+		return NULL;
+
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(n + 1));
+	result->type = T_Bitmapset;
+	result->nwords = n + 1;
+
+	Assert(result->nwords <= a->nwords);
+
+	i = 0;
+	do
+	{
+		result->words[i] = a->words[i];
+	} while (++i < result->nwords);
+
+	return result;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 2c30bba212..f2ee806ef1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -331,6 +331,43 @@ outBitmapset(StringInfo str, const Bitmapset *bms)
 	appendStringInfoChar(str, ')');
 }
 
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+static void
+outBitsetInternal(struct StringInfoData *str,
+				  const struct Bitset *bs,
+				  bool asBitmap)
+{
+	int			x;
+
+	appendStringInfoChar(str, '(');
+	if (asBitmap)
+		appendStringInfoChar(str, 'b');
+	else
+		appendStringInfo(str, "bs");
+	x = -1;
+	while ((x = bitset_next_member(bs, x)) >= 0)
+		appendStringInfo(str, " %d", x);
+	appendStringInfoChar(str, ')');
+}
+
+
+/*
+ * outBitset -
+ *	similar to outBitmapset, but for Bitset.
+ */
+void
+outBitset(struct StringInfoData *str,
+		  const struct Bitset *bs)
+{
+	outBitsetInternal(str, bs, false);
+}
+
+
 /*
  * Print the value of a Datum given its type.
  */
@@ -783,3 +820,17 @@ bmsToString(const Bitmapset *bms)
 	outBitmapset(&str, bms);
 	return str.data;
 }
+
+/*
+ * bitsetToString -
+ *	   similar to bmsToString, but for Bitset
+ */
+char *
+bitsetToString(const struct Bitset *bs, bool asBitmap)
+{
+	StringInfoData str;
+
+	initStringInfo(&str);
+	outBitsetInternal(&str, bs, asBitmap);
+	return str.data;
+}
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 906e8dcc15..95ff37c6e9 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -55,6 +55,24 @@ typedef struct Bitmapset
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
 } Bitmapset;
 
+/*
+ * While Bitmapset is designed for variable-length of bits, Bitset is
+ * designed for fixed-length of bits, the fixed length must be specified at
+ * the bitset_init stage and keep unchanged at the whole lifespan. Because
+ * of this, some operations on Bitset is simpler than Bitmapset.
+ *
+ * The bitset_clear unsets all the bits but kept the allocated memory, this
+ * capacity is impossible for bit Bitmapset for some solid reasons.
+ *
+ * Also for performance aspect, the functions for Bitset removed some
+ * unlikely checks, instead with some Asserts.
+ */
+
+typedef struct Bitset
+{
+	int			nwords;			/* number of words in array */
+	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];	/* really [nwords] */
+} Bitset;
 
 /* result of bms_subset_compare */
 typedef enum
@@ -124,4 +142,14 @@ extern uint32 bms_hash_value(const Bitmapset *a);
 extern uint32 bitmap_hash(const void *key, Size keysize);
 extern int	bitmap_match(const void *key1, const void *key2, Size keysize);
 
+extern Bitset *bitset_init(size_t size);
+extern void bitset_clear(Bitset *a);
+extern void bitset_free(Bitset *a);
+extern bool bitset_is_empty(Bitset *a);
+extern Bitset *bitset_copy(Bitset *a);
+extern void bitset_add_member(Bitset *a, int x);
+extern void bitset_del_member(Bitset *a, int x);
+extern int	bitset_is_member(int bit, Bitset *a);
+extern int	bitset_next_member(const Bitset *a, int prevbit);
+extern Bitmapset *bitset_to_bitmap(Bitset *a);
 #endif							/* BITMAPSET_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 2969dd831b..4d13107990 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -186,16 +186,20 @@ castNodeImpl(NodeTag type, void *ptr)
  * nodes/{outfuncs.c,print.c}
  */
 struct Bitmapset;				/* not to include bitmapset.h here */
+struct Bitset;					/* not to include bitmapset.h here */
 struct StringInfoData;			/* not to include stringinfo.h here */
 
 extern void outNode(struct StringInfoData *str, const void *obj);
 extern void outToken(struct StringInfoData *str, const char *s);
 extern void outBitmapset(struct StringInfoData *str,
 						 const struct Bitmapset *bms);
+extern void outBitset(struct StringInfoData *str, const struct Bitset *bs);
+
 extern void outDatum(struct StringInfoData *str, uintptr_t value,
 					 int typlen, bool typbyval);
 extern char *nodeToString(const void *obj);
 extern char *bmsToString(const struct Bitmapset *bms);
+extern char *bitsetToString(const struct Bitset *bs, bool asBitmap);
 
 /*
  * nodes/{readfuncs.c,read.c}
diff --git a/src/test/modules/test_misc/Makefile b/src/test/modules/test_misc/Makefile
index 39c6c2014a..af96604096 100644
--- a/src/test/modules/test_misc/Makefile
+++ b/src/test/modules/test_misc/Makefile
@@ -2,6 +2,17 @@
 
 TAP_TESTS = 1
 
+MODULE_big = test_misc
+OBJS = \
+	$(WIN32RES) \
+	test_misc.o
+PGFILEDESC = "test_misc"
+
+EXTENSION = test_misc
+DATA = test_misc--1.0.sql
+
+REGRESS = test_bitset
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/src/test/modules/test_misc/README b/src/test/modules/test_misc/README
index 4876733fa2..ec426c4ad5 100644
--- a/src/test/modules/test_misc/README
+++ b/src/test/modules/test_misc/README
@@ -1,4 +1,2 @@
-This directory doesn't actually contain any extension module.
-
-What it is is a home for otherwise-unclassified TAP tests that exercise core
+What it is is a home for otherwise-unclassified tests that exercise core
 server features.  We might equally well have called it, say, src/test/misc.
diff --git a/src/test/modules/test_misc/expected/test_bitset.out b/src/test/modules/test_misc/expected/test_bitset.out
new file mode 100644
index 0000000000..3d0302d30d
--- /dev/null
+++ b/src/test/modules/test_misc/expected/test_bitset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_misc;
+SELECT test_bitset();
+ test_bitset 
+-------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 964d95db26..a23f3e3f47 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -1,5 +1,22 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+test_misc_sources = files(
+    'test_misc.c',
+)
+
+if host_system == 'windows'
+  test_misc_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_misc',
+    '--FILEDESC', 'test_misc - ',])
+endif
+
+test_misc = shared_module('test_misc',
+  test_misc_sources,
+  kwargs: pg_test_mod_args,
+)
+
+test_install_libs += test_misc
+
 tests += {
   'name': 'test_misc',
   'sd': meson.current_source_dir(),
diff --git a/src/test/modules/test_misc/sql/test_bitset.sql b/src/test/modules/test_misc/sql/test_bitset.sql
new file mode 100644
index 0000000000..0f73bbf532
--- /dev/null
+++ b/src/test/modules/test_misc/sql/test_bitset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_misc;
+
+SELECT test_bitset();
diff --git a/src/test/modules/test_misc/test_misc--1.0.sql b/src/test/modules/test_misc/test_misc--1.0.sql
new file mode 100644
index 0000000000..79afaa6263
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc--1.0.sql
@@ -0,0 +1,5 @@
+\echo Use "CREATE EXTENSION test_misc" to load this file. \quit
+
+CREATE FUNCTION test_bitset()
+       RETURNS pg_catalog.void
+       AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_misc/test_misc.c b/src/test/modules/test_misc/test_misc.c
new file mode 100644
index 0000000000..70d0255ada
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.c
@@ -0,0 +1,118 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_misc.c
+ *
+ * Copyright (c) 2022-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_dsa/test_misc.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "fmgr.h"
+#include "nodes/bitmapset.h"
+#include "nodes/nodes.h"
+#define BIT_ADD 0
+#define BIT_DEL 1
+#ifdef USE_ASSERT_CHECKING
+static void compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op);
+#endif
+PG_MODULE_MAGIC;
+/* Test basic DSA functionality */
+PG_FUNCTION_INFO_V1(test_bitset);
+Datum
+test_bitset(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ASSERT_CHECKING
+	Bitset	   *bs;
+	Bitset	   *bs2;
+	char	   *str1,
+			   *str2,
+			   *empty_str;
+	Bitmapset  *bms = NULL;
+	int			i;
+
+	empty_str = bmsToString(NULL);
+	/* size = 0 */
+	bs = bitset_init(0);
+	Assert(bs == NULL);
+	bitset_clear(bs);
+	Assert(bitset_is_empty(bs));
+	/* bitset_add_member(bs, 0); // crash. */
+	/* bitset_del_member(bs, 0); // crash. */
+	Assert(!bitset_is_member(0, bs));
+	Assert(bitset_next_member(bs, -1) == -2);
+	bs2 = bitset_copy(bs);
+	Assert(bs2 == NULL);
+	bitset_free(bs);
+	bitset_free(bs2);
+	/* size == 68, nword == 2 */
+	bs = bitset_init(68);
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_ADD);
+	}
+	Assert(!bitset_is_empty(bs));
+	for (i = 0; i < 68; i = i + 3)
+	{
+		compare_bms_bs(&bms, bs, i, BIT_DEL);
+	}
+	Assert(bitset_is_empty(bs));
+	bitset_clear(bs);
+	str1 = bitsetToString(bs, true);
+	Assert(strcmp(str1, empty_str) == 0);
+	bms = bitset_to_bitmap(bs);
+	str2 = bmsToString(bms);
+	Assert(strcmp(str1, str2) == 0);
+	bms = bitset_to_bitmap(NULL);
+	Assert(strcmp(bmsToString(bms), empty_str) == 0);
+	bitset_free(bs);
+#endif
+	PG_RETURN_VOID();
+}
+#ifdef USE_ASSERT_CHECKING
+static void
+compare_bms_bs(Bitmapset **bms, Bitset *bs, int member, int op)
+{
+	char	   *str1,
+			   *str2,
+			   *str3,
+			   *str4;
+	Bitmapset  *bms3;
+	Bitset	   *bs4;
+
+	if (op == BIT_ADD)
+	{
+		*bms = bms_add_member(*bms, member);
+		bitset_add_member(bs, member);
+		Assert(bms_is_member(member, *bms));
+		Assert(bitset_is_member(member, bs));
+	}
+	else if (op == BIT_DEL)
+	{
+		*bms = bms_del_member(*bms, member);
+		bitset_del_member(bs, member);
+		Assert(!bms_is_member(member, *bms));
+		Assert(!bitset_is_member(member, bs));
+	}
+	else
+		Assert(false);
+	/* compare the rest existing bit */
+	str1 = bmsToString(*bms);
+	str2 = bitsetToString(bs, true);
+	Assert(strcmp(str1, str2) == 0);
+	/* test bitset_to_bitmap */
+	bms3 = bitset_to_bitmap(bs);
+	str3 = bmsToString(bms3);
+	Assert(strcmp(str3, str2) == 0);
+	/* test bitset_copy */
+	bs4 = bitset_copy(bs);
+	str4 = bitsetToString(bs4, true);
+	Assert(strcmp(str3, str4) == 0);
+	pfree(str1);
+	pfree(str2);
+	pfree(str3);
+	pfree(str4);
+}
+#endif
diff --git a/src/test/modules/test_misc/test_misc.control b/src/test/modules/test_misc/test_misc.control
new file mode 100644
index 0000000000..48fd08758f
--- /dev/null
+++ b/src/test/modules/test_misc/test_misc.control
@@ -0,0 +1,4 @@
+comment = 'Test misc'
+default_version = '1.0'
+module_pathname = '$libdir/test_misc'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ac55727e05..152e586252 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -268,6 +268,7 @@ BitmapOr
 BitmapOrPath
 BitmapOrState
 Bitmapset
+Bitset
 Block
 BlockId
 BlockIdData
-- 
2.34.1

v8-0003-Use-bitset-instead-of-Bitmapset-for-pre_detoast_a.patchtext/x-diffDownload
From 99fe617f2e09955f1ef638b360bc7b819779b18f Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sat, 24 Feb 2024 21:26:19 +0800
Subject: [PATCH v8 3/5] Use bitset instead of Bitmapset for pre_detoast_attrs

due to commit 00b41463c, when all the bits in a bitmapset is unset, the
memory under it will be released automatically, this is cause extra
overhead for shared-detoast-datum feature, so using bitset now.
---
 src/backend/executor/execExprInterp.c |  2 +-
 src/backend/executor/execTuples.c     |  8 ++++----
 src/include/executor/tuptable.h       | 17 ++++++++++++-----
 3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index dc0db12ff4..878235ae76 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -202,7 +202,7 @@ ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
 																(struct varlena *) oldDatum));
 		Assert(slot->tts_nvalid > attnum);
 		Assert(oldDatum != slot->tts_values[attnum]);
-		slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
+		bitset_add_member(slot->pre_detoasted_attrs, attnum);
 		MemoryContextSwitchTo(old);
 	}
 }
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 2d11e5e8b3..830e1d4ea6 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -179,7 +179,7 @@ tts_virtual_materialize(TupleTableSlot *slot)
 		if (att->attbyval || slot->tts_isnull[natt])
 			continue;
 
-		if (bms_is_member(natt, slot->pre_detoasted_attrs))
+		if (bitset_is_member(natt, slot->pre_detoasted_attrs))
 			/* it has been in slot->tts_mcxt already. */
 			continue;
 
@@ -1179,7 +1179,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 			 + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
 
 		PinTupleDesc(tupleDesc);
-		slot->pre_detoasted_attrs = NULL;
+		slot->pre_detoasted_attrs = bitset_init(tupleDesc->natts);
 	}
 	else
 		slot->pre_detoasted_attrs = NULL;
@@ -1339,7 +1339,7 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 	if (slot->tts_isnull)
 		pfree(slot->tts_isnull);
 	if (slot->pre_detoasted_attrs)
-		bms_free(slot->pre_detoasted_attrs);
+		bitset_free(slot->pre_detoasted_attrs);
 
 	/*
 	 * Install the new descriptor; if it's refcounted, bump its refcount.
@@ -1357,7 +1357,7 @@ ExecSetSlotDescriptor(TupleTableSlot *slot, /* slot to change */
 		MemoryContextAlloc(slot->tts_mcxt, tupdesc->natts * sizeof(bool));
 
 	old = MemoryContextSwitchTo(slot->tts_mcxt);
-	slot->pre_detoasted_attrs = NULL;
+	slot->pre_detoasted_attrs = bitset_init(tupdesc->natts);
 	MemoryContextSwitchTo(old);
 }
 
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 3951ffc495..8f3eba7fbb 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -138,9 +138,16 @@ typedef struct TupleTableSlot
 	 * TupleTableSlot.tts_mcxt and be clear whenever the tts_values[*] is
 	 * invalidated.
 	 *
+	 * Bitset rather than Bitmapset is chosen here because when all the
+	 * members of Bitmapset are deleted, the allocated memory will be
+	 * deallocated automatically, which is too expensive in this case since we
+	 * need to deleted all the members in each ExecClearTuple and repopulate
+	 * it again when fill the detoast datum to tts_values[*]. This situation
+	 * will be run again and again in an execution cycle.
+	 *
 	 * These values are populated by EEOP_{INNER/OUTER/SCAN}_VAR_TOAST steps.
 	 */
-	Bitmapset	   *pre_detoasted_attrs;
+	Bitset	   *pre_detoasted_attrs;
 } TupleTableSlot;
 
 /* routines for a TupleTableSlot implementation */
@@ -448,12 +455,12 @@ ExecFreePreDetoastDatum(TupleTableSlot *slot)
 	int			attnum;
 
 	attnum = -1;
-	while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
+	while ((attnum = bitset_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
 	{
 		pfree((void *) slot->tts_values[attnum]);
 	}
 
-	slot->pre_detoasted_attrs = bms_del_members(slot->pre_detoasted_attrs, slot->pre_detoasted_attrs);
+	bitset_clear(slot->pre_detoasted_attrs);
 }
 
 
@@ -536,9 +543,9 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 		int			attnum = -1;
 		MemoryContext old = MemoryContextSwitchTo(dstslot->tts_mcxt);
 
-		dstslot->pre_detoasted_attrs = bms_copy(srcslot->pre_detoasted_attrs);
+		dstslot->pre_detoasted_attrs = bitset_copy(srcslot->pre_detoasted_attrs);
 
-		while ((attnum = bms_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
+		while ((attnum = bitset_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
 		{
 			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
 			Size		len;
-- 
2.34.1

v8-0004-Adding-tts_value_mctx-to-TupleTableSlot.patchtext/x-diffDownload
From 83ea86a095bdf6f5c829299906e9f5cfdee9e2b4 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 25 Feb 2024 17:25:36 +0800
Subject: [PATCH v8 4/5] Adding tts_value_mctx to TupleTableSlot

This context is used to hold the detoast datum for now.  With this
change, we don't need to iterate pre_detoast_attrs and pfree each of
them any more during ExecClearTuple, instead we just MemoryContextReset,
which will be more effective. However the cost is we adds 1KB
MemoryContext to each TupleTableSlot unconditionally.

As of now, slot->pre_detoast_attrs is still needed in the ExecCopySlot
case.
---
 src/backend/executor/execExprInterp.c |  2 +-
 src/backend/executor/execTuples.c     | 50 ++++++++++++++++-----------
 src/backend/nodes/bitmapset.c         |  9 -----
 src/include/executor/tuptable.h       | 17 +++++----
 src/include/nodes/bitmapset.h         | 13 ++++++-
 5 files changed, 53 insertions(+), 38 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 878235ae76..13abe54c0b 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -195,7 +195,7 @@ ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
 		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
 	{
 		Datum		oldDatum;
-		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
+		MemoryContext old = MemoryContextSwitchTo(slot->tts_values_mctx);
 
 		oldDatum = slot->tts_values[attnum];
 		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 830e1d4ea6..3a56d1c55d 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -375,6 +375,15 @@ tts_heap_materialize(TupleTableSlot *slot)
 
 	oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
 
+	/*
+	 * tts_values is treated invalidated since tts_nvalid will is set to 0,
+	 * so let's free the pre-detoast datum.
+	 *
+	 * call ExecFreePreDetoastDatum before tts_nvalid is set to 0 is for the
+	 * fast path of ExecFreePreDetoastDatum.
+	 */
+	ExecFreePreDetoastDatum(slot);
+
 	/*
 	 * Have to deform from scratch, otherwise tts_values[] entries could point
 	 * into the non-materialized tuple (which might be gone when accessed).
@@ -399,13 +408,6 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
-
-	/*
-	 * tts_values is treated invalidated since tts_nvalid is set to 0, so
-	 * let's free the pre-detoast datum.
-	 */
-	ExecFreePreDetoastDatum(slot);
-
 }
 
 static void
@@ -463,6 +465,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	tts_heap_clear(slot);
 
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	hslot->tuple = tuple;
 	hslot->off = 0;
@@ -472,8 +477,6 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
-	/* slot_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -552,6 +555,10 @@ tts_minimal_materialize(TupleTableSlot *slot)
 
 	oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
 
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	/*
 	 * Have to deform from scratch, otherwise tts_values[] entries could point
 	 * into the non-materialized tuple (which might be gone when accessed).
@@ -584,9 +591,6 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
-
-	/* slot_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -646,6 +650,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 	Assert(TTS_EMPTY(slot));
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 	slot->tts_nvalid = 0;
 	mslot->off = 0;
 
@@ -658,8 +665,6 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 	if (shouldFree)
 		slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
-	/* tts_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 
@@ -756,6 +761,10 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	 * into the non-materialized tuple (which might be gone when accessed).
 	 */
 	bslot->base.off = 0;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 
 	if (!bslot->base.tuple)
@@ -794,9 +803,6 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
-
-	/* slot_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -896,6 +902,10 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 	}
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	bslot->base.tuple = tuple;
 	bslot->base.off = 0;
@@ -930,9 +940,6 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 		 */
 		ReleaseBuffer(buffer);
 	}
-
-	/* tts_nvalid = 0 */
-	ExecFreePreDetoastDatum(slot);
 }
 
 /*
@@ -1166,6 +1173,9 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 		slot->tts_flags |= TTS_FLAG_FIXED;
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
+	slot->tts_values_mctx = GenerationContextCreate(slot->tts_mcxt,
+													"tts_value_ctx",
+													ALLOCSET_START_SMALL_SIZES);
 	slot->tts_nvalid = 0;
 
 	if (tupleDesc != NULL)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 40cfea2308..3e1ccfc8dd 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1485,15 +1485,6 @@ bitset_init(size_t size)
 	return result;
 }
 
-/*
- * bitset_clear - clear the bits only, but the memory is still there.
- */
-void
-bitset_clear(Bitset *a)
-{
-	if (a != NULL)
-		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
-}
 
 void
 bitset_free(Bitset *a)
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 8f3eba7fbb..53782a993f 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -20,6 +20,7 @@
 #include "access/tupdesc.h"
 #include "nodes/bitmapset.h"
 #include "storage/buf.h"
+#include "utils/memutils.h"
 
 /*----------
  * The executor stores tuples in a "tuple table" which is a List of
@@ -127,6 +128,7 @@ typedef struct TupleTableSlot
 #define FIELDNO_TUPLETABLESLOT_ISNULL 6
 	bool	   *tts_isnull;		/* current per-attribute isnull flags */
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
+	MemoryContext tts_values_mctx; /* reset when tts_values[*] is invalidated. */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
 
@@ -452,14 +454,11 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 static inline void
 ExecFreePreDetoastDatum(TupleTableSlot *slot)
 {
-	int			attnum;
-
-	attnum = -1;
-	while ((attnum = bitset_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
-	{
-		pfree((void *) slot->tts_values[attnum]);
-	}
+	/* We have called this when tts_nvalid is set to 0. */
+	if (slot->tts_nvalid == 0)
+		return;
 
+	MemoryContextResetOnly(slot->tts_values_mctx);
 	bitset_clear(slot->pre_detoasted_attrs);
 }
 
@@ -545,6 +544,10 @@ ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
 
 		dstslot->pre_detoasted_attrs = bitset_copy(srcslot->pre_detoasted_attrs);
 
+		MemoryContextSwitchTo(old);
+
+		old = MemoryContextSwitchTo(dstslot->tts_values_mctx);
+
 		while ((attnum = bitset_next_member(dstslot->pre_detoasted_attrs, attnum)) >= 0)
 		{
 			struct varlena *datum = (struct varlena *) srcslot->tts_values[attnum];
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 95ff37c6e9..e433842217 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -143,7 +143,6 @@ extern uint32 bitmap_hash(const void *key, Size keysize);
 extern int	bitmap_match(const void *key1, const void *key2, Size keysize);
 
 extern Bitset *bitset_init(size_t size);
-extern void bitset_clear(Bitset *a);
 extern void bitset_free(Bitset *a);
 extern bool bitset_is_empty(Bitset *a);
 extern Bitset *bitset_copy(Bitset *a);
@@ -152,4 +151,16 @@ extern void bitset_del_member(Bitset *a, int x);
 extern int	bitset_is_member(int bit, Bitset *a);
 extern int	bitset_next_member(const Bitset *a, int prevbit);
 extern Bitmapset *bitset_to_bitmap(Bitset *a);
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ * this is used in performance critical path, so inline this one specially.
+ */
+inline static void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
 #endif							/* BITMAPSET_H */
-- 
2.34.1

v8-0005-Provide-a-building-option-for-review-purpose.patchtext/x-diffDownload
From 75e03c2042d96a8df0586f27fe141523059d5c85 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Mon, 26 Feb 2024 09:58:37 +0800
Subject: [PATCH v8 5/5] Provide a building option for review purpose.

when DEBUG_PRE_DETOAST_DATUM is defined, which attribute is prepared
detoast in which plan node is logged via INFO.
---
 src/backend/executor/execExpr.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 779fcfaab1..45c2c625b2 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -949,6 +949,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
 											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
 							{
 								scratch.opcode = EEOP_INNER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+								elog(INFO,
+									 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+									 state->flags,
+									 plan->startup_cost,
+									 plan->total_cost,
+									 attnum);
+#endif
 							}
 							else
 							{
@@ -961,6 +969,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
 											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
 							{
 								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+#endif
 							}
 							else
 								scratch.opcode = EEOP_OUTER_VAR;
@@ -974,6 +990,15 @@ ExecInitExprRec(Expr *node, ExprState *state,
 																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
 							{
 								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+#endif
 							}
 							else
 								scratch.opcode = EEOP_SCAN_VAR;
-- 
2.34.1

#16Nikita Malakhov
hukutoc@gmail.com
In reply to: Andy Fan (#15)
Re: Shared detoast Datum proposal

Hi!

I see this to be a very promising initiative, but some issues come into my
mind.
When we store and detoast large values, say, 1Gb - that's a very likely
scenario,
we have such cases from prod systems - we would end up in using a lot of
shared
memory to keep these values alive, only to discard them later. Also,
toasted values
are not always being used immediately and as a whole, i.e. jsonb values are
fully
detoasted (we're working on this right now) to extract the smallest value
from
big json, and these values are not worth keeping in memory. For text values
too,
we often do not need the whole value to be detoasted and kept in memory.

What do you think?

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#17Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#16)
Re: Shared detoast Datum proposal

Hi,

When we store and detoast large values, say, 1Gb - that's a very likely scenario,
we have such cases from prod systems - we would end up in using a lot of shared
memory to keep these values alive, only to discard them later.

First the feature can make sure if the original expression will not
detoast it, this feature will not detoast it as well. Based on this, if
there is a 1Gb datum, in the past, it took 1GB memory as well, but
it may be discarded quicker than the patched version. but patched
version will *not* keep it for too long time since once the
slot->tts_values[*] is invalidated, the memory will be released
immedately.

For example:

create table t(a text, b int);
insert into t select '1-gb-length-text' from generate_series(1, 100);

select * from t where a like '%gb%';

The patched version will take 1gb extra memory only.

Are you worried about this case or some other cases?

Also, toasted values
are not always being used immediately and as a whole, i.e. jsonb values are fully
detoasted (we're working on this right now) to extract the smallest value from
big json, and these values are not worth keeping in memory. For text values too,
we often do not need the whole value to be detoasted.

I'm not sure how often this is true, espeically you metioned text data
type. I can't image why people acquire a piece of text and how can we
design a toasting system to fulfill such case without detoast the whole
as the first step. But for the jsonb or array data type, I think it is
more often. However if we design toasting like that, I'm not sure if it
would slow down other user case. for example detoasting the whole piece
use case. I am justing feeling both way has its user case, kind of heap
store and columna store.

FWIW, as I shown in the test case, this feature is not only helpful for
big datum, it is also helpful for small toast datum.

--
Best Regards
Andy Fan

#18Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#17)
Re: Shared detoast Datum proposal

On 2/26/24 14:22, Andy Fan wrote:

...

Also, toasted values
are not always being used immediately and as a whole, i.e. jsonb values are fully
detoasted (we're working on this right now) to extract the smallest value from
big json, and these values are not worth keeping in memory. For text values too,
we often do not need the whole value to be detoasted.

I'm not sure how often this is true, espeically you metioned text data
type. I can't image why people acquire a piece of text and how can we
design a toasting system to fulfill such case without detoast the whole
as the first step. But for the jsonb or array data type, I think it is
more often. However if we design toasting like that, I'm not sure if it
would slow down other user case. for example detoasting the whole piece
use case. I am justing feeling both way has its user case, kind of heap
store and columna store.

Any substr/starts_with call benefits from this optimization, and such
calls don't seem that uncommon. I certainly can imagine people doing
this fairly often. In any case, it's a long-standing behavior /
optimization, and I don't think we can just dismiss it this quickly.

Is there a way to identify cases that are likely to benefit from this
slicing, and disable the detoasting for them? We already disable it for
other cases, so maybe we can do this here too?

Or maybe there's a way to do the detoasting lazily, and only keep the
detoasted value when it's requesting the whole value. Or perhaps even
better, remember what part we detoasted, and maybe it'll be enough for
future requests?

I'm not sure how difficult would this be with the proposed approach,
which eagerly detoasts the value into tts_values. I think it'd be easier
to implement with the TOAST cache I briefly described ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Nikita Malakhov
hukutoc@gmail.com
In reply to: Tomas Vondra (#18)
Re: Shared detoast Datum proposal

Hi,

Tomas, we already have a working jsonb partial detoast prototype,
and currently I'm porting it to the recent master. Due to the size
of the changes and very invasive nature it takes a lot of effort,
but it is already done. I'm also trying to make the core patch
less invasive. Actually, it is a subject for a separate topic, as soon
as I make it work on the current master we'll propose it
to the community.

Andy, thank you! I'll check the last patch set out and reply in a day or
two.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#20Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#18)
Re: Shared detoast Datum proposal

On 2/26/24 14:22, Andy Fan wrote:

...

Also, toasted values
are not always being used immediately and as a whole, i.e. jsonb values are fully
detoasted (we're working on this right now) to extract the smallest value from
big json, and these values are not worth keeping in memory. For text values too,
we often do not need the whole value to be detoasted.

I'm not sure how often this is true, espeically you metioned text data
type. I can't image why people acquire a piece of text and how can we
design a toasting system to fulfill such case without detoast the whole
as the first step. But for the jsonb or array data type, I think it is
more often. However if we design toasting like that, I'm not sure if it
would slow down other user case. for example detoasting the whole piece
use case. I am justing feeling both way has its user case, kind of heap
store and columna store.

Any substr/starts_with call benefits from this optimization, and such
calls don't seem that uncommon. I certainly can imagine people doing
this fairly often.

This leads me to pay more attention to pg_detoast_datum_slice user case
which I did overlook and caused some regression in this case. Thanks!

As you said later:

Is there a way to identify cases that are likely to benefit from this
slicing, and disable the detoasting for them? We already disable it for
other cases, so maybe we can do this here too?

I think the answer is yes, if our goal is to disable the whole toast for
some speical calls like substr/starts_with. In the current patch, we have:

/*
* increase_level_for_pre_detoast
* Check if the given Expr could detoast a Var directly, if yes,
* increase the level and return true. otherwise return false;
*/
static inline void
increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
{
/* The following nodes is impossible to detoast a Var directly. */
if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
{
ctx->level_added = false;
}
else if (IsA(node, FuncExpr) && castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
{
/* let's not detoast first so that pg_column_compression works. */
ctx->level_added = false;
}

while it works, but the blacklist looks not awesome.

In any case, it's a long-standing behavior /
optimization, and I don't think we can just dismiss it this quickly.

I agree. So I said both have their user case. and when I say the *both*, I
mean the other method is "partial detoast prototype", which Nikita has
told me before. while I am sure it has many user case, I'm also feeling
it is complex and have many questions in my mind, but I'd like to see
the design before asking them.

Or maybe there's a way to do the detoasting lazily, and only keep the
detoasted value when it's requesting the whole value. Or perhaps even
better, remember what part we detoasted, and maybe it'll be enough for
future requests?

I'm not sure how difficult would this be with the proposed approach,
which eagerly detoasts the value into tts_values. I think it'd be
easier to implement with the TOAST cache I briefly described ...

I can understand the benefits of the TOAST cache, but the following
issues looks not good to me IIUC:

1). we can't put the result to tts_values[*] since without the planner
decision, we don't know if this will break small_tlist logic. But if we
put it into the cache only, which means a cache-lookup as a overhead is
unavoidable.

2). It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle. then we can't grantee the entry
we keep is the entry we will reuse soon?

--
Best Regards
Andy Fan

#21Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#19)
Re: Shared detoast Datum proposal

Nikita Malakhov <hukutoc@gmail.com> writes:

Hi,

Tomas, we already have a working jsonb partial detoast prototype,
and currently I'm porting it to the recent master.

This is really awesome! Acutally when I talked to MySQL guys, they said
MySQL already did this and I admit it can resolve some different issues
than the current patch. Just I have many implemetion questions in my
mind.

Andy, thank you! I'll check the last patch set out and reply in a day
or two.

Thank you for your attention!

--
Best Regards
Andy Fan

#22Nikita Malakhov
hukutoc@gmail.com
In reply to: Andy Fan (#21)
Re: Shared detoast Datum proposal

Hi, Andy!

Sorry for the delay, I have had long flights this week.
I've reviewed the patch set, thank you for your efforts.
I have several notes about patch set code, but first of
I'm not sure the overall approach is the best for the task.

As Tomas wrote above, the approach is very invasive
and spreads code related to detoasting among many
parts of code.

Have you considered another one - to alter pg_detoast_datum
(actually, it would be detoast_attr function) and save
detoasted datums in the detoast context derived
from the query context?

We have just enough information at this step to identify
the datum - toast relation id and value id, and could
keep links to these detoasted values in a, say, linked list
or hash table. Thus we would avoid altering the executor
code and all detoast-related code would reside within
the detoast source files?

I'd check this approach in several days and would
report on the result here.

There are also comments on the code itself, I'd write them
a bit later.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#23Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#22)
Re: Shared detoast Datum proposal

Hi Nikita,

Have you considered another one - to alter pg_detoast_datum
(actually, it would be detoast_attr function) and save
detoasted datums in the detoast context derived
from the query context?

We have just enough information at this step to identify
the datum - toast relation id and value id, and could
keep links to these detoasted values in a, say, linked list
or hash table. Thus we would avoid altering the executor
code and all detoast-related code would reside within
the detoast source files?

I think you are talking about the way Tomas provided. I am really
afraid that I was thought of too self-opinionated, but I do have some
concerns about this approch as I stated here [1]/messages/by-id/875xyb1a6q.fsf@163.com, looks my concerns is
still not addressed, or the concerns itself are too absurd which is
really possible I think?

[1]: /messages/by-id/875xyb1a6q.fsf@163.com

--
Best Regards
Andy Fan

#24Andy Fan
zhihuifan1213@163.com
In reply to: Andy Fan (#23)
1 attachment(s)
Re: Shared detoast Datum proposal

Hi,

Here is a updated version, the main changes are:

1. an shared_detoast_datum.org file which shows the latest desgin and
pending items during discussion.
2. I removed the slot->pre_detoast_attrs totally.
3. handle some pg_detoast_datum_slice use case.
4. Some implementation improvement.

commit 66c64c197a5dab97a563be5a291127e4c5d6841d (HEAD -> shared_detoast_value)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date: Sun Mar 3 13:48:25 2024 +0800

shared detoast datum

See the overall design & alternative design & testing in
shared_detoast_datum.org

In the shared_detoast_datum.org, I added the alternative design part for
the idea of TOAST cache.

--
Best Regards
Andy Fan

Attachments:

v9-0001-shared-detoast-datum.patchtext/x-diffDownload
From 66c64c197a5dab97a563be5a291127e4c5d6841d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 3 Mar 2024 13:48:25 +0800
Subject: [PATCH v9 1/1] shared detoast datum

See the overall design & alternative design & testing in
shared_detoast_datum.org
---
 src/backend/access/common/detoast.c           |  68 +-
 src/backend/access/common/toast_compression.c |  10 +-
 src/backend/executor/execExpr.c               |  60 +-
 src/backend/executor/execExprInterp.c         | 179 ++++++
 src/backend/executor/execTuples.c             | 127 ++++
 src/backend/executor/execUtils.c              |   2 +
 src/backend/executor/nodeHashjoin.c           |   2 +
 src/backend/executor/nodeMergejoin.c          |   2 +
 src/backend/executor/nodeNestloop.c           |   1 +
 src/backend/executor/shared_detoast_datum.org | 203 ++++++
 src/backend/jit/llvm/llvmjit_expr.c           |  26 +-
 src/backend/jit/llvm/llvmjit_types.c          |   1 +
 src/backend/optimizer/plan/createplan.c       | 107 +++-
 src/backend/optimizer/plan/setrefs.c          | 590 +++++++++++++++---
 src/include/access/detoast.h                  |   3 +
 src/include/access/toast_compression.h        |   4 +-
 src/include/executor/execExpr.h               |  12 +
 src/include/executor/tuptable.h               |  14 +
 src/include/nodes/execnodes.h                 |  14 +
 src/include/nodes/plannodes.h                 |  53 ++
 src/tools/pgindent/typedefs.list              |   2 +
 21 files changed, 1342 insertions(+), 138 deletions(-)
 create mode 100644 src/backend/executor/shared_detoast_datum.org

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56..acc9644689 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -22,11 +22,11 @@
 #include "utils/expandeddatum.h"
 #include "utils/rel.h"
 
-static struct varlena *toast_fetch_datum(struct varlena *attr);
+static struct varlena *toast_fetch_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 											   int32 sliceoffset,
 											   int32 slicelength);
-static struct varlena *toast_decompress_datum(struct varlena *attr);
+static struct varlena *toast_decompress_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);
 
 /* ----------
@@ -42,7 +42,7 @@ static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32
  * ----------
  */
 struct varlena *
-detoast_external_attr(struct varlena *attr)
+detoast_external_attr_ext(struct varlena *attr, MemoryContext ctx)
 {
 	struct varlena *result;
 
@@ -51,7 +51,7 @@ detoast_external_attr(struct varlena *attr)
 		/*
 		 * This is an external stored plain value
 		 */
-		result = toast_fetch_datum(attr);
+		result = toast_fetch_datum(attr, ctx);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -68,13 +68,13 @@ detoast_external_attr(struct varlena *attr)
 
 		/* recurse if value is still external in some other way */
 		if (VARATT_IS_EXTERNAL(attr))
-			return detoast_external_attr(attr);
+			return detoast_external_attr_ext(attr, ctx);
 
 		/*
 		 * Copy into the caller's memory context, in case caller tries to
 		 * pfree the result.
 		 */
-		result = (struct varlena *) palloc(VARSIZE_ANY(attr));
+		result = (struct varlena *) MemoryContextAlloc(ctx, VARSIZE_ANY(attr));
 		memcpy(result, attr, VARSIZE_ANY(attr));
 	}
 	else if (VARATT_IS_EXTERNAL_EXPANDED(attr))
@@ -87,7 +87,7 @@ detoast_external_attr(struct varlena *attr)
 
 		eoh = DatumGetEOHP(PointerGetDatum(attr));
 		resultsize = EOH_get_flat_size(eoh);
-		result = (struct varlena *) palloc(resultsize);
+		result = (struct varlena *) MemoryContextAlloc(ctx, resultsize);
 		EOH_flatten_into(eoh, (void *) result, resultsize);
 	}
 	else
@@ -101,32 +101,45 @@ detoast_external_attr(struct varlena *attr)
 	return result;
 }
 
+struct varlena *
+detoast_external_attr(struct varlena *attr)
+{
+	return detoast_external_attr_ext(attr, CurrentMemoryContext);
+}
+
 
 /* ----------
- * detoast_attr -
+ * detoast_attr_ext -
  *
  *	Public entry point to get back a toasted value from compression
  *	or external storage.  The result is always non-extended varlena form.
  *
+ * ctx: The memory context which the final value belongs to.
+ *
  * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
  * datum, the result will be a pfree'able chunk.
  * ----------
  */
-struct varlena *
-detoast_attr(struct varlena *attr)
+
+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
 {
 	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		/*
 		 * This is an externally stored datum --- fetch it back from there
 		 */
-		attr = toast_fetch_datum(attr);
+		attr = toast_fetch_datum(attr, ctx);
 		/* If it's compressed, decompress it */
 		if (VARATT_IS_COMPRESSED(attr))
 		{
 			struct varlena *tmp = attr;
 
-			attr = toast_decompress_datum(tmp);
+			attr = toast_decompress_datum(tmp, ctx);
+			/*
+			 * XXX: this pfree block us from using BumpContext directly
+			 * we need some extra effort to make it happen at least.
+			 */
 			pfree(tmp);
 		}
 	}
@@ -144,14 +157,14 @@ detoast_attr(struct varlena *attr)
 		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
 
 		/* recurse in case value is still extended in some other way */
-		attr = detoast_attr(attr);
+		attr = detoast_attr_ext(attr, ctx);
 
 		/* if it isn't, we'd better copy it */
 		if (attr == (struct varlena *) redirect.pointer)
 		{
 			struct varlena *result;
 
-			result = (struct varlena *) palloc(VARSIZE_ANY(attr));
+			result = (struct varlena *) MemoryContextAlloc(ctx, VARSIZE_ANY(attr));
 			memcpy(result, attr, VARSIZE_ANY(attr));
 			attr = result;
 		}
@@ -161,7 +174,7 @@ detoast_attr(struct varlena *attr)
 		/*
 		 * This is an expanded-object pointer --- get flat format
 		 */
-		attr = detoast_external_attr(attr);
+		attr = detoast_external_attr_ext(attr, ctx);
 		/* flatteners are not allowed to produce compressed/short output */
 		Assert(!VARATT_IS_EXTENDED(attr));
 	}
@@ -170,7 +183,7 @@ detoast_attr(struct varlena *attr)
 		/*
 		 * This is a compressed value inside of the main tuple
 		 */
-		attr = toast_decompress_datum(attr);
+		attr = toast_decompress_datum(attr, ctx);
 	}
 	else if (VARATT_IS_SHORT(attr))
 	{
@@ -181,7 +194,7 @@ detoast_attr(struct varlena *attr)
 		Size		new_size = data_size + VARHDRSZ;
 		struct varlena *new_attr;
 
-		new_attr = (struct varlena *) palloc(new_size);
+		new_attr = (struct varlena *) MemoryContextAlloc(ctx, new_size);
 		SET_VARSIZE(new_attr, new_size);
 		memcpy(VARDATA(new_attr), VARDATA_SHORT(attr), data_size);
 		attr = new_attr;
@@ -190,6 +203,11 @@ detoast_attr(struct varlena *attr)
 	return attr;
 }
 
+struct varlena *
+detoast_attr(struct varlena *attr)
+{
+	return detoast_attr_ext(attr, CurrentMemoryContext);
+}
 
 /* ----------
  * detoast_attr_slice -
@@ -262,7 +280,7 @@ detoast_attr_slice(struct varlena *attr,
 			preslice = toast_fetch_datum_slice(attr, 0, max_size);
 		}
 		else
-			preslice = toast_fetch_datum(attr);
+			preslice = toast_fetch_datum(attr, CurrentMemoryContext);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -294,7 +312,7 @@ detoast_attr_slice(struct varlena *attr,
 		if (slicelimit >= 0)
 			preslice = toast_decompress_datum_slice(tmp, slicelimit);
 		else
-			preslice = toast_decompress_datum(tmp);
+			preslice = toast_decompress_datum(tmp, CurrentMemoryContext);
 
 		if (tmp != attr)
 			pfree(tmp);
@@ -340,7 +358,7 @@ detoast_attr_slice(struct varlena *attr,
  * ----------
  */
 static struct varlena *
-toast_fetch_datum(struct varlena *attr)
+toast_fetch_datum(struct varlena *attr, MemoryContext ctx)
 {
 	Relation	toastrel;
 	struct varlena *result;
@@ -355,7 +373,7 @@ toast_fetch_datum(struct varlena *attr)
 
 	attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
 
-	result = (struct varlena *) palloc(attrsize + VARHDRSZ);
+	result = (struct varlena *) MemoryContextAlloc(ctx, attrsize + VARHDRSZ);
 
 	if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
 		SET_VARSIZE_COMPRESSED(result, attrsize + VARHDRSZ);
@@ -468,7 +486,7 @@ toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset,
  * Decompress a compressed version of a varlena datum
  */
 static struct varlena *
-toast_decompress_datum(struct varlena *attr)
+toast_decompress_datum(struct varlena *attr, MemoryContext ctx)
 {
 	ToastCompressionId cmid;
 
@@ -482,9 +500,9 @@ toast_decompress_datum(struct varlena *attr)
 	switch (cmid)
 	{
 		case TOAST_PGLZ_COMPRESSION_ID:
-			return pglz_decompress_datum(attr);
+			return pglz_decompress_datum(attr, ctx);
 		case TOAST_LZ4_COMPRESSION_ID:
-			return lz4_decompress_datum(attr);
+			return lz4_decompress_datum(attr, ctx);
 		default:
 			elog(ERROR, "invalid compression method id %d", cmid);
 			return NULL;		/* keep compiler quiet */
@@ -515,7 +533,7 @@ toast_decompress_datum_slice(struct varlena *attr, int32 slicelength)
 	 * more than the data's true decompressed size.
 	 */
 	if ((uint32) slicelength >= TOAST_COMPRESS_EXTSIZE(attr))
-		return toast_decompress_datum(attr);
+		return toast_decompress_datum(attr, CurrentMemoryContext);
 
 	/*
 	 * Fetch the compression method id stored in the compression header and
diff --git a/src/backend/access/common/toast_compression.c b/src/backend/access/common/toast_compression.c
index 09d05d97c5..323cf013da 100644
--- a/src/backend/access/common/toast_compression.c
+++ b/src/backend/access/common/toast_compression.c
@@ -81,13 +81,13 @@ pglz_compress_datum(const struct varlena *value)
  * Decompress a varlena that was compressed using PGLZ.
  */
 struct varlena *
-pglz_decompress_datum(const struct varlena *value)
+pglz_decompress_datum(const struct varlena *value, MemoryContext ctx)
 {
 	struct varlena *result;
 	int32		rawsize;
 
 	/* allocate memory for the uncompressed data */
-	result = (struct varlena *) palloc(VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
+	result = (struct varlena *) MemoryContextAlloc(ctx, VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
 
 	/* decompress the data */
 	rawsize = pglz_decompress((char *) value + VARHDRSZ_COMPRESSED,
@@ -181,7 +181,7 @@ lz4_compress_datum(const struct varlena *value)
  * Decompress a varlena that was compressed using LZ4.
  */
 struct varlena *
-lz4_decompress_datum(const struct varlena *value)
+lz4_decompress_datum(const struct varlena *value, MemoryContext ctx)
 {
 #ifndef USE_LZ4
 	NO_LZ4_SUPPORT();
@@ -191,7 +191,7 @@ lz4_decompress_datum(const struct varlena *value)
 	struct varlena *result;
 
 	/* allocate memory for the uncompressed data */
-	result = (struct varlena *) palloc(VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
+	result = (struct varlena *) MemoryContextAlloc(ctx, VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
 
 	/* decompress the data */
 	rawsize = LZ4_decompress_safe((char *) value + VARHDRSZ_COMPRESSED,
@@ -225,7 +225,7 @@ lz4_decompress_datum_slice(const struct varlena *value, int32 slicelength)
 
 	/* slice decompression not supported prior to 1.8.3 */
 	if (LZ4_versionNumber() < 10803)
-		return lz4_decompress_datum(value);
+		return lz4_decompress_datum(value, CurrentMemoryContext);
 
 	/* allocate memory for the uncompressed data */
 	result = (struct varlena *) palloc(slicelength + VARHDRSZ);
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 3181b1136a..45c2c625b2 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,76 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+								elog(INFO,
+									 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+									 state->flags,
+									 plan->startup_cost,
+									 plan->total_cost,
+									 attnum);
+#endif
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+#endif
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+#endif
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3f20f1dd31..7ebaca36bb 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -158,6 +159,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -166,6 +170,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -181,6 +188,42 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  AggStatePerGroup pergroup,
 															  ExprContext *aggcontext,
 															  int setno);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		if (unlikely(slot->tts_data_mctx == NULL))
+			slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
+														  "tts_value_ctx",
+														  ALLOCSET_START_SMALL_SIZES);
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr_ext(
+													   (struct varlena *) slot->tts_values[attnum],
+													   /* save the detoast value to the given MemoryContext. */
+													   slot->tts_data_mctx));
+		Assert(slot->tts_nvalid > attnum);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -296,6 +339,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -346,6 +407,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -413,6 +489,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -597,6 +676,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2137,6 +2235,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2275,6 +2409,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b..584f9f1fd1 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -79,6 +79,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+											  TupleDesc tupleDesc,
+											  List *forbid_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -392,6 +395,12 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated invalidated since tts_nvalid will is set to 0,
+	 * so let's free the pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -449,6 +458,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	tts_heap_clear(slot);
 
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	hslot->tuple = tuple;
 	hslot->off = 0;
@@ -535,6 +547,7 @@ tts_minimal_materialize(TupleTableSlot *slot)
 
 	oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
 
+
 	/*
 	 * Have to deform from scratch, otherwise tts_values[] entries could point
 	 * into the non-materialized tuple (which might be gone when accessed).
@@ -567,6 +580,9 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -626,6 +642,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 	Assert(TTS_EMPTY(slot));
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 	slot->tts_nvalid = 0;
 	mslot->off = 0;
 
@@ -733,6 +752,10 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	 * into the non-materialized tuple (which might be gone when accessed).
 	 */
 	bslot->base.off = 0;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 
 	if (!bslot->base.tuple)
@@ -870,6 +893,10 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 	}
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	bslot->base.tuple = tuple;
 	bslot->base.off = 0;
@@ -1137,6 +1164,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 		slot->tts_flags |= TTS_FLAG_FIXED;
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
+	slot->tts_data_mctx = NULL;
 	slot->tts_nvalid = 0;
 
 	if (tupleDesc != NULL)
@@ -1215,6 +1243,8 @@ ExecResetTupleTable(List *tupleTable,	/* tuple table */
 				if (slot->tts_isnull)
 					pfree(slot->tts_isnull);
 			}
+			if (slot->tts_data_mctx != NULL)
+				MemoryContextDelete(slot->tts_data_mctx);
 			pfree(slot);
 		}
 	}
@@ -1265,6 +1295,9 @@ ExecDropSingleTupleTableSlot(TupleTableSlot *slot)
 		if (slot->tts_isnull)
 			pfree(slot->tts_isnull);
 	}
+	if (slot->tts_data_mctx != NULL)
+		MemoryContextDelete(slot->tts_data_mctx);
+
 	pfree(slot);
 }
 
@@ -1810,12 +1843,26 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2336,3 +2383,83 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+/*
+ * cal_final_pre_detoast_attrs
+ *		Calculate the final attributes which pre-detoast be helpful.
+ *
+ * reference_attrs: the attributes which will be detoast at this plan level.
+ * due to the implementation issue, some non-toast attribute may be included
+ * which should be filtered out with tupleDesc.
+ *
+ * forbid_pre_detoast_vars: the vars which should not be pre-detoast as the
+ * small_tlist reason.
+ */
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+							TupleDesc tupleDesc,
+							List *forbid_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(reference_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * reference_attrs may have some attribute which is not toast attrs at
+	 * all, which should be removed.
+	 */
+	for (i = 0; i < tupleDesc->natts; i++)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	/* Filter out the non-toastable attributes. */
+	final = bms_intersect(reference_attrs, toast_attrs);
+
+	/*
+	 * Due to the fact of detoast-datum will make the tuple bigger which is
+	 * bad for some nodes like Sort/Hash, to avoid performance regression,
+	 * such attribute should be removed as well.
+	 */
+	foreach(lc, forbid_pre_detoast_vars)
+	{
+		Var		   *var = lfirst_node(Var, lc);
+
+		forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs, var->varattno - 1);
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index cff5dc723e..a8646ded02 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -572,6 +572,8 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 1cbec4647c..19a05ed624 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -756,6 +756,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index c1a8ca2464..be7cbd7f30 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1497,6 +1497,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 06fa0a9b31..2d40d19192 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -306,6 +306,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/executor/shared_detoast_datum.org b/src/backend/executor/shared_detoast_datum.org
new file mode 100644
index 0000000000..f63392ef4f
--- /dev/null
+++ b/src/backend/executor/shared_detoast_datum.org
@@ -0,0 +1,203 @@
+The problem:
+-------------
+
+In the current expression engine, a toasted datum is detoasted when
+required, but the result is discarded immediately, either by pfree it
+immediately or leave it for ResetExprContext. Arguments for which one to
+use exists sometimes. More serious problem is detoasting is expensive,
+especially for the data types like jsonb or array, which the value might
+be very huge. In the blow example, the detoasting happens twice.
+
+SELECT jb_col->'a', jb_col->'b' FROM t;
+
+Within the shared-detoast-datum, we just need to detoast once for each
+tuple, and discard it immediately when the tuple is not needed any
+more. FWIW this issue may existing for small numeric, text as well
+because of SHORT_TOAST feature where the toast's len using 1 byte rather
+than 4 bytes.
+
+Current Design
+--------------
+
+The high level design is let createplan.c and setref.c decide which
+Vars can use this feature, and then the executor save the detoast
+datum back slot->to tts_values[*] during the ExprEvalStep of
+EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons includes:
+
+- The existing expression engine read datum from tts_values[*], no any
+  extra work need to be done.
+- Reuse the lifespan of TupleTableSlot system to manage memory. It
+  is natural to think the detoast datum is a tts_value just that it is
+  in a detoast format. Since we have a clear lifespan for TupleTableSlot
+  already, like ExecClearTuple, ExecCopySlot et. We are easy to reuse
+  them for managing the datoast datum's memory.
+- The existing projection method can copy the datoasted datum (int64)
+  automatically to the next node's slot, but keeping the ownership
+  unchanged, so only the slot where the detoast really happen take the
+  charge of it's lifespan.
+
+Assuming which Var should use this feature has been decided in
+createplan.c and setref.c already. The 3 new ExprEvalSteps
+EEOP_{INNER,OUTER,SCAN}_VAR_TOAST as used. During the evaluating these
+steps, the below code is used.
+
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+    if (!slot->tts_isnull[attnum] &&
+        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+    {
+        if (unlikely(slot->tts_data_mctx == NULL))
+            slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
+                                     "tts_value_ctx",
+                                    ALLOCSET_START_SMALL_SIZES);
+        slot->tts_values[attnum] = PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
+                                                                 /* save the detoast value to the given MemoryContext. */
+                                        slot->tts_data_mctx));
+    }
+}
+
+Since I don't want to the run-time extra check to see if is a detoast
+should happen, so introducing 3 new steps.
+
+When to free the detoast datum? It depends on when the slot's
+tts_values[*] is invalidated, ExecClearTuple is the clear one, but any
+TupleTableSlotOps which set the tts_nvalid = 0 tells us no one will use
+the datum in tts_values[*] so it is time to release them, this is an
+important part for memory usage consideration.  since we used
+dedicated MemoryContext for it, so what we just need to do it:
+
+/*
+ * ExecFreePreDetoastDatum - free the memory which is allocated in tts_data_mcxt.
+ */
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+    if (slot->tts_data_mctx)
+        MemoryContextResetOnly(slot->tts_data_mctx);
+}
+
+
+Now comes to the createplan.c/setref.c part, which decides which Vars
+should use the shared detoast feature. The guideline of this is:
+
+1. It needs a detoast for a given expression in the previous logic.
+2. It should not breaks the CP_SMALL_TLIST design. Since we saved the
+   detoast datum back to tts_values[*], which make tuple bigger. if we
+   do this blindly, it would be harmful to the ORDER / HASH style nodes.
+
+A high level data flow is:
+
+1. at the createplan.c, we walk the plan tree go gather the
+   CP_SMALL_TLIST because of SORT/HASH style nodes, information and save
+   it to Plan.forbid_pre_detoast_vars via the function
+   set_plan_forbid_pre_detoast_vars_recurse.
+
+2. at the setrefs.c, fix_{scan|join}_expr will recurse to Var for each
+   expression, so it is a good time to track the attribute number and
+   see if the Var is directly or indirectly accessed. Usually the
+   indirectly access a Var means a detoast would happens, for
+   example an expression like a > 3. However some known expressions is
+   ignored. for example: NullTest, pg_column_compression which needs the
+   raw datum, start_with/sub_string which needs a slice
+   detoasting. Currently there is some hard code here, we may needs a
+   pg_proc.detoasting_requirement flags to make this generic.  The
+   output is {Scan|Join}.xxx_reference_attrs;
+
+Note that here I used '_reference_' rather than '_detoast_' is because
+at this part, I still don't know if it is a toastable attrbiute, which
+is known at the MakeTupleTableSlot stage.
+
+3. At the InitPlan Stage, we calculate the final xxx_pre_detoast_attrs
+   in ScanState & JoinState, which will be passed into expression
+   engine in the ExecInitExprRec stage and EEOP_{INNER|OUTER|SCAN}
+   _VAR_TOAST steps are generated finally then everything is connected
+   with ExecSlotDetoastDatum!
+
+
+Testing
+-------
+
+Case 1: small numeric testing.
+===============================
+
+create table t (a numeric);
+insert into t select i from generate_series(1, 100000)i;
+
+cat 1.sql
+
+select * from t where a > 0;
+
+In this test, the current master run detoast twice for each datum. one
+in numeric_gt,  one in numeric_out.  this feature makes the detoast once.
+
+pgbench -f 1.sql -n postgres -T 10 -M prepared
+
+master: 30.218 ms
+patched: 26.957 ms
+
+
+Case 2: Big jsonbs test:
+=============================
+
+
+create table b(blog jsonb);
+
+INSERT INTO b
+SELECT jsonb_build_object(
+        'title', 'title ' || s.i::text,
+        'content', substring(repeat(md5(random()::text), 100), 1, 3000),
+        'subscriber', (random() * 100)::int,
+        'reader', (random() * 100)::int
+    )
+FROM generate_series(1, 10000) s(i);
+
+
+explain analyze
+select blog from b
+where cast(blog->'reader' as numeric) > 10 and
+cast(blog->'subscriber' as numeric) > 0;
+
+Dump and restore the above data into the current master:
+
+master: 24.588 ms
+patched: 17.664 ms
+
+Memory usage test:
+
+I run the workload of tpch scale 10 on against both master and patched
+versions, the memory usage looks stable and the performance doesn't have
+noticeable improvement and regression as well.
+
+A alternative design: toast cache
+---------------------------------
+
+This method is provided by Tomas during the review process. IIUC, this
+method would maintain a local HTAB which map a toast datum to a detoast
+datum and the entry is maintained / used in detoast_attr
+function. Within this method, the overall design is pretty clear and the
+code modification can be controlled in toasting system only.
+
+I assumed that releasing all of the memory at the end of executor once
+is not an option since it may consumed too many memory. Then, when and
+which entry to release becomes a trouble for me. For example:
+
+          QUERY PLAN
+------------------------------
+ Nested Loop
+   Join Filter: (t1.a = t2.a)
+   ->  Seq Scan on t1
+   ->  Seq Scan on t2
+(4 rows)
+
+In this case t1.a needs a longer lifespan than t2.a since it is
+in outer relation. Without the help from slot's life-cycle system, I
+can't think out a answer for the above question.
+
+Another difference between the 2 methods is my method have many
+modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
+it can save some run-time effort like hash_search find / enter run-time
+in method 2 since I put them directly into tts_values[*].
+
+I'm not sure the factor 2 makes some real measurable difference in real
+case, so my current concern mainly comes from factor 1.
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0c448422e2..74563c3454 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 47c9daf402..1dcf0c2fd8 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -178,4 +178,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 610f4a56d6..8acb48240e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -314,7 +314,9 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 List *mergeActionLists, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan,
+													 List *small_tlist);
+static void set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist);
 
 /*
  * create_plan
@@ -346,6 +348,12 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	/*
+	 * After the plan tree is built completed, we start to walk for which
+	 * expressions should not used the shared-detoast feature.
+	 */
+	set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -378,6 +386,101 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_forbid_pre_detoast_vars_recurse
+ *	 Walking the Plan tree in the top-down manner to gather the vars which
+ * should be as small as possible and record them in Plan.forbid_pre_detoast_vars
+ *
+ * plan: the plan node to walk right now.
+ * small_tlist: a list of nodes which its subplan should provide them as
+ * small as possible.
+ */
+static void
+set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, small_tlist);
+
+	/* Recurse to its subplan.. */
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * For the sort-like nodes, we want the output of its subplan as small
+		 * as possible, but the subplan's other expressions like Qual doesn't
+		 * have this restriction since they are not output to the upper nodes.
+		 * so we set the small_tlist to the subplan->targetlist.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * If the left_small_tlist wants a as small as possible tlist, set it
+		 * in a way like sort for the left node.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, small_tlist);
+
+		/*
+		 * The righttree is a Hash node, it can be set with its own rule, so
+		 * the small_tlist provided is not important, we just need to recuse
+		 * to its subplan.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		/*
+		 * Recurse to its children, just push down the forbid_pre_detoast_vars
+		 * to its children.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	Set the Plan.forbid_pre_detoast_vars according the small_tlist information.
+ *
+ * small_tlist = NIL means nothing is forbidden, or else if a Var belongs to the
+ * small_tlist, then it must not be pre-detoasted.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	/*
+	 * fast path, if we don't have a small_tlist, the var in targetlist is
+	 * impossible member of it. and this case might be a pretty common case.
+	 */
+	if (small_tlist == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(small_tlist, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4893,6 +4996,8 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 22a1fa29f3..9b9e2b4345 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,7 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
@@ -55,11 +56,48 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+/*
+ * Decide which attrs are detoasted in a expressions level, this is judged
+ * at the fix_scan/join_expr stage. The recursed level is tracked when we
+ * walk to a Var, if the level is greater than 1, then it means the
+ * var needs an detoast in this expression list, there are some exceptions
+ * here, see increase_level_for_pre_detoast for details.
+ */
+typedef struct
+{
+	/* if the level is added during a certain walk. */
+	bool		level_added;
+	/* the current level during the walk. */
+	int			level;
+} intermediate_level_context;
+
+/*
+ * Context to hold the detoast attribute within a expression.
+ *
+ * XXX: this design was intent to avoid the pre-detoast-logic if the var
+ * only need to be detoasted *once*, but for now, this context is only
+ * maintained at the expression level rather than plan tree level, so it
+ * can't detect if a Var will be detoasted 2+ time at the plan level.
+ * Recording the times of a Var is detoasted in the plan tree level is
+ * complex, so before we decide it is a must, I am not willing to do too
+ * many changes here.
+ */
+typedef struct
+{
+	/* var is accessed for the first time. */
+	Bitmapset  *existing_attrs;
+	/* var is accessed for the 2+ times. */
+	Bitmapset **final_ref_attrs;
+} intermediate_var_ref_context;
+
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -71,6 +109,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -127,8 +168,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -158,7 +199,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -190,7 +232,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -211,6 +256,38 @@ static List *set_windowagg_runcondition_references(PlannerInfo *root,
 												   List *runcondition,
 												   Plan *plan);
 
+/*
+ * func_use_slice_detoast
+ *   Check if a func just needs a pg_detoast_datum_slice, if so we should
+ * not pre detoast it. For now, the known function ID is hard-coded. but
+ * it'd be good that pg_proc can have a attribute like 'detoast_requirement'
+ * the value can be either of:
+ * - full
+ * - first_n (0 ~ N, the current slice method).
+ * - any. (for the incoming of partial detoast feature.
+ *
+ * I think adding this attribute to pg_proc has a stronger reason if partial
+ * detoast patch is accepted.
+ */
+static inline bool
+func_use_slice_detoast(Oid funcOid)
+{
+	/* hard code for now, and it is not used in a hot path yet. */
+	const Oid oids[] = {F_STARTS_WITH,
+
+						F_OVERLAY_BYTEA_BYTEA_INT4, F_OVERLAY_BYTEA_BYTEA_INT4_INT4,
+						F_OVERLAY_TEXT_TEXT_INT4, F_OVERLAY_TEXT_TEXT_INT4_INT4,
+
+						F_SUBSTRING_BYTEA_INT4, F_SUBSTRING_BYTEA_INT4_INT4,
+						F_SUBSTRING_TEXT_INT4, F_SUBSTRING_TEXT_INT4_INT4};
+
+	for (int i = 0; i < sizeof(oids) /sizeof(Oid); i++)
+	{
+		if (funcOid == oids[i])
+			return true;
+	}
+	return false;
+}
 
 /*****************************************************************************
  *
@@ -628,10 +705,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -641,13 +724,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -657,28 +747,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -691,10 +793,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -704,13 +811,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -720,13 +834,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -736,13 +857,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
@@ -757,12 +885,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -772,13 +904,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -788,13 +924,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -804,10 +943,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -817,10 +962,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -830,10 +977,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -873,7 +1022,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -933,9 +1083,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -988,17 +1138,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1038,14 +1188,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1061,7 +1211,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1118,18 +1268,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1182,7 +1334,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1191,7 +1344,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 					}
 				}
@@ -1356,13 +1510,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1409,10 +1566,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1612,7 +1769,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1622,16 +1779,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1690,20 +1847,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2111,6 +2268,102 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context *ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context *ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr))
+	{
+		Oid funcOid = castNode(FuncExpr, node)->funcid;
+
+		if (funcOid == F_PG_COLUMN_COMPRESSION || func_use_slice_detoast(funcOid))
+			ctx->level_added =  false;
+		else
+		{
+			ctx->level_added = true;
+			ctx->level += 1;
+		}
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context *ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context *level_ctx,
+					 intermediate_var_ref_context *ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 || ctx->final_ref_attrs == NULL || var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	if (bms_is_member(attno, ctx->existing_attrs))
+	{
+		/* not the first time to access it, add it to final result. */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+	else
+	{
+		/* first time. */
+		ctx->existing_attrs = bms_add_member(ctx->existing_attrs, attno);
+
+		/*
+		 * XXX:
+		 *
+		 * The above strategy doesn't help to detect if a Var is detoast
+		 * twice. Reasons are: 1. the context is not maintain in Plan node
+		 * level. so if it is detoast at targetlist and qual, we can't detect
+		 * it. 2. even we can make it at plan node, it still doesn't help for
+		 * the among-nodes case.
+		 *
+		 * So for now, I just disable it.
+		 */
+		*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+	}
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2125,18 +2378,23 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
  * 'num_exec': estimated number of executions of expression
+ * 'scan_reference_attrs': gather which vars are potential to run the detoast
+ * 	on this expr, NULL means the caller doesn't have interests on this.
  *
  * The expression tree is either copied-and-modified, or modified in-place
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs, scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2167,8 +2425,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2186,10 +2449,16 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx, &context->scan_reference_attrs, var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2199,8 +2468,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2210,6 +2481,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2218,29 +2490,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2276,7 +2571,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2323,7 +2621,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2336,7 +2636,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2368,7 +2670,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2376,8 +2680,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3010,9 +3326,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3021,16 +3340,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3044,7 +3377,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3056,7 +3395,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3066,6 +3411,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3084,22 +3432,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3107,7 +3471,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3115,20 +3485,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3163,7 +3549,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3318,7 +3705,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 12d8cdb356..9ddc05604e 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -42,6 +42,7 @@ do { \
  * ----------
  */
 extern struct varlena *detoast_external_attr(struct varlena *attr);
+extern struct varlena *detoast_external_attr_ext(struct varlena *attr, MemoryContext ctx);
 
 /* ----------
  * detoast_attr() -
@@ -51,6 +52,8 @@ extern struct varlena *detoast_external_attr(struct varlena *attr);
  * ----------
  */
 extern struct varlena *detoast_attr(struct varlena *attr);
+extern struct varlena *detoast_attr_ext(struct varlena *attr, MemoryContext ctx);
+
 
 /* ----------
  * detoast_attr_slice() -
diff --git a/src/include/access/toast_compression.h b/src/include/access/toast_compression.h
index 64d5e079fa..00ea153e4e 100644
--- a/src/include/access/toast_compression.h
+++ b/src/include/access/toast_compression.h
@@ -55,13 +55,13 @@ typedef enum ToastCompressionId
 
 /* pglz compression/decompression routines */
 extern struct varlena *pglz_compress_datum(const struct varlena *value);
-extern struct varlena *pglz_decompress_datum(const struct varlena *value);
+extern struct varlena *pglz_decompress_datum(const struct varlena *value, MemoryContext ctx);
 extern struct varlena *pglz_decompress_datum_slice(const struct varlena *value,
 												   int32 slicelength);
 
 /* lz4 compression/decompression routines */
 extern struct varlena *lz4_compress_datum(const struct varlena *value);
-extern struct varlena *lz4_decompress_datum(const struct varlena *value);
+extern struct varlena *lz4_decompress_datum(const struct varlena *value, MemoryContext ctx);
 extern struct varlena *lz4_decompress_datum_slice(const struct varlena *value,
 												  int32 slicelength);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index a28ddcdd77..9304786bb2 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,17 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/*
+	 * compute non-system Var value with shared-detoast-datum logic, use some
+	 * dedicated steps rather than add extra logic to existing steps is for
+	 * performance aspect, within this way, we just decide if the extra logic
+	 * is needed at ExecInitExpr stage once rather than every time of
+	 * ExecInterpExpr.
+	 */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -830,5 +841,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a..9edb843b15 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -19,6 +19,7 @@
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
 #include "storage/buf.h"
+#include "utils/memutils.h"
 
 /*----------
  * The executor stores tuples in a "tuple table" which is a List of
@@ -126,6 +127,7 @@ typedef struct TupleTableSlot
 #define FIELDNO_TUPLETABLESLOT_ISNULL 6
 	bool	   *tts_isnull;		/* current per-attribute isnull flags */
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
+	MemoryContext tts_data_mctx; /* The external content of tts_values[*] */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
 } TupleTableSlot;
@@ -426,12 +428,24 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * ExecFreePreDetoastDatum - free the memory which is allocated in tts_data_mcxt.
+ */
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	if (slot->tts_data_mctx)
+		MemoryContextResetOnly(slot->tts_data_mctx);
+}
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..30fdb37d1c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1481,6 +1481,12 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the Scan nodes.
+	 */
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2010,6 +2016,13 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the join nodes.
+	 */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2771,4 +2784,5 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c..ea5033aaa0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash wants them as small as possible.
+	 * It's a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -385,6 +392,16 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -789,6 +806,17 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -869,6 +897,11 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * Whether the left plan tree should use a SMALL_TLIST.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1588,4 +1621,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ee40a341d3..2335000e18 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4048,6 +4048,8 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+intermediate_var_ref_context
+intermediate_level_context
 manifest_data
 manifest_writer
 rfile
-- 
2.34.1

#25Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#23)
Re: Shared detoast Datum proposal

On 3/3/24 02:52, Andy Fan wrote:

Hi Nikita,

Have you considered another one - to alter pg_detoast_datum
(actually, it would be detoast_attr function) and save
detoasted datums in the detoast context derived
from the query context?

We have just enough information at this step to identify
the datum - toast relation id and value id, and could
keep links to these detoasted values in a, say, linked list
or hash table. Thus we would avoid altering the executor
code and all detoast-related code would reside within
the detoast source files?

I think you are talking about the way Tomas provided. I am really
afraid that I was thought of too self-opinionated, but I do have some
concerns about this approch as I stated here [1], looks my concerns is
still not addressed, or the concerns itself are too absurd which is
really possible I think?

I'm not sure I understand your concerns. I can't speak for others, but I
did not consider you and your proposals too self-opinionated. You did
propose a solution that you consider the right one. That's fine. People
will question that and suggest possible alternatives. That's fine too,
it's why we have this list in the first place.

FWIW I'm not somehow sure the approach I suggested is guaranteed to be
better than "your" approach. Maybe there's some issue that I missed, for
example.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#20)
Re: Shared detoast Datum proposal

On 2/26/24 16:29, Andy Fan wrote:

...>
I can understand the benefits of the TOAST cache, but the following
issues looks not good to me IIUC:

1). we can't put the result to tts_values[*] since without the planner
decision, we don't know if this will break small_tlist logic. But if we
put it into the cache only, which means a cache-lookup as a overhead is
unavoidable.

True - if you're comparing having the detoasted value in tts_values[*]
directly with having to do a lookup in a cache, then yes, there's a bit
of an overhead.

But I think from the discussion it's clear having to detoast the value
into tts_values[*] has some weaknesses too, in particular:

- It requires decisions which attributes to detoast eagerly, which is
quite invasive (having to walk the plan, ...).

- I'm sure there will be cases where we choose to not detoast, but it
would be beneficial to detoast.

- Detoasting just the initial slices does not seem compatible with this.

IMHO the overhead of the cache lookup would be negligible compared to
the repeated detoasting of the value (which is the current baseline). I
somewhat doubt the difference compared to tts_values[*] will be even
measurable.

2). It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle. then we can't grantee the entry
we keep is the entry we will reuse soon?

True. But is that really a problem? I imagined we'd set some sort of
memory limit on the cache (work_mem?), and evict oldest entries. So the
entries would eventually get evicted, and the memory limit would ensure
we don't consume arbitrary amounts of memory.

We could also add some "slot callback" but that seems unnecessary.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#24)
Re: Shared detoast Datum proposal

On 3/3/24 07:10, Andy Fan wrote:

Hi,

Here is a updated version, the main changes are:

1. an shared_detoast_datum.org file which shows the latest desgin and
pending items during discussion.
2. I removed the slot->pre_detoast_attrs totally.
3. handle some pg_detoast_datum_slice use case.
4. Some implementation improvement.

I only very briefly skimmed the patch, and I guess most of my earlier
comments still apply. But I'm a bit surprised the patch needs to pass a
MemoryContext to so many places as a function argument - that seems to
go against how we work with memory contexts. Doubly so because it seems
to only ever pass CurrentMemoryContext anyway. So what's that about?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#25)
Re: Shared detoast Datum proposal

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 3/3/24 02:52, Andy Fan wrote:

Hi Nikita,

Have you considered another one - to alter pg_detoast_datum
(actually, it would be detoast_attr function) and save
detoasted datums in the detoast context derived
from the query context?

We have just enough information at this step to identify
the datum - toast relation id and value id, and could
keep links to these detoasted values in a, say, linked list
or hash table. Thus we would avoid altering the executor
code and all detoast-related code would reside within
the detoast source files?

I think you are talking about the way Tomas provided. I am really
afraid that I was thought of too self-opinionated, but I do have some
concerns about this approch as I stated here [1], looks my concerns is
still not addressed, or the concerns itself are too absurd which is
really possible I think?

I can't speak for others, but I did not consider you and your
proposals too self-opinionated.

This would be really great to know:)

You did
propose a solution that you consider the right one. That's fine. People
will question that and suggest possible alternatives. That's fine too,
it's why we have this list in the first place.

I agree.

--
Best Regards
Andy Fan

#29Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#26)
Re: Shared detoast Datum proposal

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 2/26/24 16:29, Andy Fan wrote:

...>
I can understand the benefits of the TOAST cache, but the following
issues looks not good to me IIUC:

1). we can't put the result to tts_values[*] since without the planner
decision, we don't know if this will break small_tlist logic. But if we
put it into the cache only, which means a cache-lookup as a overhead is
unavoidable.

True - if you're comparing having the detoasted value in tts_values[*]
directly with having to do a lookup in a cache, then yes, there's a bit
of an overhead.

But I think from the discussion it's clear having to detoast the value
into tts_values[*] has some weaknesses too, in particular:

- It requires decisions which attributes to detoast eagerly, which is
quite invasive (having to walk the plan, ...).

- I'm sure there will be cases where we choose to not detoast, but it
would be beneficial to detoast.

- Detoasting just the initial slices does not seem compatible with this.

IMHO the overhead of the cache lookup would be negligible compared to
the repeated detoasting of the value (which is the current baseline). I
somewhat doubt the difference compared to tts_values[*] will be even
measurable.

2). It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle. then we can't grantee the entry
we keep is the entry we will reuse soon?

True. But is that really a problem? I imagined we'd set some sort of
memory limit on the cache (work_mem?), and evict oldest entries. So the
entries would eventually get evicted, and the memory limit would ensure
we don't consume arbitrary amounts of memory.

Here is a copy from the shared_detoast_datum.org in the patch. I am
feeling about when / which entry to free is a key problem and run-time
(detoast_attr) overhead vs createplan.c overhead is a small difference
as well. the overhead I paid for createplan.c/setref.c looks not huge as
well.

"""
A alternative design: toast cache
---------------------------------

This method is provided by Tomas during the review process. IIUC, this
method would maintain a local HTAB which map a toast datum to a detoast
datum and the entry is maintained / used in detoast_attr
function. Within this method, the overall design is pretty clear and the
code modification can be controlled in toasting system only.

I assumed that releasing all of the memory at the end of executor once
is not an option since it may consumed too many memory. Then, when and
which entry to release becomes a trouble for me. For example:

QUERY PLAN
------------------------------
Nested Loop
Join Filter: (t1.a = t2.a)
-> Seq Scan on t1
-> Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is
in outer relation. Without the help from slot's life-cycle system, I
can't think out a answer for the above question.

Another difference between the 2 methods is my method have many
modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
it can save some run-time effort like hash_search find / enter run-time
in method 2 since I put them directly into tts_values[*].

I'm not sure the factor 2 makes some real measurable difference in real
case, so my current concern mainly comes from factor 1.
"""

--
Best Regards
Andy Fan

#30Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#27)
Re: Shared detoast Datum proposal

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 3/3/24 07:10, Andy Fan wrote:

Hi,

Here is a updated version, the main changes are:

1. an shared_detoast_datum.org file which shows the latest desgin and
pending items during discussion.
2. I removed the slot->pre_detoast_attrs totally.
3. handle some pg_detoast_datum_slice use case.
4. Some implementation improvement.

I only very briefly skimmed the patch, and I guess most of my earlier
comments still apply.

Yes, the overall design is not changed.

But I'm a bit surprised the patch needs to pass a
MemoryContext to so many places as a function argument - that seems to
go against how we work with memory contexts. Doubly so because it seems
to only ever pass CurrentMemoryContext anyway. So what's that about?

I think you are talking about the argument like this:

 /* ----------
- * detoast_attr -
+ * detoast_attr_ext -
  *
  *	Public entry point to get back a toasted value from compression
  *	or external storage.  The result is always non-extended varlena form.
  *
+ * ctx: The memory context which the final value belongs to.
+ *
  * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
  * datum, the result will be a pfree'able chunk.
  * ----------
  */
+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)

This is mainly because 'detoast_attr' will apply more memory than the
"final detoast datum" , for example the code to scan toast relation
needs some memory as well, and what I want is just keeping the memory
for the final detoast datum so that other memory can be released sooner,
so I added the function argument for that.

--
Best Regards
Andy Fan

#31Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#30)
Re: Shared detoast Datum proposal

On 3/4/24 02:29, Andy Fan wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 3/3/24 07:10, Andy Fan wrote:

Hi,

Here is a updated version, the main changes are:

1. an shared_detoast_datum.org file which shows the latest desgin and
pending items during discussion.
2. I removed the slot->pre_detoast_attrs totally.
3. handle some pg_detoast_datum_slice use case.
4. Some implementation improvement.

I only very briefly skimmed the patch, and I guess most of my earlier
comments still apply.

Yes, the overall design is not changed.

But I'm a bit surprised the patch needs to pass a
MemoryContext to so many places as a function argument - that seems to
go against how we work with memory contexts. Doubly so because it seems
to only ever pass CurrentMemoryContext anyway. So what's that about?

I think you are talking about the argument like this:

/* ----------
- * detoast_attr -
+ * detoast_attr_ext -
*
*	Public entry point to get back a toasted value from compression
*	or external storage.  The result is always non-extended varlena form.
*
+ * ctx: The memory context which the final value belongs to.
+ *
* Note some callers assume that if the input is an EXTERNAL or COMPRESSED
* datum, the result will be a pfree'able chunk.
* ----------
*/
+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)

This is mainly because 'detoast_attr' will apply more memory than the
"final detoast datum" , for example the code to scan toast relation
needs some memory as well, and what I want is just keeping the memory
for the final detoast datum so that other memory can be released sooner,
so I added the function argument for that.

What exactly does detoast_attr allocate in order to scan toast relation?
Does that happen in master, or just with the patch? If with master, I
suggest to ignore that / treat that as a separate issue and leave it for
a different patch.

In any case, the custom is to allocate results in the context that is
set in CurrentMemoryContext at the moment of the call, and if there's
substantial amount of allocations that we want to free soon, we either
do that by explicit pfree() calls, or create a small temporary context
in the function (but that's more expensive).

I don't think we should invent a new convention where the context is
passed as an argument, unless absolutely necessary.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#31)
Re: Shared detoast Datum proposal

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

But I'm a bit surprised the patch needs to pass a
MemoryContext to so many places as a function argument - that seems to
go against how we work with memory contexts. Doubly so because it seems
to only ever pass CurrentMemoryContext anyway. So what's that about?

I think you are talking about the argument like this:

/* ----------
- * detoast_attr -
+ * detoast_attr_ext -
*
*	Public entry point to get back a toasted value from compression
*	or external storage.  The result is always non-extended varlena form.
*
+ * ctx: The memory context which the final value belongs to.
+ *
* Note some callers assume that if the input is an EXTERNAL or COMPRESSED
* datum, the result will be a pfree'able chunk.
* ----------
*/
+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)

This is mainly because 'detoast_attr' will apply more memory than the
"final detoast datum" , for example the code to scan toast relation
needs some memory as well, and what I want is just keeping the memory
for the final detoast datum so that other memory can be released sooner,
so I added the function argument for that.

What exactly does detoast_attr allocate in order to scan toast relation?
Does that happen in master, or just with the patch?

It is in the current master, for example the TupleTableSlot creation
needed by scanning the toast relation needs a memory allocating.

If with master, I
suggest to ignore that / treat that as a separate issue and leave it for
a different patch.

OK, I can make it as a seperate commit in the next version.

In any case, the custom is to allocate results in the context that is
set in CurrentMemoryContext at the moment of the call, and if there's
substantial amount of allocations that we want to free soon, we either
do that by explicit pfree() calls, or create a small temporary context
in the function (but that's more expensive).

I don't think we should invent a new convention where the context is
passed as an argument, unless absolutely necessary.

Hmm, in this case, if we don't add the new argument, we have to get the
detoast datum in Context-1 and copy it to the desired memory context,
which is the thing I want to avoid. I think we have a same decision to
make on the TOAST cache method as well.

--
Best Regards
Andy Fan

#33Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#29)
2 attachment(s)
Re: Shared detoast Datum proposal

On 3/4/24 02:23, Andy Fan wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 2/26/24 16:29, Andy Fan wrote:

...>
I can understand the benefits of the TOAST cache, but the following
issues looks not good to me IIUC:

1). we can't put the result to tts_values[*] since without the planner
decision, we don't know if this will break small_tlist logic. But if we
put it into the cache only, which means a cache-lookup as a overhead is
unavoidable.

True - if you're comparing having the detoasted value in tts_values[*]
directly with having to do a lookup in a cache, then yes, there's a bit
of an overhead.

But I think from the discussion it's clear having to detoast the value
into tts_values[*] has some weaknesses too, in particular:

- It requires decisions which attributes to detoast eagerly, which is
quite invasive (having to walk the plan, ...).

- I'm sure there will be cases where we choose to not detoast, but it
would be beneficial to detoast.

- Detoasting just the initial slices does not seem compatible with this.

IMHO the overhead of the cache lookup would be negligible compared to
the repeated detoasting of the value (which is the current baseline). I
somewhat doubt the difference compared to tts_values[*] will be even
measurable.

2). It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle. then we can't grantee the entry
we keep is the entry we will reuse soon?

True. But is that really a problem? I imagined we'd set some sort of
memory limit on the cache (work_mem?), and evict oldest entries. So the
entries would eventually get evicted, and the memory limit would ensure
we don't consume arbitrary amounts of memory.

Here is a copy from the shared_detoast_datum.org in the patch. I am
feeling about when / which entry to free is a key problem and run-time
(detoast_attr) overhead vs createplan.c overhead is a small difference
as well. the overhead I paid for createplan.c/setref.c looks not huge as
well.

I decided to whip up a PoC patch implementing this approach, to get a
better idea of how it compares to the original proposal, both in terms
of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but
differ in when exactly the caching happens.

toast-cache-v1 - caching happens in toast_fetch_datum, which means this
happens before decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

I started with v1, but that did not help with the example workloads
(from the original post) at all. Only after I changed this to cache
decompressed data (in v2) it became competitive.

There's GUC to enable the cache (enable_toast_cache, off by default), to
make experimentation easier.

The cache is invalidated at the end of a transaction - I think this
should be OK, because the rows may be deleted but can't be cleaned up
(or replaced with a new TOAST value). This means the cache could cover
multiple queries - not sure if that's a good thing or bad thing.

I haven't implemented any sort of cache eviction. I agree that's a hard
problem in general, but I doubt we can do better than LRU. I also didn't
implement any memory limit.

FWIW I think it's a fairly serious issue Andy's approach does not have
any way to limit the memory. It will detoasts the values eagerly, but
what if the rows have multiple large values? What if we end up keeping
multiple such rows (from different parts of the plan) at once?

I also haven't implemented caching for sliced data. I think modifying
the code to support this should not be difficult - it'd require tracking
how much data we have in the cache, and considering that during lookup
and when adding entries to cache.

I've implemented the cache as "global". Maybe it should be tied to query
or executor state, but then it'd be harder to access it from detoast.c
(we'd have to pass the cache pointer in some way, and it wouldn't work
for use cases that don't go through executor).

I think implementation-wise this is pretty non-invasive.

Performance-wise, these patches are slower than with Andy's patch. For
example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
v2 at ~150ms. I'm sure there's a number of optimization opportunities
and v2 could get v2 closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
as master, because the data set is small enough to be fully cached, so
there's no I/O and it's the decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some
cases. If there's memory pressure enough to evict TOAST, it's quite
possible the I/O would become the bottleneck. At which point it might be
a good trade off to cache compressed data, because then we'd cache more
detoasted values.

OTOH it's likely the access to TOAST values is localized (in temporal
sense), i.e. we read it from disk, detoast it a couple times in a short
time interval, and then move to the next row. That'd mean v2 is probably
the correct trade off.

Not sure.

"""
A alternative design: toast cache
---------------------------------

This method is provided by Tomas during the review process. IIUC, this
method would maintain a local HTAB which map a toast datum to a detoast
datum and the entry is maintained / used in detoast_attr
function. Within this method, the overall design is pretty clear and the
code modification can be controlled in toasting system only.

Right.

I assumed that releasing all of the memory at the end of executor once
is not an option since it may consumed too many memory. Then, when and
which entry to release becomes a trouble for me. For example:

QUERY PLAN
------------------------------
Nested Loop
Join Filter: (t1.a = t2.a)
-> Seq Scan on t1
-> Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is
in outer relation. Without the help from slot's life-cycle system, I
can't think out a answer for the above question.

This is true, but how likely such plans are? I mean, surely no one would
do nested loop with sequential scans on reasonably large tables, so how
representative this example is?

Also, this leads me to the need of having some sort of memory limit. If
we may be keeping entries for extended periods of time, and we don't
have any way to limit the amount of memory, that does not seem great.

AFAIK if we detoast everything into tts_values[] there's no way to
implement and enforce such limit. What happens if there's a row with
multiple large-ish TOAST values? What happens if those rows are in
different (and distant) parts of the plan?

It seems far easier to limit the memory with the toast cache.

Another difference between the 2 methods is my method have many
modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
it can save some run-time effort like hash_search find / enter run-time
in method 2 since I put them directly into tts_values[*].

I'm not sure the factor 2 makes some real measurable difference in real
case, so my current concern mainly comes from factor 1.
"""

This seems a bit dismissive of the (possible) issue. It'd be good to do
some measurements, especially on simple queries that can't benefit from
the detoasting (e.g. because there's no varlena attribute).

In any case, my concern is more about having to do this when creating
the plan at all, the code complexity etc. Not just because it might have
performance impact.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

toast-cache-v1.patchtext/x-patch; charset=UTF-8; name=toast-cache-v1.patchDownload
diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56e..39c801a0a52 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -20,6 +20,8 @@
 #include "common/int.h"
 #include "common/pg_lzcompress.h"
 #include "utils/expandeddatum.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 
 static struct varlena *toast_fetch_datum(struct varlena *attr);
@@ -29,6 +31,25 @@ static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 static struct varlena *toast_decompress_datum(struct varlena *attr);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);
 
+bool enable_toast_cache = false;
+
+HTAB *toastCache = NULL;
+MemoryContext ToastCacheContext = NULL;
+
+typedef struct toast_cache_key {
+	Oid toastrelid;
+	Oid valueid;
+} toast_cache_key;
+
+typedef struct toast_cache_entry {
+	toast_cache_key key;
+	int32	size;
+	void   *data;
+} toast_cache_entry;
+
+static toast_cache_entry *toast_cache_lookup(Oid toastrelid, Oid valueid);
+static void toast_cache_add_entry(Oid toastrelid, Oid valueid, struct varlena *attr);
+
 /* ----------
  * detoast_external_attr -
  *
@@ -346,6 +367,7 @@ toast_fetch_datum(struct varlena *attr)
 	struct varlena *result;
 	struct varatt_external toast_pointer;
 	int32		attrsize;
+	toast_cache_entry *entry;
 
 	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
 		elog(ERROR, "toast_fetch_datum shouldn't be called for non-ondisk datums");
@@ -366,6 +388,16 @@ toast_fetch_datum(struct varlena *attr)
 		return result;			/* Probably shouldn't happen, but just in
 								 * case. */
 
+	/* try lookup in the toast cache */
+	entry = toast_cache_lookup(toast_pointer.va_toastrelid,
+							   toast_pointer.va_valueid);
+
+	if (entry)
+	{
+		memcpy(VARDATA(result), entry->data, attrsize);
+		return result;
+	}
+
 	/*
 	 * Open the toast relation and its indexes
 	 */
@@ -378,6 +410,10 @@ toast_fetch_datum(struct varlena *attr)
 	/* Close toast table */
 	table_close(toastrel, AccessShareLock);
 
+	toast_cache_add_entry(toast_pointer.va_toastrelid,
+						  toast_pointer.va_valueid,
+						  result);
+
 	return result;
 }
 
@@ -644,3 +680,93 @@ toast_datum_size(Datum value)
 	}
 	return result;
 }
+
+static bool
+toast_cache_init(void)
+{
+	HASHCTL		ctl;
+
+	if (!enable_toast_cache)
+		return false;
+
+	/* already initialized */
+	if (toastCache)
+		return true;
+
+	/* FIXME should really be per query */
+	ToastCacheContext = AllocSetContextCreate(TopMemoryContext,
+											  "TOAST cache context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/* Make a new hash table for the cache */
+	ctl.keysize = sizeof(toast_cache_key);
+	ctl.entrysize = sizeof(toast_cache_entry);
+	ctl.hcxt = ToastCacheContext;
+
+	toastCache = hash_create("TOAST cache",
+							 128, &ctl,
+							 HASH_ELEM | HASH_CONTEXT | HASH_BLOBS);
+
+	return true;
+}
+
+static toast_cache_entry *
+toast_cache_lookup(Oid toastrelid, Oid valueid)
+{
+	toast_cache_key		key;
+	toast_cache_entry  *entry;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return NULL;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_FIND, NULL);
+
+	return entry;
+}
+
+static void
+toast_cache_add_entry(Oid toastrelid, Oid valueid, struct varlena *attr)
+{
+	bool				found = false;
+	toast_cache_key		key;
+	toast_cache_entry  *entry;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_ENTER, &found);
+
+	Assert(!found);
+
+	entry->size = VARSIZE_ANY_EXHDR(attr);
+	entry->data = MemoryContextAlloc(ToastCacheContext, entry->size);
+	memcpy(entry->data, VARDATA_ANY(attr), entry->size);
+}
+
+static void
+toast_cache_destroy(void)
+{
+	if (!toastCache)
+		return;
+
+	MemoryContextDelete(ToastCacheContext);
+
+	toastCache = NULL;
+	ToastCacheContext = NULL;
+}
+
+void
+AtEOXact_ToastCache(void)
+{
+	toast_cache_destroy();
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 70ab6e27a13..2cae3582026 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/detoast.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -2381,6 +2382,7 @@ CommitTransaction(void)
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
 	AtEOXact_LogicalRepWorkers(true);
+	AtEOXact_ToastCache();
 	pgstat_report_xact_timestamp(0);
 
 	ResourceOwnerDelete(TopTransactionResourceOwner);
@@ -2671,6 +2673,7 @@ PrepareTransaction(void)
 	/* we treat PREPARE as ROLLBACK so far as waking workers goes */
 	AtEOXact_ApplyLauncher(false);
 	AtEOXact_LogicalRepWorkers(false);
+	AtEOXact_ToastCache();
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2888,6 +2891,7 @@ AbortTransaction(void)
 		AtEOXact_PgStat(false, is_parallel_worker);
 		AtEOXact_ApplyLauncher(false);
 		AtEOXact_LogicalRepWorkers(false);
+		AtEOXact_ToastCache();
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 93ded31ed92..ceb3d3b953c 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/detoast.h"
 #include "access/gin.h"
 #include "access/slru.h"
 #include "access/toast_compression.h"
@@ -1015,6 +1016,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_toast_cache", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables caching of detoasted values."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_toast_cache,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 12d8cdb356a..1a662c81388 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -79,4 +79,7 @@ extern Size toast_raw_datum_size(Datum value);
  */
 extern Size toast_datum_size(Datum value);
 
+extern PGDLLIMPORT bool enable_toast_cache;
+extern void AtEOXact_ToastCache(void);
+
 #endif							/* DETOAST_H */
toast-cache-v2.patchtext/x-patch; charset=UTF-8; name=toast-cache-v2.patchDownload
diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56e..5049410e81f 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -20,6 +20,8 @@
 #include "common/int.h"
 #include "common/pg_lzcompress.h"
 #include "utils/expandeddatum.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 
 static struct varlena *toast_fetch_datum(struct varlena *attr);
@@ -29,6 +31,30 @@ static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 static struct varlena *toast_decompress_datum(struct varlena *attr);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);
 
+bool enable_toast_cache = false;
+
+HTAB *toastCache = NULL;
+MemoryContext ToastCacheContext = NULL;
+
+static uint64	toast_cache_hits = 0;
+static uint64	toast_cache_misses = 0;
+
+typedef struct toast_cache_key {
+	Oid toastrelid;
+	Oid valueid;
+} toast_cache_key;
+
+typedef struct toast_cache_entry {
+	toast_cache_key key;
+	int32	size;
+	void   *data;
+} toast_cache_entry;
+
+static toast_cache_entry *toast_cache_lookup(Oid toastrelid, Oid valueid);
+static void toast_cache_add_entry(Oid toastrelid, Oid valueid, struct varlena *attr);
+static struct varlena *toast_cache_lookup_datum(toast_cache_key key);
+static void toast_cache_store_datum(toast_cache_key key, struct varlena *attr);
+
 /* ----------
  * detoast_external_attr -
  *
@@ -117,6 +143,28 @@ detoast_attr(struct varlena *attr)
 {
 	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
+		struct varlena *result;
+		toast_cache_key	key;
+
+		/* calculate the cache key */
+		{
+			struct varatt_external toast_pointer;
+
+			/* Must copy to access aligned fields */
+			VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+			/* try lookup in the toast cache */
+			key.toastrelid = toast_pointer.va_toastrelid;
+			key.valueid = toast_pointer.va_valueid;
+		}
+
+		/* first try to lookup in the toast cache */
+		result = toast_cache_lookup_datum(key);
+
+		/* cache hit, we're done */
+		if (result)
+			return result;
+
 		/*
 		 * This is an externally stored datum --- fetch it back from there
 		 */
@@ -129,6 +177,8 @@ detoast_attr(struct varlena *attr)
 			attr = toast_decompress_datum(tmp);
 			pfree(tmp);
 		}
+
+		toast_cache_store_datum(key, attr);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -644,3 +694,130 @@ toast_datum_size(Datum value)
 	}
 	return result;
 }
+
+static bool
+toast_cache_init(void)
+{
+	HASHCTL		ctl;
+
+	if (!enable_toast_cache)
+		return false;
+
+	/* already initialized */
+	if (toastCache)
+		return true;
+
+	toast_cache_hits = 0;
+	toast_cache_misses = 0;
+
+	/* FIXME should really be per query */
+	ToastCacheContext = AllocSetContextCreate(TopMemoryContext,
+											  "TOAST cache context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/* Make a new hash table for the cache */
+	ctl.keysize = sizeof(toast_cache_key);
+	ctl.entrysize = sizeof(toast_cache_entry);
+	ctl.hcxt = ToastCacheContext;
+
+	toastCache = hash_create("TOAST cache",
+							 128, &ctl,
+							 HASH_ELEM | HASH_CONTEXT | HASH_BLOBS);
+
+	return true;
+}
+
+static toast_cache_entry *
+toast_cache_lookup(Oid toastrelid, Oid valueid)
+{
+	toast_cache_key		key;
+	toast_cache_entry  *entry;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return NULL;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_FIND, NULL);
+
+	if (entry)
+		toast_cache_hits++;
+	else
+		toast_cache_misses++;
+
+	return entry;
+}
+
+static void
+toast_cache_add_entry(Oid toastrelid, Oid valueid, struct varlena *attr)
+{
+	bool				found = false;
+	toast_cache_key		key;
+	toast_cache_entry  *entry;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_ENTER, &found);
+
+	Assert(!found);
+
+	entry->size = VARSIZE_ANY_EXHDR(attr);
+	entry->data = MemoryContextAlloc(ToastCacheContext, entry->size);
+	memcpy(entry->data, VARDATA_ANY(attr), entry->size);
+}
+
+static void
+toast_cache_destroy(void)
+{
+	if (!toastCache)
+		return;
+
+	elog(LOG, "AtEOXact_ToastCache hits %ld misses %ld ratio %.2f%%",
+		 toast_cache_hits, toast_cache_misses,
+		 (toast_cache_hits * 100.0) / Max(1, toast_cache_hits + toast_cache_misses));
+
+	MemoryContextDelete(ToastCacheContext);
+
+	toastCache = NULL;
+	ToastCacheContext = NULL;
+}
+
+void
+AtEOXact_ToastCache(void)
+{
+	toast_cache_destroy();
+}
+
+static struct varlena *
+toast_cache_lookup_datum(toast_cache_key key)
+{
+	toast_cache_entry *entry;
+	struct varlena *result = NULL;
+
+	entry = toast_cache_lookup(key.toastrelid, key.valueid);
+
+	if (!entry)
+		return NULL;
+
+	result = (struct varlena *) palloc(entry->size + VARHDRSZ);
+	SET_VARSIZE(result, entry->size + VARHDRSZ);
+
+	memcpy(VARDATA(result), entry->data, entry->size);
+
+	return result;
+}
+
+static void
+toast_cache_store_datum(toast_cache_key key, struct varlena *attr)
+{
+	toast_cache_add_entry(key.toastrelid, key.valueid, attr);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 70ab6e27a13..2cae3582026 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/detoast.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -2381,6 +2382,7 @@ CommitTransaction(void)
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
 	AtEOXact_LogicalRepWorkers(true);
+	AtEOXact_ToastCache();
 	pgstat_report_xact_timestamp(0);
 
 	ResourceOwnerDelete(TopTransactionResourceOwner);
@@ -2671,6 +2673,7 @@ PrepareTransaction(void)
 	/* we treat PREPARE as ROLLBACK so far as waking workers goes */
 	AtEOXact_ApplyLauncher(false);
 	AtEOXact_LogicalRepWorkers(false);
+	AtEOXact_ToastCache();
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2888,6 +2891,7 @@ AbortTransaction(void)
 		AtEOXact_PgStat(false, is_parallel_worker);
 		AtEOXact_ApplyLauncher(false);
 		AtEOXact_LogicalRepWorkers(false);
+		AtEOXact_ToastCache();
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 93ded31ed92..ceb3d3b953c 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/detoast.h"
 #include "access/gin.h"
 #include "access/slru.h"
 #include "access/toast_compression.h"
@@ -1015,6 +1016,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_toast_cache", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables caching of detoasted values."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_toast_cache,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 12d8cdb356a..1a662c81388 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -79,4 +79,7 @@ extern Size toast_raw_datum_size(Datum value);
  */
 extern Size toast_datum_size(Datum value);
 
+extern PGDLLIMPORT bool enable_toast_cache;
+extern void AtEOXact_ToastCache(void);
+
 #endif							/* DETOAST_H */
#34Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#33)
Re: Shared detoast Datum proposal

I decided to whip up a PoC patch implementing this approach, to get a
better idea of how it compares to the original proposal, both in terms
of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but
differ in when exactly the caching happens.

toast-cache-v1 - caching happens in toast_fetch_datum, which means this
happens before decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

...

I think implementation-wise this is pretty non-invasive.

..

Performance-wise, these patches are slower than with Andy's patch. For
example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
v2 at ~150ms. I'm sure there's a number of optimization opportunities
and v2 could get v2 closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
as master, because the data set is small enough to be fully cached, so
there's no I/O and it's the decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some
cases. If there's memory pressure enough to evict TOAST, it's quite
possible the I/O would become the bottleneck. At which point it might be
a good trade off to cache compressed data, because then we'd cache more
detoasted values.

OTOH it's likely the access to TOAST values is localized (in temporal
sense), i.e. we read it from disk, detoast it a couple times in a short
time interval, and then move to the next row. That'd mean v2 is probably
the correct trade off.

I don't have different thought about the above statement and Thanks for
the PoC code which would make later testing much easier.

"""
A alternative design: toast cache
---------------------------------

This method is provided by Tomas during the review process. IIUC, this
method would maintain a local HTAB which map a toast datum to a detoast
datum and the entry is maintained / used in detoast_attr
function. Within this method, the overall design is pretty clear and the
code modification can be controlled in toasting system only.

Right.

I assumed that releasing all of the memory at the end of executor once
is not an option since it may consumed too many memory. Then, when and
which entry to release becomes a trouble for me. For example:

QUERY PLAN
------------------------------
Nested Loop
Join Filter: (t1.a = t2.a)
-> Seq Scan on t1
-> Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is
in outer relation. Without the help from slot's life-cycle system, I
can't think out a answer for the above question.

This is true, but how likely such plans are? I mean, surely no one would
do nested loop with sequential scans on reasonably large tables, so how
representative this example is?

Acutally this is a simplest Join case, we still have same problem like
Nested Loop + Index Scan which will be pretty common.

Also, this leads me to the need of having some sort of memory limit. If
we may be keeping entries for extended periods of time, and we don't
have any way to limit the amount of memory, that does not seem great.

AFAIK if we detoast everything into tts_values[] there's no way to
implement and enforce such limit. What happens if there's a row with
multiple large-ish TOAST values? What happens if those rows are in
different (and distant) parts of the plan?

I think this can be done by tracking the memory usage on EState level
or global variable level and disable it if it exceeds the limits and
resume it when we free the detoast datum when we don't need it. I think
no other changes need to be done.

It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction is the key issue now. IMO,
there are still some difference when both methods can support memory
limit. The reason is my patch can grantee the cached memory will be
reused, so if we set the limit to 10MB, we know all the 10MB is
useful, but the TOAST cache method, probably can't grantee that, then
when we want to make it effecitvely, we have to set a higher limit for
this.

Another difference between the 2 methods is my method have many
modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
it can save some run-time effort like hash_search find / enter run-time
in method 2 since I put them directly into tts_values[*].

I'm not sure the factor 2 makes some real measurable difference in real
case, so my current concern mainly comes from factor 1.
"""

This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention:(

It'd be good to do
some measurements, especially on simple queries that can't benefit from
the detoasting (e.g. because there's no varlena attribute).

This testing for the queries which have no varlena attribute was done at
the very begining of this thread, and for now, the code is much more
nicer for this sistuation. all the changes in execExpr.c
execExprInterp.c has no impact on this. Only the plan walker codes
matter. Here is a test based on tenk1.

q1: explain select count(*) from tenk1 where odd > 10 and even > 30;
q2: select count(*) from tenk1 where odd > 10 and even > 30;

pgbench -n -f qx.sql regression -T10

| Query | master (ms) | patched (ms) |
|-------+-------------+--------------|
| q1 | 0.129 | 0.129 |
| q2 | 1.876 | 1.870 |

there are some error for patched-q2 combination, but at least it can
show it doesn't cause noticable regression.

In any case, my concern is more about having to do this when creating
the plan at all, the code complexity etc. Not just because it might have
performance impact.

I think the main trade-off is TOAST cache method is pretty non-invasive
but can't control the eviction well, the impacts includes:
1. may evicting the datum we want and kept the datum we don't need.
2. more likely to use up all the memory which is allowed. for example:
if we set the limit to 1MB, then we kept more data which will be not
used and then consuming all of the 1MB.

My method is resolving this with some helps from other modules (kind of
invasive) but can control the eviction well and use the memory as less
as possible.

--
Best Regards
Andy Fan

#35Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#34)
Re: Shared detoast Datum proposal

On 3/4/24 18:08, Andy Fan wrote:

...

I assumed that releasing all of the memory at the end of executor once
is not an option since it may consumed too many memory. Then, when and
which entry to release becomes a trouble for me. For example:

QUERY PLAN
------------------------------
Nested Loop
Join Filter: (t1.a = t2.a)
-> Seq Scan on t1
-> Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is
in outer relation. Without the help from slot's life-cycle system, I
can't think out a answer for the above question.

This is true, but how likely such plans are? I mean, surely no one would
do nested loop with sequential scans on reasonably large tables, so how
representative this example is?

Acutally this is a simplest Join case, we still have same problem like
Nested Loop + Index Scan which will be pretty common.

Yes, I understand there are cases where LRU eviction may not be the best
choice - like here, where the "t1" should stay in the case. But there
are also cases where this is the wrong choice, and LRU would be better.

For example a couple paragraphs down you suggest to enforce the memory
limit by disabling detoasting if the memory limit is reached. That means
the detoasting can get disabled because there's a single access to the
attribute somewhere "up the plan tree". But what if the other attributes
(which now won't be detoasted) are accessed many times until then?

I think LRU is a pretty good "default" algorithm if we don't have a very
good idea of the exact life span of the values, etc. Which is why other
nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is
accessed (or is likely to be). That might be used to inform a better
eviction strategy.

Also, we don't need to evict the whole entry - we could evict just the
data part (guaranteed to be fairly large), but keep the header, and keep
the counts, expected number of hits, and other info. And use this to
e.g. release entries that reached the expected number of hits. But I'd
probably start with LRU and only do this as an improvement later.

Also, this leads me to the need of having some sort of memory limit. If
we may be keeping entries for extended periods of time, and we don't
have any way to limit the amount of memory, that does not seem great.

AFAIK if we detoast everything into tts_values[] there's no way to
implement and enforce such limit. What happens if there's a row with
multiple large-ish TOAST values? What happens if those rows are in
different (and distant) parts of the plan?

I think this can be done by tracking the memory usage on EState level
or global variable level and disable it if it exceeds the limits and
resume it when we free the detoast datum when we don't need it. I think
no other changes need to be done.

That seems like a fair amount of additional complexity. And what if the
toasted values are accessed in context without EState (I haven't checked
how common / important that is)?

And I'm not sure just disabling the detoast as a way to enforce a memory
limit, as explained earlier.

It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction is the key issue now. IMO,
there are still some difference when both methods can support memory
limit. The reason is my patch can grantee the cached memory will be
reused, so if we set the limit to 10MB, we know all the 10MB is
useful, but the TOAST cache method, probably can't grantee that, then
when we want to make it effecitvely, we have to set a higher limit for
this.

Can it actually guarantee that? It can guarantee the slot may be used,
but I don't see how could it guarantee the detoasted value will be used.
We may be keeping the slot for other reasons. And even if it could
guarantee the detoasted value will be used, does that actually prove
it's better to keep that value? What if it's only used once, but it's
blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value in the outer side of the Nest Loop, what if
the inner path has multiple accesses to another 10MB value that now
can't be detoasted (as a shared value)?

Another difference between the 2 methods is my method have many
modification on createplan.c/setref.c/execExpr.c/execExprInterp.c, but
it can save some run-time effort like hash_search find / enter run-time
in method 2 since I put them directly into tts_values[*].

I'm not sure the factor 2 makes some real measurable difference in real
case, so my current concern mainly comes from factor 1.
"""

This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention:(

No need to apologize.

It'd be good to do
some measurements, especially on simple queries that can't benefit from
the detoasting (e.g. because there's no varlena attribute).

This testing for the queries which have no varlena attribute was done at
the very begining of this thread, and for now, the code is much more
nicer for this sistuation. all the changes in execExpr.c
execExprInterp.c has no impact on this. Only the plan walker codes
matter. Here is a test based on tenk1.

q1: explain select count(*) from tenk1 where odd > 10 and even > 30;
q2: select count(*) from tenk1 where odd > 10 and even > 30;

pgbench -n -f qx.sql regression -T10

| Query | master (ms) | patched (ms) |
|-------+-------------+--------------|
| q1 | 0.129 | 0.129 |
| q2 | 1.876 | 1.870 |

there are some error for patched-q2 combination, but at least it can
show it doesn't cause noticable regression.

OK, I'll try to do some experiments with these cases too.

In any case, my concern is more about having to do this when creating
the plan at all, the code complexity etc. Not just because it might have
performance impact.

I think the main trade-off is TOAST cache method is pretty non-invasive
but can't control the eviction well, the impacts includes:
1. may evicting the datum we want and kept the datum we don't need.

This applies to any eviction algorithm, not just LRU. Ultimately what
matters is whether we have in the cache the most often used values, i.e.
the hit ratio (perhaps in combination with how expensive detoasting that
particular entry was).

2. more likely to use up all the memory which is allowed. for example:
if we set the limit to 1MB, then we kept more data which will be not
used and then consuming all of the 1MB.

My method is resolving this with some helps from other modules (kind of
invasive) but can control the eviction well and use the memory as less
as possible.

Is the memory usage really an issue? Sure, it'd be nice to evict entries
as soon as we can prove they are not needed anymore, but if the cache
limit is set to 1MB it's not really a problem to use 1MB.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#36Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#35)
Re: Shared detoast Datum proposal

2. more likely to use up all the memory which is allowed. for example:
if we set the limit to 1MB, then we kept more data which will be not
used and then consuming all of the 1MB.

My method is resolving this with some helps from other modules (kind of
invasive) but can control the eviction well and use the memory as less
as possible.

Is the memory usage really an issue? Sure, it'd be nice to evict entries
as soon as we can prove they are not needed anymore, but if the cache
limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us to some different directions, so
I want to explain more about this, to see if we can get some agreement
here.

It is a bit hard to decide which memory limit to set, 1MB, 10MB or 40MB,
100MB. In my current case it is 40MB at least. Less memory limit
causes cache ineffecitve, high memory limit cause potential memory
use-up issue in the TOAST cache design. But in my method, even we set a
higher value, it just limits the user case it really (nearly) needed,
and it would not cache more values util the limit is hit. This would
make a noticable difference when we want to set a high limit and we have
some high active sessions, like 100 * 40MB = 4GB.

On 3/4/24 18:08, Andy Fan wrote:

...

I assumed that releasing all of the memory at the end of executor once
is not an option since it may consumed too many memory. Then, when and
which entry to release becomes a trouble for me. For example:

QUERY PLAN
------------------------------
Nested Loop
Join Filter: (t1.a = t2.a)
-> Seq Scan on t1
-> Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is
in outer relation. Without the help from slot's life-cycle system, I
can't think out a answer for the above question.

This is true, but how likely such plans are? I mean, surely no one would
do nested loop with sequential scans on reasonably large tables, so how
representative this example is?

Acutally this is a simplest Join case, we still have same problem like
Nested Loop + Index Scan which will be pretty common.

Yes, I understand there are cases where LRU eviction may not be the best
choice - like here, where the "t1" should stay in the case. But there
are also cases where this is the wrong choice, and LRU would be better.

For example a couple paragraphs down you suggest to enforce the memory
limit by disabling detoasting if the memory limit is reached. That means
the detoasting can get disabled because there's a single access to the
attribute somewhere "up the plan tree". But what if the other attributes
(which now won't be detoasted) are accessed many times until then?

I am not sure I can't follow up here, but I want to explain more about
the disable-detoast-sharing logic when the memory limit is hit. When
this happen, the detoast sharing is disabled, but since the detoast
datum will be released every soon when the slot->tts_values[*] is
discard, then the 'disable' turn to 'enable' quickly. So It is not
true that once it is get disabled, it can't be enabled any more for the
given query.

I think LRU is a pretty good "default" algorithm if we don't have a very
good idea of the exact life span of the values, etc. Which is why other
nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is
accessed (or is likely to be). That might be used to inform a better
eviction strategy.

Yes, but in current issue we can get a better esitimation with the help
of plan shape and Memoize depends on some planner information as well.
If we bypass the planner information and try to resolve it at the
cache level, the code may become to complex as well and all the cost is
run time overhead while the other way is a planning timie overhead.

Also, we don't need to evict the whole entry - we could evict just the
data part (guaranteed to be fairly large), but keep the header, and keep
the counts, expected number of hits, and other info. And use this to
e.g. release entries that reached the expected number of hits. But I'd
probably start with LRU and only do this as an improvement later.

A great lession learnt here, thanks for sharing this!

As for the current user case what I want to highlight is in the current
user case, we are "caching" "user data" "locally".

USER DATA indicates it might be very huge, it is not common to have a
1M tables, but it is much common we have 1M Tuples in one scan, so
keeping the header might extra memory usage as well, like 10M * 24 bytes
= 240MB. LOCALLY means it is not friend to multi active sessions. CACHE
indicates it is hard to evict correctly. My method also have the USER
DATA, LOCALLY attributes, but it would be better at eviction. eviction
then have lead to memory usage issue which is discribed at the beginning
of this writing.

Also, this leads me to the need of having some sort of memory limit. If
we may be keeping entries for extended periods of time, and we don't
have any way to limit the amount of memory, that does not seem great.

AFAIK if we detoast everything into tts_values[] there's no way to
implement and enforce such limit. What happens if there's a row with
multiple large-ish TOAST values? What happens if those rows are in
different (and distant) parts of the plan?

I think this can be done by tracking the memory usage on EState level
or global variable level and disable it if it exceeds the limits and
resume it when we free the detoast datum when we don't need it. I think
no other changes need to be done.

That seems like a fair amount of additional complexity. And what if the
toasted values are accessed in context without EState (I haven't checked
how common / important that is)?

And I'm not sure just disabling the detoast as a way to enforce a memory
limit, as explained earlier.

It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction is the key issue now. IMO,
there are still some difference when both methods can support memory
limit. The reason is my patch can grantee the cached memory will be
reused, so if we set the limit to 10MB, we know all the 10MB is
useful, but the TOAST cache method, probably can't grantee that, then
when we want to make it effecitvely, we have to set a higher limit for
this.

Can it actually guarantee that? It can guarantee the slot may be used,
but I don't see how could it guarantee the detoasted value will be used.
We may be keeping the slot for other reasons. And even if it could
guarantee the detoasted value will be used, does that actually prove
it's better to keep that value? What if it's only used once, but it's
blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value in the outer side of the Nest Loop, what if
the inner path has multiple accesses to another 10MB value that now
can't be detoasted (as a shared value)?

Grarantee may be wrong word. The difference in my mind are:
1. plan shape have better potential to know the user case of datum,
since we know the plan tree and knows the rows pass to a given node.
2. Planning time effort is cheaper than run-time effort.
3. eviction in my method is not as important as it is in TOAST cache
method since it is reset per slot, so usually it doesn't hit limit in
fact. But as a cache, it does.
4. use up to memory limit we set in TOAST cache case.

In any case, my concern is more about having to do this when creating
the plan at all, the code complexity etc. Not just because it might have
performance impact.

I think the main trade-off is TOAST cache method is pretty non-invasive
but can't control the eviction well, the impacts includes:
1. may evicting the datum we want and kept the datum we don't need.

This applies to any eviction algorithm, not just LRU. Ultimately what
matters is whether we have in the cache the most often used values, i.e.
the hit ratio (perhaps in combination with how expensive detoasting that
particular entry was).

Correct, just that I am doubtful about design a LOCAL CACHE for USER
DATA with the reason I described above.

At last, thanks for your attention, really appreciated about it!

--
Best Regards
Andy Fan

#37Nikita Malakhov
hukutoc@gmail.com
In reply to: Andy Fan (#36)
Re: Shared detoast Datum proposal

Hi,

Tomas, sorry for confusion, in my previous message I meant exactly
the same approach you've posted above, and came up with almost
the same implementation.

Thank you very much for your attention to this thread!

I've asked Andy about this approach because of the same reasons you
mentioned - it keeps cache code small, localized and easy to maintain.

The question that worries me is the memory limit, and I think that cache
lookup and entry invalidation should be done in toast_tuple_externalize
code, for the case the value has been detoasted previously is updated
in the same query, like
UPDATE test SET value = value || '...';

I've added cache entry invalidation and data removal on delete and update
of the toasted values and am currently experimenting with large values.

On Tue, Mar 5, 2024 at 5:59 AM Andy Fan <zhihuifan1213@163.com> wrote:

2. more likely to use up all the memory which is allowed. for example:
if we set the limit to 1MB, then we kept more data which will be not
used and then consuming all of the 1MB.

My method is resolving this with some helps from other modules (kind of
invasive) but can control the eviction well and use the memory as less
as possible.

Is the memory usage really an issue? Sure, it'd be nice to evict entries
as soon as we can prove they are not needed anymore, but if the cache
limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us to some different directions, so
I want to explain more about this, to see if we can get some agreement
here.

It is a bit hard to decide which memory limit to set, 1MB, 10MB or 40MB,
100MB. In my current case it is 40MB at least. Less memory limit
causes cache ineffecitve, high memory limit cause potential memory
use-up issue in the TOAST cache design. But in my method, even we set a
higher value, it just limits the user case it really (nearly) needed,
and it would not cache more values util the limit is hit. This would
make a noticable difference when we want to set a high limit and we have
some high active sessions, like 100 * 40MB = 4GB.

On 3/4/24 18:08, Andy Fan wrote:

...

I assumed that releasing all of the memory at the end of executor once
is not an option since it may consumed too many memory. Then, when and
which entry to release becomes a trouble for me. For example:

QUERY PLAN
------------------------------
Nested Loop
Join Filter: (t1.a = t2.a)
-> Seq Scan on t1
-> Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is
in outer relation. Without the help from slot's life-cycle system, I
can't think out a answer for the above question.

This is true, but how likely such plans are? I mean, surely no one

would

do nested loop with sequential scans on reasonably large tables, so how
representative this example is?

Acutally this is a simplest Join case, we still have same problem like
Nested Loop + Index Scan which will be pretty common.

Yes, I understand there are cases where LRU eviction may not be the best
choice - like here, where the "t1" should stay in the case. But there
are also cases where this is the wrong choice, and LRU would be better.

For example a couple paragraphs down you suggest to enforce the memory
limit by disabling detoasting if the memory limit is reached. That means
the detoasting can get disabled because there's a single access to the
attribute somewhere "up the plan tree". But what if the other attributes
(which now won't be detoasted) are accessed many times until then?

I am not sure I can't follow up here, but I want to explain more about
the disable-detoast-sharing logic when the memory limit is hit. When
this happen, the detoast sharing is disabled, but since the detoast
datum will be released every soon when the slot->tts_values[*] is
discard, then the 'disable' turn to 'enable' quickly. So It is not
true that once it is get disabled, it can't be enabled any more for the
given query.

I think LRU is a pretty good "default" algorithm if we don't have a very
good idea of the exact life span of the values, etc. Which is why other
nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is
accessed (or is likely to be). That might be used to inform a better
eviction strategy.

Yes, but in current issue we can get a better esitimation with the help
of plan shape and Memoize depends on some planner information as well.
If we bypass the planner information and try to resolve it at the
cache level, the code may become to complex as well and all the cost is
run time overhead while the other way is a planning timie overhead.

Also, we don't need to evict the whole entry - we could evict just the
data part (guaranteed to be fairly large), but keep the header, and keep
the counts, expected number of hits, and other info. And use this to
e.g. release entries that reached the expected number of hits. But I'd
probably start with LRU and only do this as an improvement later.

A great lession learnt here, thanks for sharing this!

As for the current user case what I want to highlight is in the current
user case, we are "caching" "user data" "locally".

USER DATA indicates it might be very huge, it is not common to have a
1M tables, but it is much common we have 1M Tuples in one scan, so
keeping the header might extra memory usage as well, like 10M * 24 bytes
= 240MB. LOCALLY means it is not friend to multi active sessions. CACHE
indicates it is hard to evict correctly. My method also have the USER
DATA, LOCALLY attributes, but it would be better at eviction. eviction
then have lead to memory usage issue which is discribed at the beginning
of this writing.

Also, this leads me to the need of having some sort of memory limit. If
we may be keeping entries for extended periods of time, and we don't
have any way to limit the amount of memory, that does not seem great.

AFAIK if we detoast everything into tts_values[] there's no way to
implement and enforce such limit. What happens if there's a row with
multiple large-ish TOAST values? What happens if those rows are in
different (and distant) parts of the plan?

I think this can be done by tracking the memory usage on EState level
or global variable level and disable it if it exceeds the limits and
resume it when we free the detoast datum when we don't need it. I think
no other changes need to be done.

That seems like a fair amount of additional complexity. And what if the
toasted values are accessed in context without EState (I haven't checked
how common / important that is)?

And I'm not sure just disabling the detoast as a way to enforce a memory
limit, as explained earlier.

It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction is the key issue now. IMO,
there are still some difference when both methods can support memory
limit. The reason is my patch can grantee the cached memory will be
reused, so if we set the limit to 10MB, we know all the 10MB is
useful, but the TOAST cache method, probably can't grantee that, then
when we want to make it effecitvely, we have to set a higher limit for
this.

Can it actually guarantee that? It can guarantee the slot may be used,
but I don't see how could it guarantee the detoasted value will be used.
We may be keeping the slot for other reasons. And even if it could
guarantee the detoasted value will be used, does that actually prove
it's better to keep that value? What if it's only used once, but it's
blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value in the outer side of the Nest Loop, what if
the inner path has multiple accesses to another 10MB value that now
can't be detoasted (as a shared value)?

Grarantee may be wrong word. The difference in my mind are:
1. plan shape have better potential to know the user case of datum,
since we know the plan tree and knows the rows pass to a given node.
2. Planning time effort is cheaper than run-time effort.
3. eviction in my method is not as important as it is in TOAST cache
method since it is reset per slot, so usually it doesn't hit limit in
fact. But as a cache, it does.
4. use up to memory limit we set in TOAST cache case.

In any case, my concern is more about having to do this when creating
the plan at all, the code complexity etc. Not just because it might

have

performance impact.

I think the main trade-off is TOAST cache method is pretty non-invasive
but can't control the eviction well, the impacts includes:
1. may evicting the datum we want and kept the datum we don't need.

This applies to any eviction algorithm, not just LRU. Ultimately what
matters is whether we have in the cache the most often used values, i.e.
the hit ratio (perhaps in combination with how expensive detoasting that
particular entry was).

Correct, just that I am doubtful about design a LOCAL CACHE for USER
DATA with the reason I described above.

At last, thanks for your attention, really appreciated about it!

--
Best Regards
Andy Fan

--
Regards,

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#38Nikita Malakhov
hukutoc@gmail.com
In reply to: Nikita Malakhov (#37)
Re: Shared detoast Datum proposal

Hi,

In addition to the previous message - for the toast_fetch_datum_slice
the first (seems obvious) way is to detoast the whole value, save it
to cache and get slices from it on demand. I have another one on my
mind, but have to play with it first.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#39Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Nikita Malakhov (#38)
Re: Shared detoast Datum proposal

On 3/6/24 07:09, Nikita Malakhov wrote:

Hi,

In addition to the previous message - for the toast_fetch_datum_slice
the first (seems obvious) way is to detoast the whole value, save it
to cache and get slices from it on demand. I have another one on my
mind, but have to play with it first.

What if you only need the first slice? In that case decoding everything
will be a clear regression.

IMHO this needs to work like this:

1) check if the cache already has sufficiently large slice detoasted
(and use the cached value if yes)

2) if there's no slice detoasted, or if it's too small, detoast a
sufficiently large slice and store it in the cache (possibly evicting
the already detoasted slice)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#40Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Nikita Malakhov (#37)
Re: Shared detoast Datum proposal

On 3/5/24 10:03, Nikita Malakhov wrote:

Hi,

Tomas, sorry for confusion, in my previous message I meant exactly
the same approach you've posted above, and came up with almost
the same implementation.

Thank you very much for your attention to this thread!

I've asked Andy about this approach because of the same reasons you
mentioned - it keeps cache code small, localized and easy to maintain.

The question that worries me is the memory limit, and I think that cache
lookup and entry invalidation should be done in toast_tuple_externalize
code, for the case the value has been detoasted previously is updated
in the same query, like

I haven't thought very much about when we need to invalidate entries and
where exactly should that happen. My patch is a very rough PoC that I
wrote in ~1h, and it's quite possible it will have to move or should be
refactored in some way.

But I don't quite understand the problem with this query:

UPDATE test SET value = value || '...';

Surely the update creates a new TOAST value, with a completely new TOAST
pointer, new rows in the TOAST table etc. It's not updated in-place. So
it'd end up with two independent entries in the TOAST cache.

Or are you interested just in evicting the old value simply to free the
memory, because we're not going to need that (now invisible) value? That
seems interesting, but if it's too invasive or complex I'd just leave it
up to the LRU eviction (at least for v1).

I'm not sure why should anything happen in toast_tuple_externalize, that
seems like a very low-level piece of code. But I realized there's also
toast_delete_external, and maybe that's a good place to invalidate the
deleted TOAST values (which would also cover the UPDATE).

I've added cache entry invalidation and data removal on delete and update
of the toasted values and am currently experimenting with large values.

I haven't done anything about large values in the PoC patch, but I think
it might be a good idea to either ignore values that are too large (so
that it does not push everything else from the cache) or store them in
compressed form (assuming it's small enough, to at least save the I/O).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Nikita Malakhov
hukutoc@gmail.com
In reply to: Tomas Vondra (#40)
Re: Shared detoast Datum proposal

Hi!

Tomas, I thought about this issue -

What if you only need the first slice? In that case decoding everything
will be a clear regression.

And completely agree with you, I'm currently working on it and will post
a patch a bit later. Another issue we have here - if we have multiple
slices requested - then we have to update cached slice from both sides,
the beginning and the end.

On update, yep, you're right

Surely the update creates a new TOAST value, with a completely new TOAST
pointer, new rows in the TOAST table etc. It's not updated in-place. So
it'd end up with two independent entries in the TOAST cache.

Or are you interested just in evicting the old value simply to free the
memory, because we're not going to need that (now invisible) value? That
seems interesting, but if it's too invasive or complex I'd just leave it
up to the LRU eviction (at least for v1).

Again, yes, we do not need the old value after it was updated and
it is better to take care of it explicitly. It's a simple and not invasive
addition to your patch.

Could not agree with you about large values - it makes sense
to cache large values because the larger it is the more benefit
we will have from not detoasting it again and again. One way
is to keep them compressed, but what if we have data with very
low compression rate?

I also changed HASHACTION field value from HASH_ENTER to
HASH_ENTER_NULL to softly deal with case when we do not
have enough memory and added checks for if the value was stored
or not.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#42Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Nikita Malakhov (#41)
Re: Shared detoast Datum proposal

On 3/7/24 08:33, Nikita Malakhov wrote:

Hi!

Tomas, I thought about this issue -

What if you only need the first slice? In that case decoding everything
will be a clear regression.

And completely agree with you, I'm currently working on it and will post
a patch a bit later.

Cool. I don't plan to work on improving my patch - it was merely a PoC,
so if you're interested in working on that, that's good.

Another issue we have here - if we have multiple slices requested -
then we have to update cached slice from both sides, the beginning
and the end.

No opinion. I haven't thought about how to handle slices too much.

On update, yep, you're right

Surely the update creates a new TOAST value, with a completely new TOAST
pointer, new rows in the TOAST table etc. It's not updated in-place. So
it'd end up with two independent entries in the TOAST cache.

Or are you interested just in evicting the old value simply to free the
memory, because we're not going to need that (now invisible) value? That
seems interesting, but if it's too invasive or complex I'd just leave it
up to the LRU eviction (at least for v1).

Again, yes, we do not need the old value after it was updated and
it is better to take care of it explicitly. It's a simple and not invasive
addition to your patch.

OK

Could not agree with you about large values - it makes sense
to cache large values because the larger it is the more benefit
we will have from not detoasting it again and again. One way
is to keep them compressed, but what if we have data with very
low compression rate?

I'm not against caching large values, but I also think it's not
difficult to construct cases where it's a losing strategy.

I also changed HASHACTION field value from HASH_ENTER to
HASH_ENTER_NULL to softly deal with case when we do not
have enough memory and added checks for if the value was stored
or not.

I'm not sure about this. HASH_ENTER_NULL is only being used in three
very specific places (two are lock management, one is shmeminitstruct).
This hash table is not unique in any way, so why not to use HASH_ENTER
like 99% other hash tables?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#43Nikita Malakhov
hukutoc@gmail.com
In reply to: Tomas Vondra (#42)
1 attachment(s)
Re: Shared detoast Datum proposal

Hi!

Here's a slightly improved version of patch Tomas provided above (v2),
with cache invalidations and slices caching added, still as PoC.

The main issue I've encountered during tests is that when the same query
retrieves both slices and full value - slices, like substring(t,...) the
order of
the values is not predictable, with text fields substrings are retrieved
before the full value and we cannot benefit from cache full value first
and use slices from cached value.

Yet the cache code is still very compact and affects only sources related
to detoasting.

Tomas, about HASH_ENTER - according to comments it could throw
an OOM error, so I've changed it to HASH_ENTER_NULL to avoid
new errors. In this case we would just have the value not cached
without an error.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

Attachments:

0001-toast-cache-v3.patchapplication/octet-stream; name=0001-toast-cache-v3.patchDownload
From c8557c827481a0ec6f6ff5cecdf2299a5743d052 Mon Sep 17 00:00:00 2001
From: Nikita Malakhov <n.malakhov@postgrespro.ru>
Date: Thu, 14 Mar 2024 09:12:32 +0300
Subject: [PATCH] [WIP] Shared detoast datum patch v2 with toast slices caching
 added


diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56..8beb5f8e68 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -20,6 +20,8 @@
 #include "common/int.h"
 #include "common/pg_lzcompress.h"
 #include "utils/expandeddatum.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
 #include "utils/rel.h"
 
 static struct varlena *toast_fetch_datum(struct varlena *attr);
@@ -29,6 +31,43 @@ static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 static struct varlena *toast_decompress_datum(struct varlena *attr);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);
 
+bool enable_toast_cache = false;
+
+HTAB *toastCache = NULL;
+MemoryContext ToastCacheContext = NULL;
+MemoryContext ToastSliceContext = NULL;
+
+static uint64	toast_cache_hits = 0;
+static uint64	toast_cache_misses = 0;
+
+typedef struct toast_cache_key {
+	Oid toastrelid;
+	Oid valueid;
+} toast_cache_key;
+
+typedef struct toast_cache_entry {
+	toast_cache_key key;
+	int32	size;
+	uintptr_t slices;
+	void   *data;
+} toast_cache_entry;
+
+typedef struct toast_slice_entry {
+	int32	offset;
+	int32	length;
+	void   *data;
+} toast_slice_entry;
+
+static toast_cache_entry *toast_cache_lookup(Oid toastrelid, Oid valueid);
+static void toast_cache_add_entry(Oid toastrelid, Oid valueid, struct varlena *attr);
+static void
+toast_cache_add_slice_entry(Oid toastrelid, Oid valueid, struct varlena *attr,
+						 int32 sliceoffset, int32 slicelength);
+void toast_cache_add_slice_datum(Oid toastrelid, Oid valueid, struct varlena *attr, int32 offset, int32 length);
+static void toast_cache_remove_entry(Oid toastrelid, Oid valueid);
+static struct varlena *
+toast_cache_lookup_slice(Oid toastrelid, Oid valueid, int32 offset, int32 length);
+
 /* ----------
  * detoast_external_attr -
  *
@@ -51,7 +90,24 @@ detoast_external_attr(struct varlena *attr)
 		/*
 		 * This is an external stored plain value
 		 */
+		struct varatt_external toast_pointer;
+
+		/* Must copy to access aligned fields */
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+		/* first try to lookup in the toast cache */
+		result = toast_cache_lookup_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid);
+		if (result)
+			return result;
+
+		/*
+		 * This is an externally stored datum --- fetch it back from there
+		 */
 		result = toast_fetch_datum(attr);
+
+		if (!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
+			toast_cache_add_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid, result);
+
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -117,6 +173,21 @@ detoast_attr(struct varlena *attr)
 {
 	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
+		struct varlena *result;
+		struct varatt_external toast_pointer;
+
+		/* Get toast pointer */
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+		/* try lookup in the toast cache */
+
+		/* first try to lookup in the toast cache */
+		result = toast_cache_lookup_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid);
+
+		/* cache hit, we're done */
+		if (result)
+			return result;
+
 		/*
 		 * This is an externally stored datum --- fetch it back from there
 		 */
@@ -129,6 +200,8 @@ detoast_attr(struct varlena *attr)
 			attr = toast_decompress_datum(tmp);
 			pfree(tmp);
 		}
+
+		toast_cache_add_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid, attr);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -206,10 +279,12 @@ detoast_attr_slice(struct varlena *attr,
 				   int32 sliceoffset, int32 slicelength)
 {
 	struct varlena *preslice;
-	struct varlena *result;
+	struct varlena *result = NULL;
 	char	   *attrdata;
 	int32		slicelimit;
 	int32		attrsize;
+	Oid toastrelid = InvalidOid;
+	Oid valueid = InvalidOid;
 
 	if (sliceoffset < 0)
 		elog(ERROR, "invalid sliceoffset: %d", sliceoffset);
@@ -229,9 +304,26 @@ detoast_attr_slice(struct varlena *attr,
 
 		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
+
+		/* set the cache key */
+		/* try lookup in the toast cache */
+		toastrelid = toast_pointer.va_toastrelid;
+		valueid = toast_pointer.va_valueid;
+
+		/* first try to lookup in the toast cache */
+		result = toast_cache_lookup_slice(toastrelid, valueid, sliceoffset, slicelength);
+
+		/* cache hit, we're done */
+		if (result)
+			return result;
+
 		/* fast path for non-compressed external datums */
 		if (!VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
-			return toast_fetch_datum_slice(attr, sliceoffset, slicelength);
+		{
+			result = toast_fetch_datum_slice(attr, sliceoffset, slicelength);
+			toast_cache_add_slice_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid, result, sliceoffset, slicelength);
+			return result;
+		}
 
 		/*
 		 * For compressed values, we need to fetch enough slices to decompress
@@ -259,10 +351,13 @@ detoast_attr_slice(struct varlena *attr,
 			 * Fetch enough compressed slices (compressed marker will get set
 			 * automatically).
 			 */
+
 			preslice = toast_fetch_datum_slice(attr, 0, max_size);
 		}
 		else
+		{
 			preslice = toast_fetch_datum(attr);
+		}
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -296,6 +391,14 @@ detoast_attr_slice(struct varlena *attr,
 		else
 			preslice = toast_decompress_datum(tmp);
 
+		if(OidIsValid(toastrelid))
+		{
+			if(slicelimit >= 0)
+				toast_cache_add_slice_datum(toastrelid, valueid, preslice, 0, slicelimit);
+			else
+				toast_cache_add_datum(toastrelid, valueid, preslice);
+		}
+
 		if (tmp != attr)
 			pfree(tmp);
 	}
@@ -644,3 +747,344 @@ toast_datum_size(Datum value)
 	}
 	return result;
 }
+
+static bool
+toast_cache_init(void)
+{
+	HASHCTL		ctl;
+
+	if (!enable_toast_cache)
+		return false;
+
+	/* already initialized */
+	if (toastCache)
+		return true;
+
+	toast_cache_hits = 0;
+	toast_cache_misses = 0;
+
+	/* FIXME should really be per transaction */
+	ToastCacheContext = AllocSetContextCreate(TopTransactionContext,
+											  "TOAST cache context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	ToastSliceContext = AllocSetContextCreate(TopTransactionContext,
+											  "TOAST cached slices context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/* Make a new hash table for the cache */
+	ctl.keysize = sizeof(toast_cache_key);
+	ctl.entrysize = sizeof(toast_cache_entry);
+	ctl.hcxt = ToastCacheContext;
+
+	toastCache = hash_create("TOAST cache",
+							 128, &ctl,
+							 HASH_ELEM | HASH_CONTEXT | HASH_BLOBS);
+
+	return true;
+}
+
+static toast_cache_entry *
+toast_cache_lookup(Oid toastrelid, Oid valueid)
+{
+	toast_cache_key		key;
+	toast_cache_entry  *entry;
+	bool found = false;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return NULL;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_FIND, &found);
+
+	if (found)
+		toast_cache_hits++;
+	else
+	{
+		toast_cache_misses++;
+		return NULL;
+	}
+
+	return entry;
+}
+
+static void
+toast_cache_add_entry(Oid toastrelid, Oid valueid, struct varlena *attr)
+{
+	bool				found = false;
+	ListCell		   *lc = NULL;
+	toast_cache_key		key;
+	toast_cache_entry  *entry;
+	MemoryContext ctx;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_FIND, &found);
+
+	/* replace-only flag means that if value was not found - do not*/
+	if(found)
+	{
+		/* if found while adding entry - we have to free old data and replace it with new one */
+		/* if already have some slices - clean up, full value is stored instead of slices */
+		if(entry->slices != 0)
+		{
+			ctx = MemoryContextSwitchTo(ToastSliceContext);
+
+			foreach(lc, (List *) entry->slices)
+			{
+				toast_slice_entry *slice = (toast_slice_entry *) lfirst(lc);
+				pfree(slice->data);
+			}
+			MemoryContextSwitchTo(ctx);
+		}
+	}
+	else
+	{
+		entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_ENTER_NULL, &found);
+		if(!entry)
+			return;
+	}
+
+	entry->slices = (uintptr_t) NIL;
+
+	entry->size = VARSIZE_ANY_EXHDR(attr);
+	entry->data = MemoryContextAlloc(ToastCacheContext, entry->size);
+	memcpy(entry->data, VARDATA_ANY(attr), entry->size);
+}
+
+static void
+toast_cache_add_slice_entry(Oid toastrelid, Oid valueid, struct varlena *attr,
+						 int32 sliceoffset, int32 slicelength)
+{
+	ListCell		   *lc = NULL;
+	bool				found = false;
+	toast_cache_key		key;
+	toast_cache_entry  *entry;
+	toast_slice_entry   *slice;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_FIND, &found);
+
+	if(found)
+	{
+		toast_cache_hits++;
+	}
+	else
+		toast_cache_misses++;
+
+	if(!found)
+	{
+		List *slices = NIL;
+		slice = (toast_slice_entry *) MemoryContextAlloc(ToastSliceContext, sizeof(toast_slice_entry));
+		slice->offset = sliceoffset;
+		slice->length = slicelength;
+		slice->data = MemoryContextAlloc(ToastSliceContext, slicelength);
+		memcpy(slice->data, VARDATA_ANY(attr), slicelength);
+
+		entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_ENTER_NULL, &found);
+
+		slices = lappend(slices, slice);
+		entry->slices = (uintptr_t) slices;
+		entry->size = slicelength;
+		entry->data = NULL;
+		return;
+	}
+
+	found = false;
+
+	if((List *) entry->slices != NIL)
+	{
+		MemoryContext ctx = MemoryContextSwitchTo(ToastSliceContext);
+
+		foreach(lc, (List *) entry->slices)
+		{
+			toast_slice_entry *lookup_slice = (toast_slice_entry *) lfirst(lc);
+
+			if(sliceoffset >= lookup_slice->offset
+				&& sliceoffset <= (lookup_slice->offset + lookup_slice->length)
+				&& (sliceoffset + slicelength) <= (lookup_slice->offset + lookup_slice->length))
+			{
+				found = true;
+				break;
+			}
+		}
+		MemoryContextSwitchTo(ctx);
+	}
+
+	if(!found)
+	{
+		slice = (toast_slice_entry *) MemoryContextAlloc(ToastSliceContext, sizeof(toast_slice_entry));
+		slice->offset = sliceoffset;
+		slice->length = slicelength;
+		slice->data = MemoryContextAlloc(ToastSliceContext, slicelength);
+		memcpy(slice->data, VARDATA_ANY(attr), slicelength);
+
+		entry->slices = (uintptr_t) lappend((List *) entry->slices, slice);
+		entry->size += slicelength;
+	}
+}
+
+static void
+toast_cache_remove_entry(Oid toastrelid, Oid valueid)
+{
+	ListCell 		   *lc = NULL;
+	bool				found = false;
+	toast_cache_key		key;
+	toast_cache_entry  *entry = NULL;
+
+	/* make sure cache initialized */
+	if (!toast_cache_init())
+		return;
+
+	key.toastrelid = toastrelid;
+	key.valueid = valueid;
+
+	entry = (toast_cache_entry *) hash_search(toastCache, &key,
+											  HASH_REMOVE, &found);
+
+	if(entry)
+	{
+		toast_cache_hits++;
+		if(entry->slices != 0)
+		{
+		foreach(lc, (List *) entry->slices)
+		{
+			toast_slice_entry *slice = (toast_slice_entry *) lfirst(lc);
+			pfree(slice->data);
+		}
+		}
+		pfree(entry->data);
+	}
+	else
+		toast_cache_misses++;
+}
+
+static void
+toast_cache_destroy(void)
+{
+	if (!toastCache)
+		return;
+
+	elog(LOG, "AtEOXact_ToastCache hits %ld misses %ld ratio %.2f%%",
+		 toast_cache_hits, toast_cache_misses,
+		 (toast_cache_hits * 100.0) / Max(1, toast_cache_hits + toast_cache_misses));
+
+	MemoryContextDelete(ToastCacheContext);
+	MemoryContextDelete(ToastSliceContext);
+
+	toastCache = NULL;
+	ToastCacheContext = NULL;
+	ToastSliceContext = NULL;
+}
+
+void
+AtEOXact_ToastCache(void)
+{
+	toast_cache_destroy();
+}
+
+struct varlena *
+toast_cache_lookup_datum(Oid toastrelid, Oid valueid)
+{
+	toast_cache_entry *entry;
+	struct varlena *result = NULL;
+
+	entry = toast_cache_lookup(toastrelid, valueid);
+
+	if (!entry)
+		return NULL;
+
+	if(entry->data == NULL)
+		return NULL;
+
+	result = (struct varlena *) palloc(entry->size + VARHDRSZ);
+	SET_VARSIZE(result, entry->size + VARHDRSZ);
+
+	memcpy(VARDATA(result), entry->data, entry->size);
+
+	return result;
+}
+
+static struct varlena *
+toast_cache_lookup_slice(Oid toastrelid, Oid valueid, int32 offset, int32 length)
+{
+	ListCell	*lc = NULL;
+	toast_cache_entry *entry;
+	struct varlena *result = NULL;
+
+	/* result must be pre-alloc'ed by caller */
+	entry = toast_cache_lookup(toastrelid, valueid);
+
+	if (!entry)
+		return NULL;
+
+	if(entry->data != NULL)
+	{
+		result = (struct varlena *) palloc(length + VARHDRSZ);
+		SET_VARSIZE(result, length + VARHDRSZ);
+		memcpy(VARDATA(result), ((char *) entry->data + offset), length);
+		return result;
+	}
+
+	if(entry->slices == 0 || (List *) entry->slices == NIL)
+		return NULL;
+
+	if((List *) entry->slices != NIL)
+	{
+		foreach(lc, (List *) entry->slices)
+		{
+
+			toast_slice_entry *slice = (toast_slice_entry *) lfirst(lc);
+
+			if(offset >= slice->offset
+				&& (offset + length) <= (slice->offset + slice->length))
+			{
+				result = (struct varlena *) palloc(length + VARHDRSZ);
+				SET_VARSIZE(result, length + VARHDRSZ);
+				memcpy(VARDATA(result), ((char *) slice->data + offset - slice->offset), length);
+				break;
+			}
+			else
+				continue;
+		}
+	}
+
+	return result;
+}
+
+void
+toast_cache_add_slice_datum(Oid toastrelid, Oid valueid, struct varlena *attr, int32 offset, int32 length)
+{
+	toast_cache_add_slice_entry(toastrelid, valueid, attr, offset, length);
+}
+
+void
+toast_cache_remove_datum(Oid toastrelid, Oid valueid)
+{
+	toast_cache_remove_entry(toastrelid, valueid);
+}
+
+void
+toast_cache_add_datum(Oid toastrelid, Oid valueid, struct varlena *attr)
+{
+	return toast_cache_add_entry(toastrelid, valueid, attr);
+}
diff --git a/src/backend/access/table/toast_helper.c b/src/backend/access/table/toast_helper.c
index 3bcde2ca1b..c93800a61b 100644
--- a/src/backend/access/table/toast_helper.c
+++ b/src/backend/access/table/toast_helper.c
@@ -260,9 +260,33 @@ toast_tuple_externalize(ToastTupleContext *ttc, int attribute, int options)
 	Datum		old_value = *value;
 	ToastAttrInfo *attr = &ttc->ttc_attr[attribute];
 
+	/*
+	 * Should we cache any value moved out-of-line?
+	 * For the sake of using it later in transaction
+	 * it makes sense, but in case of mass insertions
+	 * we could quickly run out of cache memory
+	 * so we cache value only if the value with this
+	 * key is already there - we consider it as been
+	 * used previously
+	 */
+	if(value
+		&& attr->tai_oldexternal != NULL
+		&& VARATT_IS_EXTERNAL_ONDISK(attr->tai_oldexternal))
+	{
+		struct varlena *tmp = NULL;
+		struct varatt_external toast_pointer;
+		VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr->tai_oldexternal);
+
+		/* add value to the toast cache with replace-only flag*/
+		tmp = toast_cache_lookup_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid);
+		if(tmp)
+			toast_cache_add_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid, (struct varlena *) value);
+	}
+
 	attr->tai_colflags |= TOASTCOL_IGNORE;
 	*value = toast_save_datum(ttc->ttc_rel, old_value, attr->tai_oldexternal,
 							  options);
+
 	if ((attr->tai_colflags & TOASTCOL_NEEDS_FREE) != 0)
 		pfree(DatumGetPointer(old_value));
 	attr->tai_colflags |= TOASTCOL_NEEDS_FREE;
@@ -306,7 +330,17 @@ toast_tuple_cleanup(ToastTupleContext *ttc)
 			ToastAttrInfo *attr = &ttc->ttc_attr[i];
 
 			if ((attr->tai_colflags & TOASTCOL_NEEDS_DELETE_OLD) != 0)
+			{
+				if(VARATT_IS_EXTERNAL(ttc->ttc_oldvalues[i]))
+				{
+					struct varatt_external toast_pointer;
+					VARATT_EXTERNAL_GET_POINTER(toast_pointer, ttc->ttc_oldvalues[i]);
+
+					toast_cache_remove_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid);
+				}
+
 				toast_delete_datum(ttc->ttc_rel, ttc->ttc_oldvalues[i], false);
+			}
 		}
 	}
 }
@@ -332,7 +366,14 @@ toast_delete_external(Relation rel, const Datum *values, const bool *isnull,
 			if (isnull[i])
 				continue;
 			else if (VARATT_IS_EXTERNAL_ONDISK(value))
+			{
+				struct varatt_external toast_pointer;
+				VARATT_EXTERNAL_GET_POINTER(toast_pointer, value);
+
+				toast_cache_remove_datum(toast_pointer.va_toastrelid, toast_pointer.va_valueid);
+
 				toast_delete_datum(rel, value, is_speculative);
+			}
 		}
 	}
 }
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 464858117e..f00e3b63a1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/detoast.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -2373,6 +2374,7 @@ CommitTransaction(void)
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
 	AtEOXact_LogicalRepWorkers(true);
+	AtEOXact_ToastCache();
 	pgstat_report_xact_timestamp(0);
 
 	ResourceOwnerDelete(TopTransactionResourceOwner);
@@ -2659,6 +2661,7 @@ PrepareTransaction(void)
 	/* we treat PREPARE as ROLLBACK so far as waking workers goes */
 	AtEOXact_ApplyLauncher(false);
 	AtEOXact_LogicalRepWorkers(false);
+	AtEOXact_ToastCache();
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2872,6 +2875,7 @@ AbortTransaction(void)
 		AtEOXact_PgStat(false, is_parallel_worker);
 		AtEOXact_ApplyLauncher(false);
 		AtEOXact_LogicalRepWorkers(false);
+		AtEOXact_ToastCache();
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7fe58518d7..1bf0325b78 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/detoast.h"
 #include "access/gin.h"
 #include "access/toast_compression.h"
 #include "access/twophase.h"
@@ -1060,6 +1061,16 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_toast_cache", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables caching of detoasted values."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&enable_toast_cache,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 12d8cdb356..832f0aa1ca 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -12,6 +12,9 @@
 #ifndef DETOAST_H
 #define DETOAST_H
 
+#include "postgres.h"
+#include "fmgr.h"
+
 /*
  * Macro to fetch the possibly-unaligned contents of an EXTERNAL datum
  * into a local "struct varatt_external" toast pointer.  This should be
@@ -79,4 +82,12 @@ extern Size toast_raw_datum_size(Datum value);
  */
 extern Size toast_datum_size(Datum value);
 
+extern PGDLLIMPORT bool enable_toast_cache;
+extern void AtEOXact_ToastCache(void);
+
+extern struct varlena *toast_cache_lookup_datum(Oid toastrelid, Oid valueid);
+extern void toast_cache_add_datum(Oid toastrelid, Oid valueid, struct varlena *attr);
+extern void toast_cache_add_slice_datum(Oid toastrelid, Oid valueid, struct varlena *attr, int32 offset, int32 length);
+extern void toast_cache_remove_datum(Oid toastrelid, Oid valueid);
+
 #endif							/* DETOAST_H */
-- 
2.25.1

#44Robert Haas
robertmhaas@gmail.com
In reply to: Nikita Malakhov (#43)
Re: Shared detoast Datum proposal

Hi,

Andy Fan asked me off-list for some feedback about this proposal. I
have hesitated to comment on it for lack of having studied the matter
in any detail, but since I've been asked for my input, here goes:

I agree that there's a problem here, but I suspect this is not the
right way to solve it. Let's compare this to something like the
syscache. Specifically, let's think about the syscache on
pg_class.relname.

In the case of the syscache on pg_class.relname, there are two reasons
why we might repeatedly look up the same values in the cache. One is
that the code might do multiple name lookups when it really ought to
do only one. Doing multiple lookups is bad for security and bad for
performance, but we have code like that in some places and some of it
is hard to fix. The other reason is that it is likely that the user
will mention the same table names repeatedly; each time they do, they
will trigger a lookup on the same name. By caching the result of the
lookup, we can make it much faster. An important point to note here is
that the queries that the user will issue in the future are unknown to
us. In a certain sense, we cannot even know whether the same table
name will be mentioned more than once in the same query: when we reach
the first instance of the table name and look it up, we do not have
any good way of knowing whether it will be mentioned again later, say
due to a self-join. Hence, the pattern of cache lookups is
fundamentally unknowable.

But that's not the case in the motivating example that started this
thread. In that case, the target list includes three separate
references to the same column. We can therefore predict that if the
column value is toasted, we're going to de-TOAST it exactly three
times. If, on the other hand, the column value were mentioned only
once, it would be detoasted just once. In that latter case, which is
probably quite common, this whole cache idea seems like it ought to
just waste cycles, which makes me suspect that no matter what is done
to this patch, there will always be cases where it causes significant
regressions. In the former case, where the column reference is
repeated, it will win, but it will also hold onto the detoasted value
after the third instance of detoasting, even though there will never
be any more cache hits. Here, unlike the syscache case, the cache is
thrown away at the end of the query, so future queries can't benefit.
Even if we could find a way to preserve the cache in some cases, it's
not very likely to pay off. People are much more likely to hit the
same table more than once than to hit the same values in the same
table more than once in the same session.

Suppose we had a design where, when a value got detoasted, the
detoasted value went into the place where the toasted value had come
from. Then this problem would be handled pretty much perfectly: the
detoasted value would last until the scan advanced to the next row,
and then it would be thrown out. So there would be no query-lifespan
memory leak. Now, I don't really know how this would work, because the
TOAST pointer is coming out of a slot and we can't necessarily write
one attribute of any arbitrary slot type without a bunch of extra
overhead, but maybe there's some way. If it could be done cheaply
enough, it would gain all the benefits of this proposal a lot more
cheaply and with fewer downsides. Alternatively, maybe there's a way
to notice the multiple references and introduce some kind of
intermediate slot or other holding area where the detoasted value can
be stored.

In other words, instead of computing

big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'

Maybe we ought to be trying to compute this:

big_jsonb_col=detoast(big_jsonb_col)
big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'

Or perhaps this:

tmp=detoast(big_jsonb_col)
tmp->'a', tmp->'b', tmp->'c'

In still other words, a query-lifespan cache for this information
feels like it's tackling the problem at the wrong level. I do suspect
there may be query shapes where the same value gets detoasted multiple
times in ways that the proposals I just made wouldn't necessarily
catch, such as in the case of a self-join. But I think those cases are
rare. In most cases, repeated detoasting is probably very localized,
with the same exact TOAST pointer being de-TOASTed a couple of times
in quick succession, so a query-lifespan cache seems like the wrong
thing. Also, in most cases, repeated detoasting is probably quite
predictable: we could infer it from static analysis of the query. So
throwing ANY kind of cache at the problem seems like the wrong
approach unless the cache is super-cheap, which doesn't seem to be the
case with this proposal, because a cache caters to cases where we
can't know whether we're going to recompute the same result, and here,
that seems like something we could probably figure out.

Because I don't see any way to avoid regressions in the common case
where detoasting isn't repeated, I think this patch is likely never
going to be committed no matter how much time anyone spends on it.

But, to repeat the disclaimer from the top, all of the above is just a
relatively uneducated opinion without any detailed study of the
matter.

...Robert

#45Andy Fan
zhihuifan1213@163.com
In reply to: Robert Haas (#44)
Re: Shared detoast Datum proposal

Hi Robert,

Andy Fan asked me off-list for some feedback about this proposal. I
have hesitated to comment on it for lack of having studied the matter
in any detail, but since I've been asked for my input, here goes:

Thanks for doing this! Since we have two totally different designs
between Tomas and me and I think we both understand the two sides of
each design, just that the following things are really puzzled to me.

- which factors we should care more.
- the possiblility to improve the design-A/B for its badness.

But for the point 2, we probably needs some prefer design first.

for now let's highlight that there are 2 totally different method to
handle this problem. Let's call:

design a: detoast the datum into slot->tts_values (and manage the
memory with the lifespan of *slot*). From me.

design b: detost the datum into a CACHE (and manage the memory with
CACHE eviction and at released the end-of-query). From Tomas.

Currently I pefer design a, I'm not sure if I want desgin a because a is
really good or just because design a is my own design. So I will
summarize the the current state for the easier discussion. I summarize
them so that you don't need to go to the long discussion, and I summarize
the each point with the link to the original post since I may
misunderstand some points and the original post can be used as a double
check.

I agree that there's a problem here, but I suspect this is not the
right way to solve it. Let's compare this to something like the
syscache. Specifically, let's think about the syscache on
pg_class.relname.

OK.

In the case of the syscache on pg_class.relname, there are two reasons
why we might repeatedly look up the same values in the cache. One is
that the code might do multiple name lookups when it really ought to
do only one. Doing multiple lookups is bad for security and bad for
performance, but we have code like that in some places and some of it
is hard to fix. The other reason is that it is likely that the user
will mention the same table names repeatedly; each time they do, they
will trigger a lookup on the same name. By caching the result of the
lookup, we can make it much faster. An important point to note here is
that the queries that the user will issue in the future are unknown to
us. In a certain sense, we cannot even know whether the same table
name will be mentioned more than once in the same query: when we reach
the first instance of the table name and look it up, we do not have
any good way of knowing whether it will be mentioned again later, say
due to a self-join. Hence, the pattern of cache lookups is
fundamentally unknowable.

True.

But that's not the case in the motivating example that started this
thread. In that case, the target list includes three separate
references to the same column. We can therefore predict that if the
column value is toasted, we're going to de-TOAST it exactly three
times.

True.

If, on the other hand, the column value were mentioned only
once, it would be detoasted just once. In that latter case, which is
probably quite common, this whole cache idea seems like it ought to
just waste cycles, which makes me suspect that no matter what is done
to this patch, there will always be cases where it causes significant
regressions.

I agree.

In the former case, where the column reference is
repeated, it will win, but it will also hold onto the detoasted value
after the third instance of detoasting, even though there will never
be any more cache hits.

True.

Here, unlike the syscache case, the cache is
thrown away at the end of the query, so future queries can't benefit.
Even if we could find a way to preserve the cache in some cases, it's
not very likely to pay off. People are much more likely to hit the
same table more than once than to hit the same values in the same
table more than once in the same session.

I agree.

One more things I want to highlight it "syscache" is used for metadata
and *detoast cache* is used for user data. user data is more
likely bigger than metadata, so cache size control (hence eviction logic
run more often) is more necessary in this case. The first graph in [1]/messages/by-id/87ttllbd00.fsf@163.com
(including the quota text) highlight this idea.

Suppose we had a design where, when a value got detoasted, the
detoasted value went into the place where the toasted value had come
from. Then this problem would be handled pretty much perfectly: the
detoasted value would last until the scan advanced to the next row,
and then it would be thrown out. So there would be no query-lifespan
memory leak.

I think that is same idea as design a.

Now, I don't really know how this would work, because the
TOAST pointer is coming out of a slot and we can't necessarily write
one attribute of any arbitrary slot type without a bunch of extra
overhead, but maybe there's some way.

I think the key points includes:

1. Where to put the *detoast value* so that ExprEval system can find out
it cheaply. The soluation in A slot->tts_values[*].

2. How to manage the memory for holding the detoast value.

The soluation in A is:

- using a lazy created MemoryContext in slot.
- let the detoast system detoast the datum into the designed
MemoryContext.

3. How to let Expr evalution run the extra cycles when needed.

The soluation in A is:

EEOP_XXX_TOAST step is designed to get riding of some *run-time* *if
(condition)" . I found the existing ExprEval system have many
specialized steps.

I will post a detailed overall design later and explain the its known
badness so far.

If it could be done cheaply enough, it would gain all the benefits of
this proposal a lot more cheaply and with fewer
downsides.

I agree.

Alternatively, maybe there's a way
to notice the multiple references and introduce some kind of
intermediate slot or other holding area where the detoasted value can
be stored.

In other words, instead of computing

big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'

Maybe we ought to be trying to compute this:

big_jsonb_col=detoast(big_jsonb_col)
big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'

Or perhaps this:

tmp=detoast(big_jsonb_col)
tmp->'a', tmp->'b', tmp->'c'

Current the design a doesn't use 'tmp' so that the existing ExprEval
engine can find out "detoast datum" easily, so it is more like:

big_jsonb_col=detoast(big_jsonb_col)
big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'

[1]: /messages/by-id/87ttllbd00.fsf@163.com

--
Best Regards
Andy Fan

#46Andy Fan
zhihuifan1213@163.com
In reply to: Andy Fan (#45)
Re: Shared detoast Datum proposal

Andy Fan <zhihuifan1213@163.com> writes:

Hi Robert,

Andy Fan asked me off-list for some feedback about this proposal. I
have hesitated to comment on it for lack of having studied the matter
in any detail, but since I've been asked for my input, here goes:

Thanks for doing this! Since we have two totally different designs
between Tomas and me and I think we both understand the two sides of
each design, just that the following things are really puzzled to me.

my emacs was stuck and somehow this incompleted message was sent out,
Please ignore this post for now. Sorry for this.

--
Best Regards
Andy Fan

#47Nikita Malakhov
hukutoc@gmail.com
In reply to: Andy Fan (#46)
Re: Shared detoast Datum proposal

Hi,

Robert, thank you very much for your response to this thread.
I agree with most of things you've mentioned, but this proposal
is a PoC, and anyway has a long way to go to be presented
(if it ever would) as something to be committed.

Andy, glad you've not lost interest in this work, I'm looking
forward to your improvements!

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#48Robert Haas
robertmhaas@gmail.com
In reply to: Andy Fan (#45)
Re: Shared detoast Datum proposal

On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote:

One more things I want to highlight it "syscache" is used for metadata
and *detoast cache* is used for user data. user data is more
likely bigger than metadata, so cache size control (hence eviction logic
run more often) is more necessary in this case. The first graph in [1]
(including the quota text) highlight this idea.

Yes.

I think that is same idea as design a.

Sounds similar.

I think the key points includes:

1. Where to put the *detoast value* so that ExprEval system can find out
it cheaply. The soluation in A slot->tts_values[*].

2. How to manage the memory for holding the detoast value.

The soluation in A is:

- using a lazy created MemoryContext in slot.
- let the detoast system detoast the datum into the designed
MemoryContext.

3. How to let Expr evalution run the extra cycles when needed.

I don't understand (3) but I agree that (1) and (2) are key points. In
many - nearly all - circumstances just overwriting slot->tts_values is
unsafe and will cause problems. To make this work, we've got to find
some way of doing this that is compatible with general design of
TupleTableSlot, and I think in that API, just about the only time you
can touch slot->tts_values is when you're trying to populate a virtual
slot. But often we'll have some other kind of slot. And even if we do
have a virtual slot, I think you're only allowed to manipulate
slot->tts_values up until you call ExecStoreVirtualTuple.

That's why I mentioned using something temporary. You could legally
(a) materialize the existing slot, (b) create a new virtual slot, (c)
copy over the tts_values and tts_isnull arrays, (c) detoast any datums
you like from tts_values and store the new datums back into the array,
(d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
of the original one. That would avoid repeated detoasting, but it
would also have a bunch of overhead. Having to fully materialize the
existing slot is a cost that you wouldn't always need to pay if you
didn't do this, but you have to do it to make this work. Creating the
new virtual slot costs something. Copying the arrays costs something.
Detoasting costs quite a lot potentially, if you don't end up using
the values. If you end up needing a heap tuple representation of the
slot, you're going to now have to make a new one rather than just
using the one that came straight from the buffer.

So I don't think we could do something like this unconditionally.
We'd have to do it only when there was a reasonable expectation that
it would work out, which means we'd have to be able to predict fairly
accurately whether it was going to work out. But I do agree with you
that *if* we could make it work, it would probably be better than a
longer-lived cache.

--
Robert Haas
EDB: http://www.enterprisedb.com

#49Andy Fan
zhihuifan1213@163.com
In reply to: Robert Haas (#48)
1 attachment(s)
Re: Shared detoast Datum proposal

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213@163.com> wrote:

One more things I want to highlight it "syscache" is used for metadata
and *detoast cache* is used for user data. user data is more
likely bigger than metadata, so cache size control (hence eviction logic
run more often) is more necessary in this case. The first graph in [1]
(including the quota text) highlight this idea.

Yes.

I think that is same idea as design a.

Sounds similar.

This is really great to know!

I think the key points includes:

1. Where to put the *detoast value* so that ExprEval system can find out
it cheaply. The soluation in A slot->tts_values[*].

2. How to manage the memory for holding the detoast value.

The soluation in A is:

- using a lazy created MemoryContext in slot.
- let the detoast system detoast the datum into the designed
MemoryContext.

3. How to let Expr evalution run the extra cycles when needed.

I don't understand (3) but I agree that (1) and (2) are key points. In
many - nearly all - circumstances just overwriting slot->tts_values is
unsafe and will cause problems. To make this work, we've got to find
some way of doing this that is compatible with general design of
TupleTableSlot, and I think in that API, just about the only time you
can touch slot->tts_values is when you're trying to populate a virtual
slot. But often we'll have some other kind of slot. And even if we do
have a virtual slot, I think you're only allowed to manipulate
slot->tts_values up until you call ExecStoreVirtualTuple.

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from tts->tts_values[*], no
matter what kind of TupleTableSlot is. So if we can put the "detoast
datum" back to the "original " tts_values, then nothing need to be
changed.

Aside from efficiency considerations, security and rationality are also
important considerations. So what I think now is:

- tts_values[*] means the Datum from the slot, and it has its operations
like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot,
ExecClearTuple and so on. All of the characters have nothing with what
kind of slot it is.

- Saving the "detoast datum" version into tts_values[*] doesn't break
the above semantic and the ExprEval engine just get a detoast version
so it doesn't need to detoast it again.

- The keypoint is the memory management and effeiciency. for now I think
all the places where "slot->tts_nvalid" is set to 0 means the
tts_values[*] is no longer validate, then this is the place we should
release the memory for the "detoast datum". All the other places like
ExecMaterializeSlot or ExecCopySlot also need to think about the
"detoast datum" as well.

That's why I mentioned using something temporary. You could legally
(a) materialize the existing slot, (b) create a new virtual slot, (c)
copy over the tts_values and tts_isnull arrays, (c) detoast any datums
you like from tts_values and store the new datums back into the array,
(d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
of the original one. That would avoid repeated detoasting, but it
would also have a bunch of overhead.

Yes, I agree with the overhead, so I perfer to know if the my current
strategy has principle issue first.

Having to fully materialize the
existing slot is a cost that you wouldn't always need to pay if you
didn't do this, but you have to do it to make this work. Creating the
new virtual slot costs something. Copying the arrays costs something.
Detoasting costs quite a lot potentially, if you don't end up using
the values. If you end up needing a heap tuple representation of the
slot, you're going to now have to make a new one rather than just
using the one that came straight from the buffer.

But I do agree with you
that *if* we could make it work, it would probably be better than a
longer-lived cache.

I noticed the "if" and great to know that:), speically for the efficiecy
& memory usage control purpose.

Another big challenge of this is how to decide if a Var need this
pre-detoasting strategy, we have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is
harmful for CP_SMALL_TLIST (createplan.c) purpose.
- ExprEval runtime checking if the Var is toastable is an overhead. It
is must be decided ExecInitExpr time.

The code complexity because of the above 2 factors makes people (include
me) unhappy.

===
Yesterday I did some worst case testing for the current strategy, and
the first case show me ~1% *unexpected* performance regression and I
tried to find out it with perf record/report, and no lucky. that's the
reason why I didn't send a complete post yesterday.

As for now, since we are talking about the high level design, I think
it is OK to post the improved design document and the incompleted worst
case testing data and my planning.

Current Design
--------------

The high level design is let createplan.c and setref.c decide which
Vars can use this feature, and then the executor save the detoast
datum back slot->to tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons includes:

- The existing expression engine read datum from tts_values[*], no any
extra work need to be done.
- Reuse the lifespan of TupleTableSlot system to manage memory. It
is natural to think the detoast datum is a tts_value just that it is
in a detoast format. Since we have a clear lifespan for TupleTableSlot
already, like ExecClearTuple, ExecCopySlot et. We are easy to reuse
them for managing the datoast datum's memory.
- The existing projection method can copy the datoasted datum (int64)
automatically to the next node's slot, but keeping the ownership
unchanged, so only the slot where the detoast really happen take the
charge of it's lifespan.

Assuming which Var should use this feature has been decided in
createplan.c and setref.c already. The following strategy is used to
reduce the run time overhead.

1. Putting the detoast datum into tts->tts_values[*]. then the ExprEval
system doesn't need any extra effort to find them.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
if (unlikely(slot->tts_data_mctx == NULL))
slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
"tts_value_ctx",
ALLOCSET_START_SMALL_SIZES);
slot->tts_values[attnum] = PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
/* save the detoast value to the given MemoryContext. */
slot->tts_data_mctx));
}
}

2. Using a dedicated lazy created memory context under TupleTableSlot so
that the memory can be released effectively.

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
if (slot->tts_data_mctx)
MemoryContextResetOnly(slot->tts_data_mctx);
}

3. Control the memory usage by releasing them whenever the
slot->tts_values[*] is deemed as invalidate, so the lifespan is
likely short.

4. Reduce the run-time overhead for checking if a VAR should use
pre-detoast feature, 3 new steps are introduced.
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, so that other Var is totally non
related.

Now comes to the createplan.c/setref.c part, which decides which Vars
should use the shared detoast feature. The guideline of this is:

1. It needs a detoast for a given expression in the previous logic.
2. It should not breaks the CP_SMALL_TLIST design. Since we saved the
detoast datum back to tts_values[*], which make tuple bigger. if we
do this blindly, it would be harmful to the ORDER / HASH style nodes.

A high level data flow is:

1. at the createplan.c, we walk the plan tree go gather the
CP_SMALL_TLIST because of SORT/HASH style nodes, information and save
it to Plan.forbid_pre_detoast_vars via the function
set_plan_forbid_pre_detoast_vars_recurse.

2. at the setrefs.c, fix_{scan|join}_expr will recurse to Var for each
expression, so it is a good time to track the attribute number and
see if the Var is directly or indirectly accessed. Usually the
indirectly access a Var means a detoast would happens, for
example an expression like a > 3. However some known expressions is
ignored. for example: NullTest, pg_column_compression which needs the
raw datum, start_with/sub_string which needs a slice
detoasting. Currently there is some hard code here, we may needs a
pg_proc.detoasting_requirement flags to make this generic. The
output is {Scan|Join}.xxx_reference_attrs;

Note that here I used '_reference_' rather than '_detoast_' is because
at this part, I still don't know if it is a toastable attrbiute, which
is known at the MakeTupleTableSlot stage.

3. At the InitPlan Stage, we calculate the final xxx_pre_detoast_attrs
in ScanState & JoinState, which will be passed into expression
engine in the ExecInitExprRec stage and EEOP_{INNER|OUTER|SCAN}
_VAR_TOAST steps are generated finally then everything is connected
with ExecSlotDetoastDatum!

Bad case testing of current design:
====================================

- The extra effort of createplan.c & setref.c & execExpr.c & InitPlan :

CREATE TABLE t(a int, b numeric);

q1:
explain select * from t where a > 3;

q2:
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_nestloop to on;

explain select * from t join t t2 using(a);

q3:
set enable_hashjoin to off;
set enable_mergejoin to on;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

q4:
set enable_hashjoin to on;
set enable_mergejoin to off;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

pgbench -f x.sql -n postgres -M simple -T10

| query ID | patched (ms) | master (ms) | regression(%) |
|----------+-------------------------------------+-------------------------------------+---------------|
| 1 | {0.088, 0.088, 0.088, 0.087, 0.089} | {0.087, 0.086, 0.086, 0.087, 0.087} | 1.14% |
| 2 | not-done-yet | | |
| 3 | not-done-yet | | |
| 4 | not-done-yet | | |

- The extra effort of ExecInterpExpr

insert into t select i, i FROM generate_series(1, 100000) i;

SELECT count(b) from t WHERE b > 0;

In this query, the extra execution code is run but it doesn't buy anything.

| master | patched | regression |
|--------+---------+------------|
| | | |

What I am planning now is:
- Gather more feedback on the overall strategy.
- figure out the 1% unexpected regression. I don't have a clear
direction for now, my current expression is analyzing the perf report,
and I can't find out the root cause. and reading the code can't find out
the root cause as well.
- Simplifying the "planner part" for deciding which Var should use this
strategy. I don't have clear direction as well.
- testing and improving more worst case.

Attached is a appliable and testingable version against current master
(e87e73245).

--
Best Regards
Andy Fan

Attachments:

v10-0001-shared-detoast-datum.patchtext/x-diffDownload
From 7d17a8c9110b7e944a6431887c9cd38a4d44d8b2 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" <yizhi.fzh@alibaba-inc.com>
Date: Sun, 3 Mar 2024 13:48:25 +0800
Subject: [PATCH v10 1/1] shared detoast datum

See the overall design & alternative design & testing in

src/backend/executor/shared_detoast_datum.org
---
 src/backend/access/common/detoast.c           |  68 ++-
 src/backend/access/common/toast_compression.c |  10 +-
 src/backend/executor/execExpr.c               |  60 +-
 src/backend/executor/execExprInterp.c         | 179 ++++++
 src/backend/executor/execTuples.c             | 131 ++++
 src/backend/executor/execUtils.c              |   2 +
 src/backend/executor/nodeHashjoin.c           |   2 +
 src/backend/executor/nodeMergejoin.c          |   2 +
 src/backend/executor/nodeNestloop.c           |   1 +
 src/backend/executor/shared_detoast_datum.org | 264 ++++++++
 src/backend/jit/llvm/llvmjit_expr.c           |  26 +-
 src/backend/jit/llvm/llvmjit_types.c          |   1 +
 src/backend/optimizer/plan/createplan.c       | 107 +++-
 src/backend/optimizer/plan/setrefs.c          | 577 +++++++++++++++---
 src/include/access/detoast.h                  |   3 +
 src/include/access/toast_compression.h        |   4 +-
 src/include/executor/execExpr.h               |  12 +
 src/include/executor/tuptable.h               |  14 +
 src/include/nodes/execnodes.h                 |  14 +
 src/include/nodes/plannodes.h                 |  53 ++
 src/tools/pgindent/typedefs.list              |   2 +
 21 files changed, 1393 insertions(+), 139 deletions(-)
 create mode 100644 src/backend/executor/shared_detoast_datum.org

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56..acc9644689 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -22,11 +22,11 @@
 #include "utils/expandeddatum.h"
 #include "utils/rel.h"
 
-static struct varlena *toast_fetch_datum(struct varlena *attr);
+static struct varlena *toast_fetch_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 											   int32 sliceoffset,
 											   int32 slicelength);
-static struct varlena *toast_decompress_datum(struct varlena *attr);
+static struct varlena *toast_decompress_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);
 
 /* ----------
@@ -42,7 +42,7 @@ static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32
  * ----------
  */
 struct varlena *
-detoast_external_attr(struct varlena *attr)
+detoast_external_attr_ext(struct varlena *attr, MemoryContext ctx)
 {
 	struct varlena *result;
 
@@ -51,7 +51,7 @@ detoast_external_attr(struct varlena *attr)
 		/*
 		 * This is an external stored plain value
 		 */
-		result = toast_fetch_datum(attr);
+		result = toast_fetch_datum(attr, ctx);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -68,13 +68,13 @@ detoast_external_attr(struct varlena *attr)
 
 		/* recurse if value is still external in some other way */
 		if (VARATT_IS_EXTERNAL(attr))
-			return detoast_external_attr(attr);
+			return detoast_external_attr_ext(attr, ctx);
 
 		/*
 		 * Copy into the caller's memory context, in case caller tries to
 		 * pfree the result.
 		 */
-		result = (struct varlena *) palloc(VARSIZE_ANY(attr));
+		result = (struct varlena *) MemoryContextAlloc(ctx, VARSIZE_ANY(attr));
 		memcpy(result, attr, VARSIZE_ANY(attr));
 	}
 	else if (VARATT_IS_EXTERNAL_EXPANDED(attr))
@@ -87,7 +87,7 @@ detoast_external_attr(struct varlena *attr)
 
 		eoh = DatumGetEOHP(PointerGetDatum(attr));
 		resultsize = EOH_get_flat_size(eoh);
-		result = (struct varlena *) palloc(resultsize);
+		result = (struct varlena *) MemoryContextAlloc(ctx, resultsize);
 		EOH_flatten_into(eoh, (void *) result, resultsize);
 	}
 	else
@@ -101,32 +101,45 @@ detoast_external_attr(struct varlena *attr)
 	return result;
 }
 
+struct varlena *
+detoast_external_attr(struct varlena *attr)
+{
+	return detoast_external_attr_ext(attr, CurrentMemoryContext);
+}
+
 
 /* ----------
- * detoast_attr -
+ * detoast_attr_ext -
  *
  *	Public entry point to get back a toasted value from compression
  *	or external storage.  The result is always non-extended varlena form.
  *
+ * ctx: The memory context which the final value belongs to.
+ *
  * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
  * datum, the result will be a pfree'able chunk.
  * ----------
  */
-struct varlena *
-detoast_attr(struct varlena *attr)
+
+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
 {
 	if (VARATT_IS_EXTERNAL_ONDISK(attr))
 	{
 		/*
 		 * This is an externally stored datum --- fetch it back from there
 		 */
-		attr = toast_fetch_datum(attr);
+		attr = toast_fetch_datum(attr, ctx);
 		/* If it's compressed, decompress it */
 		if (VARATT_IS_COMPRESSED(attr))
 		{
 			struct varlena *tmp = attr;
 
-			attr = toast_decompress_datum(tmp);
+			attr = toast_decompress_datum(tmp, ctx);
+			/*
+			 * XXX: this pfree block us from using BumpContext directly
+			 * we need some extra effort to make it happen at least.
+			 */
 			pfree(tmp);
 		}
 	}
@@ -144,14 +157,14 @@ detoast_attr(struct varlena *attr)
 		Assert(!VARATT_IS_EXTERNAL_INDIRECT(attr));
 
 		/* recurse in case value is still extended in some other way */
-		attr = detoast_attr(attr);
+		attr = detoast_attr_ext(attr, ctx);
 
 		/* if it isn't, we'd better copy it */
 		if (attr == (struct varlena *) redirect.pointer)
 		{
 			struct varlena *result;
 
-			result = (struct varlena *) palloc(VARSIZE_ANY(attr));
+			result = (struct varlena *) MemoryContextAlloc(ctx, VARSIZE_ANY(attr));
 			memcpy(result, attr, VARSIZE_ANY(attr));
 			attr = result;
 		}
@@ -161,7 +174,7 @@ detoast_attr(struct varlena *attr)
 		/*
 		 * This is an expanded-object pointer --- get flat format
 		 */
-		attr = detoast_external_attr(attr);
+		attr = detoast_external_attr_ext(attr, ctx);
 		/* flatteners are not allowed to produce compressed/short output */
 		Assert(!VARATT_IS_EXTENDED(attr));
 	}
@@ -170,7 +183,7 @@ detoast_attr(struct varlena *attr)
 		/*
 		 * This is a compressed value inside of the main tuple
 		 */
-		attr = toast_decompress_datum(attr);
+		attr = toast_decompress_datum(attr, ctx);
 	}
 	else if (VARATT_IS_SHORT(attr))
 	{
@@ -181,7 +194,7 @@ detoast_attr(struct varlena *attr)
 		Size		new_size = data_size + VARHDRSZ;
 		struct varlena *new_attr;
 
-		new_attr = (struct varlena *) palloc(new_size);
+		new_attr = (struct varlena *) MemoryContextAlloc(ctx, new_size);
 		SET_VARSIZE(new_attr, new_size);
 		memcpy(VARDATA(new_attr), VARDATA_SHORT(attr), data_size);
 		attr = new_attr;
@@ -190,6 +203,11 @@ detoast_attr(struct varlena *attr)
 	return attr;
 }
 
+struct varlena *
+detoast_attr(struct varlena *attr)
+{
+	return detoast_attr_ext(attr, CurrentMemoryContext);
+}
 
 /* ----------
  * detoast_attr_slice -
@@ -262,7 +280,7 @@ detoast_attr_slice(struct varlena *attr,
 			preslice = toast_fetch_datum_slice(attr, 0, max_size);
 		}
 		else
-			preslice = toast_fetch_datum(attr);
+			preslice = toast_fetch_datum(attr, CurrentMemoryContext);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -294,7 +312,7 @@ detoast_attr_slice(struct varlena *attr,
 		if (slicelimit >= 0)
 			preslice = toast_decompress_datum_slice(tmp, slicelimit);
 		else
-			preslice = toast_decompress_datum(tmp);
+			preslice = toast_decompress_datum(tmp, CurrentMemoryContext);
 
 		if (tmp != attr)
 			pfree(tmp);
@@ -340,7 +358,7 @@ detoast_attr_slice(struct varlena *attr,
  * ----------
  */
 static struct varlena *
-toast_fetch_datum(struct varlena *attr)
+toast_fetch_datum(struct varlena *attr, MemoryContext ctx)
 {
 	Relation	toastrel;
 	struct varlena *result;
@@ -355,7 +373,7 @@ toast_fetch_datum(struct varlena *attr)
 
 	attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
 
-	result = (struct varlena *) palloc(attrsize + VARHDRSZ);
+	result = (struct varlena *) MemoryContextAlloc(ctx, attrsize + VARHDRSZ);
 
 	if (VARATT_EXTERNAL_IS_COMPRESSED(toast_pointer))
 		SET_VARSIZE_COMPRESSED(result, attrsize + VARHDRSZ);
@@ -468,7 +486,7 @@ toast_fetch_datum_slice(struct varlena *attr, int32 sliceoffset,
  * Decompress a compressed version of a varlena datum
  */
 static struct varlena *
-toast_decompress_datum(struct varlena *attr)
+toast_decompress_datum(struct varlena *attr, MemoryContext ctx)
 {
 	ToastCompressionId cmid;
 
@@ -482,9 +500,9 @@ toast_decompress_datum(struct varlena *attr)
 	switch (cmid)
 	{
 		case TOAST_PGLZ_COMPRESSION_ID:
-			return pglz_decompress_datum(attr);
+			return pglz_decompress_datum(attr, ctx);
 		case TOAST_LZ4_COMPRESSION_ID:
-			return lz4_decompress_datum(attr);
+			return lz4_decompress_datum(attr, ctx);
 		default:
 			elog(ERROR, "invalid compression method id %d", cmid);
 			return NULL;		/* keep compiler quiet */
@@ -515,7 +533,7 @@ toast_decompress_datum_slice(struct varlena *attr, int32 slicelength)
 	 * more than the data's true decompressed size.
 	 */
 	if ((uint32) slicelength >= TOAST_COMPRESS_EXTSIZE(attr))
-		return toast_decompress_datum(attr);
+		return toast_decompress_datum(attr, CurrentMemoryContext);
 
 	/*
 	 * Fetch the compression method id stored in the compression header and
diff --git a/src/backend/access/common/toast_compression.c b/src/backend/access/common/toast_compression.c
index 52230f31c6..1189faf264 100644
--- a/src/backend/access/common/toast_compression.c
+++ b/src/backend/access/common/toast_compression.c
@@ -79,13 +79,13 @@ pglz_compress_datum(const struct varlena *value)
  * Decompress a varlena that was compressed using PGLZ.
  */
 struct varlena *
-pglz_decompress_datum(const struct varlena *value)
+pglz_decompress_datum(const struct varlena *value, MemoryContext ctx)
 {
 	struct varlena *result;
 	int32		rawsize;
 
 	/* allocate memory for the uncompressed data */
-	result = (struct varlena *) palloc(VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
+	result = (struct varlena *) MemoryContextAlloc(ctx, VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
 
 	/* decompress the data */
 	rawsize = pglz_decompress((char *) value + VARHDRSZ_COMPRESSED,
@@ -179,7 +179,7 @@ lz4_compress_datum(const struct varlena *value)
  * Decompress a varlena that was compressed using LZ4.
  */
 struct varlena *
-lz4_decompress_datum(const struct varlena *value)
+lz4_decompress_datum(const struct varlena *value, MemoryContext ctx)
 {
 #ifndef USE_LZ4
 	NO_LZ4_SUPPORT();
@@ -189,7 +189,7 @@ lz4_decompress_datum(const struct varlena *value)
 	struct varlena *result;
 
 	/* allocate memory for the uncompressed data */
-	result = (struct varlena *) palloc(VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
+	result = (struct varlena *) MemoryContextAlloc(ctx, VARDATA_COMPRESSED_GET_EXTSIZE(value) + VARHDRSZ);
 
 	/* decompress the data */
 	rawsize = LZ4_decompress_safe((char *) value + VARHDRSZ_COMPRESSED,
@@ -223,7 +223,7 @@ lz4_decompress_datum_slice(const struct varlena *value, int32 slicelength)
 
 	/* slice decompression not supported prior to 1.8.3 */
 	if (LZ4_versionNumber() < 10803)
-		return lz4_decompress_datum(value);
+		return lz4_decompress_datum(value, CurrentMemoryContext);
 
 	/* allocate memory for the uncompressed data */
 	result = (struct varlena *) palloc(slicelength + VARHDRSZ);
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index b9ebc827a7..7b4742bf8e 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -938,22 +938,76 @@ ExecInitExprRec(Expr *node, ExprState *state,
 				}
 				else
 				{
+					int			attnum;
+					Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 					/* regular user column */
 					scratch.d.var.attnum = variable->varattno - 1;
 					scratch.d.var.vartype = variable->vartype;
+					attnum = scratch.d.var.attnum;
+
 					switch (variable->varno)
 					{
 						case INNER_VAR:
-							scratch.opcode = EEOP_INNER_VAR;
+
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_INNER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+								elog(INFO,
+									 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+									 state->flags,
+									 plan->startup_cost,
+									 plan->total_cost,
+									 attnum);
+#endif
+							}
+							else
+							{
+								scratch.opcode = EEOP_INNER_VAR;
+							}
 							break;
 						case OUTER_VAR:
-							scratch.opcode = EEOP_OUTER_VAR;
+							if (is_join_plan(plan) &&
+								bms_is_member(attnum,
+											  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_OUTER_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 attnum);
+#endif
+							}
+							else
+								scratch.opcode = EEOP_OUTER_VAR;
 							break;
 
 							/* INDEX_VAR is handled by default case */
 
 						default:
-							scratch.opcode = EEOP_SCAN_VAR;
+							if (is_scan_plan(plan) && bms_is_member(
+																	attnum,
+																	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+							{
+								scratch.opcode = EEOP_SCAN_VAR_TOAST;
+#ifdef DEBUG_PRE_DETOAST_DATUM
+									elog(INFO,
+										 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+										 state->flags,
+										 plan->startup_cost,
+										 plan->total_cost,
+										 ((Scan *) plan)->scanrelid,
+										 attnum);
+#endif
+							}
+							else
+								scratch.opcode = EEOP_SCAN_VAR;
 							break;
 					}
 				}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 852186312c..df932dc7c1 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -165,6 +169,9 @@ static Datum ExecJustConst(ExprState *state, ExprContext *econtext, bool *isnull
 static Datum ExecJustInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignInnerVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignOuterVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustAssignScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull);
@@ -181,6 +188,42 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  ExprContext *aggcontext,
 															  int setno);
 static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+	if (!slot->tts_isnull[attnum] &&
+		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+	{
+		if (unlikely(slot->tts_data_mctx == NULL))
+			slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
+														  "tts_value_ctx",
+														  ALLOCSET_START_SMALL_SIZES);
+		slot->tts_values[attnum] = PointerGetDatum(detoast_attr_ext(
+													   (struct varlena *) slot->tts_values[attnum],
+													   /* save the detoast value to the given MemoryContext. */
+													   slot->tts_data_mctx));
+		Assert(slot->tts_nvalid > attnum);
+	}
+}
+
+/* JIT requires a non-static (and external?) function */
+void
+ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum)
+{
+	return ExecSlotDetoastDatum(slot, attnum);
+}
+
+
+static inline void
+ExecEvalToastVar(TupleTableSlot *slot,
+				 ExprEvalStep *op,
+				 int attnum)
+{
+	ExecSlotDetoastDatum(slot, attnum);
+
+	*op->resvalue = slot->tts_values[attnum];
+	*op->resnull = slot->tts_isnull[attnum];
+}
 
 /*
  * ScalarArrayOpExprHashEntry
@@ -296,6 +339,24 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVar;
 			return;
 		}
+		if (step0 == EEOP_INNER_FETCHSOME &&
+			step1 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_FETCHSOME &&
+				 step1 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_FETCHSOME &&
+				 step1 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarToast;
+			return;
+		}
 		else if (step0 == EEOP_INNER_FETCHSOME &&
 				 step1 == EEOP_ASSIGN_INNER_VAR)
 		{
@@ -346,6 +407,21 @@ ExecReadyInterpretedExpr(ExprState *state)
 			state->evalfunc_private = (void *) ExecJustScanVarVirt;
 			return;
 		}
+		else if (step0 == EEOP_INNER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustInnerVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_OUTER_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustOuterVarVirtToast;
+			return;
+		}
+		else if (step0 == EEOP_SCAN_VAR_TOAST)
+		{
+			state->evalfunc_private = (void *) ExecJustScanVarVirtToast;
+			return;
+		}
 		else if (step0 == EEOP_ASSIGN_INNER_VAR)
 		{
 			state->evalfunc_private = (void *) ExecJustAssignInnerVarVirt;
@@ -413,6 +489,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
+		&&CASE_EEOP_INNER_VAR_TOAST,
+		&&CASE_EEOP_OUTER_VAR_TOAST,
+		&&CASE_EEOP_SCAN_VAR_TOAST,
 		&&CASE_EEOP_INNER_SYSVAR,
 		&&CASE_EEOP_OUTER_SYSVAR,
 		&&CASE_EEOP_SCAN_SYSVAR,
@@ -601,6 +680,25 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
 			*op->resvalue = scanslot->tts_values[attnum];
 			*op->resnull = scanslot->tts_isnull[attnum];
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_INNER_VAR_TOAST)
+		{
+			ExecEvalToastVar(innerslot, op, op->d.var.attnum);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_VAR_TOAST)
+		{
+			ExecEvalToastVar(outerslot, op, op->d.var.attnum);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_VAR_TOAST)
+		{
+			ExecEvalToastVar(scanslot, op, op->d.var.attnum);
 
 			EEO_NEXT();
 		}
@@ -2171,6 +2269,42 @@ ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+static pg_attribute_always_inline Datum
+ExecJustVarImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[1];
+	int			attnum = op->d.var.attnum;
+
+	CheckOpSlotCompatibility(&state->steps[0], slot);
+
+	slot_getattr(slot, attnum + 1, isnull);
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Simple reference to inner Var */
+static Datum
+ExecJustInnerVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Simple reference to outer Var */
+static Datum
+ExecJustOuterVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Simple reference to scan Var */
+static Datum
+ExecJustScanVarToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)Var */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
@@ -2309,6 +2443,51 @@ ExecJustScanVarVirt(ExprState *state, ExprContext *econtext, bool *isnull)
 	return ExecJustVarVirtImpl(state, econtext->ecxt_scantuple, isnull);
 }
 
+/* implementation of ExecJust(Inner|Outer|Scan)VarVirt */
+static pg_attribute_always_inline Datum
+ExecJustVarVirtImplToast(ExprState *state, TupleTableSlot *slot, bool *isnull)
+{
+	ExprEvalStep *op = &state->steps[0];
+	int			attnum = op->d.var.attnum;
+
+	/*
+	 * As it is guaranteed that a virtual slot is used, there never is a need
+	 * to perform tuple deforming (nor would it be possible). Therefore
+	 * execExpr.c has not emitted an EEOP_*_FETCHSOME step. Verify, as much as
+	 * possible, that that determination was accurate.
+	 */
+	Assert(TTS_IS_VIRTUAL(slot));
+	Assert(TTS_FIXED(slot));
+	Assert(attnum >= 0 && attnum < slot->tts_nvalid);
+
+	*isnull = slot->tts_isnull[attnum];
+
+	ExecSlotDetoastDatum(slot, attnum);
+
+	return slot->tts_values[attnum];
+}
+
+/* Like ExecJustInnerVar, optimized for virtual slots */
+static Datum
+ExecJustInnerVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_innertuple, isnull);
+}
+
+/* Like ExecJustOuterVar, optimized for virtual slots */
+static Datum
+ExecJustOuterVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_outertuple, isnull);
+}
+
+/* Like ExecJustScanVar, optimized for virtual slots */
+static Datum
+ExecJustScanVarVirtToast(ExprState *state, ExprContext *econtext, bool *isnull)
+{
+	return ExecJustVarVirtImplToast(state, econtext->ecxt_scantuple, isnull);
+}
+
 /* implementation of ExecJustAssign(Inner|Outer|Scan)VarVirt */
 static pg_attribute_always_inline Datum
 ExecJustAssignVarVirtImpl(ExprState *state, TupleTableSlot *inslot, bool *isnull)
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 00dc339615..4b3e799574 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -80,6 +80,9 @@ static inline void tts_buffer_heap_store_tuple(TupleTableSlot *slot,
 											   bool transfer_pin);
 static void tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree);
 
+static Bitmapset *cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+											  TupleDesc tupleDesc,
+											  List *forbid_pre_detoast_vars);
 
 const TupleTableSlotOps TTSOpsVirtual;
 const TupleTableSlotOps TTSOpsHeapTuple;
@@ -432,6 +435,12 @@ tts_heap_materialize(TupleTableSlot *slot)
 	slot->tts_flags |= TTS_FLAG_SHOULDFREE;
 
 	MemoryContextSwitchTo(oldContext);
+
+	/*
+	 * tts_values is treated invalidated since tts_nvalid will is set to 0,
+	 * so let's free the pre-detoast datum.
+	 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -489,6 +498,9 @@ tts_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple, bool shouldFree)
 
 	tts_heap_clear(slot);
 
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	hslot->tuple = tuple;
 	hslot->off = 0;
@@ -596,6 +608,7 @@ tts_minimal_materialize(TupleTableSlot *slot)
 
 	oldContext = MemoryContextSwitchTo(slot->tts_mcxt);
 
+
 	/*
 	 * Have to deform from scratch, otherwise tts_values[] entries could point
 	 * into the non-materialized tuple (which might be gone when accessed).
@@ -628,6 +641,9 @@ tts_minimal_materialize(TupleTableSlot *slot)
 	mslot->minhdr.t_data = (HeapTupleHeader) ((char *) mslot->mintuple - MINIMAL_TUPLE_OFFSET);
 
 	MemoryContextSwitchTo(oldContext);
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 }
 
 static void
@@ -687,6 +703,9 @@ tts_minimal_store_tuple(TupleTableSlot *slot, MinimalTuple mtup, bool shouldFree
 	Assert(TTS_EMPTY(slot));
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
 	slot->tts_nvalid = 0;
 	mslot->off = 0;
 
@@ -817,6 +836,10 @@ tts_buffer_heap_materialize(TupleTableSlot *slot)
 	 * into the non-materialized tuple (which might be gone when accessed).
 	 */
 	bslot->base.off = 0;
+
+	/* slot_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 
 	if (!bslot->base.tuple)
@@ -954,6 +977,10 @@ tts_buffer_heap_store_tuple(TupleTableSlot *slot, HeapTuple tuple,
 	}
 
 	slot->tts_flags &= ~TTS_FLAG_EMPTY;
+
+	/* tts_nvalid = 0 */
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_nvalid = 0;
 	bslot->base.tuple = tuple;
 	bslot->base.off = 0;
@@ -1225,6 +1252,7 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
 		slot->tts_flags |= TTS_FLAG_FIXED;
 	slot->tts_tupleDescriptor = tupleDesc;
 	slot->tts_mcxt = CurrentMemoryContext;
+	slot->tts_data_mctx = NULL;
 	slot->tts_nvalid = 0;
 
 	if (tupleDesc != NULL)
@@ -1303,6 +1331,8 @@ ExecResetTupleTable(List *tupleTable,	/* tuple table */
 				if (slot->tts_isnull)
 					pfree(slot->tts_isnull);
 			}
+			if (slot->tts_data_mctx != NULL)
+				MemoryContextDelete(slot->tts_data_mctx);
 			pfree(slot);
 		}
 	}
@@ -1353,6 +1383,9 @@ ExecDropSingleTupleTableSlot(TupleTableSlot *slot)
 		if (slot->tts_isnull)
 			pfree(slot->tts_isnull);
 	}
+	if (slot->tts_data_mctx != NULL)
+		MemoryContextDelete(slot->tts_data_mctx);
+
 	pfree(slot);
 }
 
@@ -1898,12 +1931,26 @@ void
 ExecInitScanTupleSlot(EState *estate, ScanState *scanstate,
 					  TupleDesc tupledesc, const TupleTableSlotOps *tts_ops)
 {
+	Scan	   *splan = (Scan *) scanstate->ps.plan;
+
 	scanstate->ss_ScanTupleSlot = ExecAllocTableSlot(&estate->es_tupleTable,
 													 tupledesc, tts_ops);
 	scanstate->ps.scandesc = tupledesc;
 	scanstate->ps.scanopsfixed = tupledesc != NULL;
 	scanstate->ps.scanops = tts_ops;
 	scanstate->ps.scanopsset = true;
+
+	if (is_scan_plan((Plan *) splan))
+	{
+		/*
+		 * We may run detoast in Qual or Projection, but all of them happen at
+		 * the ss_ScanTupleSlot rather than ps_ResultTupleSlot. So we can only
+		 * take care of the ss_ScanTupleSlot.
+		 */
+		scanstate->scan_pre_detoast_attrs = cal_final_pre_detoast_attrs(splan->reference_attrs,
+																		tupledesc,
+																		splan->plan.forbid_pre_detoast_vars);
+	}
 }
 
 /* ----------------
@@ -2424,3 +2471,87 @@ end_tup_output(TupOutputState *tstate)
 	ExecDropSingleTupleTableSlot(tstate->slot);
 	pfree(tstate);
 }
+
+/*
+ * cal_final_pre_detoast_attrs
+ *		Calculate the final attributes which pre-detoast be helpful.
+ *
+ * reference_attrs: the attributes which will be detoast at this plan level.
+ * due to the implementation issue, some non-toast attribute may be included
+ * in reference_attrs, which should be filtered out with tupleDesc.
+ *
+ * forbid_pre_detoast_vars: the vars which should not be pre-detoast as the
+ * small_tlist reason.
+ */
+static Bitmapset *
+cal_final_pre_detoast_attrs(Bitmapset *reference_attrs,
+							TupleDesc tupleDesc,
+							List *forbid_pre_detoast_vars)
+{
+	Bitmapset  *final = NULL,
+			   *toast_attrs = NULL,
+			   *forbid_pre_detoast_attrs = NULL;
+
+	int			i;
+	ListCell   *lc;
+
+	if (bms_is_empty(reference_attrs))
+		return NULL;
+
+	/*
+	 * there is no exact data type in create_plan or set_plan_refs stage, so
+	 * reference_attrs may have some attribute which is not toast attrs at
+	 * all, which should be removed.
+	 */
+	i = -1;
+	while ((i = bms_next_member(reference_attrs, i)) >= 0)
+	{
+		Form_pg_attribute attr = TupleDescAttr(tupleDesc, i);
+		if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
+			toast_attrs = bms_add_member(toast_attrs, attr->attnum - 1);
+	}
+
+	/* Filter out the non-toastable attributes. */
+	final = bms_intersect(reference_attrs, toast_attrs);
+
+	if (!bms_is_empty(final))
+	{
+		/*
+		 * Due to the fact of detoast-datum will make the tuple bigger which is
+		 * bad for some nodes like Sort/Hash, to avoid performance regression,
+		 * such attribute should be removed as well.
+		 */
+		foreach(lc, forbid_pre_detoast_vars)
+		{
+			Var		   *var = lfirst_node(Var, lc);
+
+			forbid_pre_detoast_attrs = bms_add_member(forbid_pre_detoast_attrs,
+													  var->varattno - 1);
+		}
+	}
+
+	final = bms_del_members(final, forbid_pre_detoast_attrs);
+
+	bms_free(toast_attrs);
+	bms_free(forbid_pre_detoast_attrs);
+
+	return final;
+}
+
+
+void
+SetPredetoastAttrsForJoin(JoinState *j)
+{
+	PlanState  *outerstate = outerPlanState(j);
+	PlanState  *innerstate = innerPlanState(j);
+
+	j->outer_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->outer_reference_attrs,
+															 outerstate->ps_ResultTupleDesc,
+															 outerstate->plan->forbid_pre_detoast_vars);
+
+	j->inner_pre_detoast_attrs = cal_final_pre_detoast_attrs(
+															 ((Join *) j->ps.plan)->inner_reference_attrs,
+															 innerstate->ps_ResultTupleDesc,
+															 innerstate->plan->forbid_pre_detoast_vars);
+}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 5737f9f4eb..14be69bfec 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -567,6 +567,8 @@ ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
 		planstate->resultopsset = planstate->scanopsset;
 		planstate->resultopsfixed = planstate->scanopsfixed;
 		planstate->resultops = planstate->scanops;
+
+		Assert(planstate->ps_ResultTupleDesc != NULL);
 	}
 	else
 	{
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index dbf114cd5e..dec99cb819 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -755,6 +755,8 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
 	innerDesc = ExecGetResultType(innerPlanState(hjstate));
 
+	SetPredetoastAttrsForJoin((JoinState *) hjstate);
+
 	/*
 	 * Initialize result slot, type and projection.
 	 */
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 4fb34e3537..a4091ffa1a 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1496,6 +1496,8 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 											  (eflags | EXEC_FLAG_MARK));
 	innerDesc = ExecGetResultType(innerPlanState(mergestate));
 
+	SetPredetoastAttrsForJoin((JoinState *) mergestate);
+
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
 	 * MARK every time we advance past an inner tuple we will never return to.
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 7f4bf6c4db..d1dfaaab6b 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -305,6 +305,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 */
 	ExecInitResultTupleSlotTL(&nlstate->js.ps, &TTSOpsVirtual);
 	ExecAssignProjectionInfo(&nlstate->js.ps, NULL);
+	SetPredetoastAttrsForJoin((JoinState *) nlstate);
 
 	/*
 	 * initialize child expressions
diff --git a/src/backend/executor/shared_detoast_datum.org b/src/backend/executor/shared_detoast_datum.org
new file mode 100644
index 0000000000..2836349b63
--- /dev/null
+++ b/src/backend/executor/shared_detoast_datum.org
@@ -0,0 +1,264 @@
+The problem:
+-------------
+
+In the current expression engine, a toasted datum is detoasted when
+required, but the result is discarded immediately, either by pfree it
+immediately or leave it for ResetExprContext. Arguments for which one to
+use exists sometimes. More serious problem is detoasting is expensive,
+especially for the data types like jsonb or array, which the value might
+be very huge. In the blow example, the detoasting happens twice.
+
+SELECT jb_col->'a', jb_col->'b' FROM t;
+
+Within the shared-detoast-datum, we just need to detoast once for each
+tuple, and discard it immediately when the tuple is not needed any
+more. FWIW this issue may existing for small numeric, text as well
+because of SHORT_TOAST feature where the toast's len using 1 byte rather
+than 4 bytes.
+
+Current Design
+--------------
+
+The high level design is let createplan.c and setref.c decide which
+Vars can use this feature, and then the executor save the detoast
+datum back slot->to tts_values[*] during the ExprEvalStep of
+EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons includes:
+
+- The existing expression engine read datum from tts_values[*], no any
+  extra work need to be done.
+- Reuse the lifespan of TupleTableSlot system to manage memory. It
+  is natural to think the detoast datum is a tts_value just that it is
+  in a detoast format. Since we have a clear lifespan for TupleTableSlot
+  already, like ExecClearTuple, ExecCopySlot et. We are easy to reuse
+  them for managing the datoast datum's memory.
+- The existing projection method can copy the datoasted datum (int64)
+  automatically to the next node's slot, but keeping the ownership
+  unchanged, so only the slot where the detoast really happen take the
+  charge of it's lifespan.
+
+Assuming which Var should use this feature has been decided in
+createplan.c and setref.c already. The following strategy is used to
+reduce the run time overhead.
+
+1. Putting the detoast datum into tts->tts_values[*]. then the ExprEval
+   system doesn't need any extra effort to find them.
+
+static inline void
+ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
+{
+    if (!slot->tts_isnull[attnum] &&
+        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
+    {
+        if (unlikely(slot->tts_data_mctx == NULL))
+            slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
+                                     "tts_value_ctx",
+                                    ALLOCSET_START_SMALL_SIZES);
+        slot->tts_values[attnum] = PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
+                                                                 /* save the detoast value to the given MemoryContext. */
+                                        slot->tts_data_mctx));
+    }
+}
+
+2. Using a dedicated lazy created memory context under TupleTableSlot so
+   that the memory can be released effectively.
+
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+    if (slot->tts_data_mctx)
+        MemoryContextResetOnly(slot->tts_data_mctx);
+}
+
+3. Control the memory usage by releasing them whenever the
+   slot->tts_values[*] is deemed as invalidate, so the lifespan is
+   likely short.
+
+4. Reduce the run-time overhead for checking if a VAR should use
+   pre-detoast feature, 3 new steps are introduced.
+   EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, so that other Var is totally non
+   related.
+
+Now comes to the createplan.c/setref.c part, which decides which Vars
+should use the shared detoast feature. The guideline of this is:
+
+1. It needs a detoast for a given expression in the previous logic.
+2. It should not breaks the CP_SMALL_TLIST design. Since we saved the
+   detoast datum back to tts_values[*], which make tuple bigger. if we
+   do this blindly, it would be harmful to the ORDER / HASH style nodes.
+
+A high level data flow is:
+
+1. at the createplan.c, we walk the plan tree go gather the
+   CP_SMALL_TLIST because of SORT/HASH style nodes, information and save
+   it to Plan.forbid_pre_detoast_vars via the function
+   set_plan_forbid_pre_detoast_vars_recurse.
+
+2. at the setrefs.c, fix_{scan|join}_expr will recurse to Var for each
+   expression, so it is a good time to track the attribute number and
+   see if the Var is directly or indirectly accessed. Usually the
+   indirectly access a Var means a detoast would happens, for
+   example an expression like a > 3. However some known expressions is
+   ignored. for example: NullTest, pg_column_compression which needs the
+   raw datum, start_with/sub_string which needs a slice
+   detoasting. Currently there is some hard code here, we may needs a
+   pg_proc.detoasting_requirement flags to make this generic.  The
+   output is {Scan|Join}.xxx_reference_attrs;
+
+Note that here I used '_reference_' rather than '_detoast_' is because
+at this part, I still don't know if it is a toastable attrbiute, which
+is known at the MakeTupleTableSlot stage.
+
+3. At the InitPlan Stage, we calculate the final xxx_pre_detoast_attrs
+   in ScanState & JoinState, which will be passed into expression
+   engine in the ExecInitExprRec stage and EEOP_{INNER|OUTER|SCAN}
+   _VAR_TOAST steps are generated finally then everything is connected
+   with ExecSlotDetoastDatum!
+
+
+Testing
+-------
+
+Case 1: small numeric testing.
+===============================
+
+create table t (a numeric);
+insert into t select i from generate_series(1, 100000)i;
+
+cat 1.sql
+
+select * from t where a > 0;
+
+In this test, the current master run detoast twice for each datum. one
+in numeric_gt,  one in numeric_out.  this feature makes the detoast once.
+
+pgbench -f 1.sql -n postgres -T 10 -M prepared
+
+master: 30.218 ms
+patched: 26.957 ms
+
+
+Case 2: Big jsonbs test:
+=============================
+
+create table b(blog jsonb);
+
+INSERT INTO b
+SELECT jsonb_build_object(
+        'title', 'title ' || s.i::text,
+        'content', substring(repeat(md5(random()::text), 100), 1, 3000),
+        'subscriber', (random() * 100)::int,
+        'reader', (random() * 100)::int
+    )
+FROM generate_series(1, 10000) s(i);
+
+
+explain analyze
+select blog from b
+where cast(blog->'reader' as numeric) > 10 and
+cast(blog->'subscriber' as numeric) > 0;
+
+Dump and restore the above data into the current master:
+
+master: 24.588 ms
+patched: 17.664 ms
+
+Memory usage test:
+
+I run the workload of tpch scale 10 on against both master and patched
+versions, the memory usage looks stable and the performance doesn't have
+noticeable improvement and regression as well.
+
+A alternative design: toast cache
+---------------------------------
+
+This method is provided by Tomas during the review process. IIUC, this
+method would maintain a local HTAB which map a toast datum to a detoast
+datum and the entry is maintained / used in detoast_attr
+function. Within this method, the overall design is pretty clear and the
+code modification can be controlled in toasting system only.
+
+Comparison of advantages and disadvantages of the two schemes
+-------------------------------------------------------------
+
+advantage of Toast Cache:
+=========================
+
+- The modification can be controlled in the toasting system only.
+
+disadvantages of Toast Cache:
+=============================
+- As a cache system for user data, the cache is easy to be used up,
+  hence the memory usage is a key problem.
+- cache lookup and cache evict produces more overhead.
+
+advantage of current design:
+============================
+- the runtime overhead is pretty small.
+- the memory usage is easier to control since it is released whenever
+  the slot->tts_values[*] is invalidated.
+
+
+disadvantage of current design:
+================================
+- It modifies createplan.c & setref.c & execExpr.c & execExprInterp.c &
+  execTuples*.c, llvmjit_*.c and so on, I admit it is kinda of invasive,
+  but it is done with some clear reasons.
+
+- It assumes any_foo(toast_col) will detoast the whole datum totally,
+  however some functions like 'starts_with' just need to detoast a slice
+  of datum. we are lacking of a generic way to know this so it is hard
+  coded so far.
+
+
+Bad case testing of current design:
+====================================
+
+- The extra effort of createplan.c & setref.c & execExpr.c & InitPlan :
+
+CREATE TABLE t(a int, b numeric);
+
+q1:
+explain select * from t where a > 3;
+
+q2:
+set enable_hashjoin to off;
+set enable_mergejoin to off;
+set enable_nestloop to on;
+
+explain select * from t join t t2 using(a);
+
+q3:
+set enable_hashjoin to off;
+set enable_mergejoin to on;
+set enable_nestloop to off;
+explain select * from t join t t2 using(a);
+
+
+q4:
+set enable_hashjoin to on;
+set enable_mergejoin to off;
+set enable_nestloop to off;
+explain select * from t join t t2 using(a);
+
+
+pgbench -f x.sql -n postgres  -M simple -T10
+
+| query ID | patched (ms)                        | master (ms)                         | regression(%) |
+|----------+-------------------------------------+-------------------------------------+---------------|
+|        1 | {0.088, 0.088, 0.088, 0.087, 0.089} | {0.111, 0.103, 0.103, 0.099, 0.101} |            3% |
+|        2 |                                     |                                     |               |
+|        3 |                                     |                                     |               |
+|        4 |                                     |                                     |               |
+
+
+- The extra effort of ExecInterpExpr
+
+insert into t select i, i FROM generate_series(1, 100000) i;
+
+SELECT count(b) from t WHERE b > 0;
+
+In this query, the extra execution code is run but it doesn't buy anything.
+
+| master | patched | regression |
+|--------+---------+------------|
+|        |         |            |
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 9e0efd2668..4ed7ed8c67 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -396,30 +396,52 @@ llvm_compile_expr(ExprState *state)
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
+			case EEOP_INNER_VAR_TOAST:
+			case EEOP_OUTER_VAR_TOAST:
+			case EEOP_SCAN_VAR_TOAST:
 				{
 					LLVMValueRef value,
 								isnull;
 					LLVMValueRef v_attnum;
 					LLVMValueRef v_values;
 					LLVMValueRef v_nulls;
+					LLVMValueRef v_slot;
 
-					if (opcode == EEOP_INNER_VAR)
+					if (opcode == EEOP_INNER_VAR || opcode == EEOP_INNER_VAR_TOAST)
 					{
+						v_slot = v_innerslot;
 						v_values = v_innervalues;
 						v_nulls = v_innernulls;
 					}
-					else if (opcode == EEOP_OUTER_VAR)
+					else if (opcode == EEOP_OUTER_VAR || opcode == EEOP_OUTER_VAR_TOAST)
 					{
+						v_slot = v_outerslot;
 						v_values = v_outervalues;
 						v_nulls = v_outernulls;
 					}
 					else
 					{
+						v_slot = v_scanslot;
 						v_values = v_scanvalues;
 						v_nulls = v_scannulls;
 					}
 
 					v_attnum = l_int32_const(lc, op->d.var.attnum);
+
+					if (opcode == EEOP_INNER_VAR_TOAST ||
+						opcode == EEOP_OUTER_VAR_TOAST ||
+						opcode == EEOP_SCAN_VAR_TOAST)
+					{
+						LLVMValueRef params[2];
+
+						params[0] = v_slot;
+						params[1] = l_int32_const(lc, op->d.var.attnum);
+						l_call(b,
+							   llvm_pg_var_func_type("ExecSlotDetoastDatumExternal"),
+							   llvm_pg_func(mod, "ExecSlotDetoastDatumExternal"),
+							   params, lengthof(params), "");
+					}
+
 					value = l_load_gep1(b, TypeSizeT, v_values, v_attnum, "");
 					isnull = l_load_gep1(b, TypeStorageBool, v_nulls, v_attnum, "");
 					LLVMBuildStore(b, value, v_resvaluep);
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index f93c383fd5..540fa09dc1 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -182,4 +182,5 @@ void	   *referenced_functions[] =
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecSlotDetoastDatumExternal,
 };
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 6b64c4a362..9ffa69225b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -316,7 +316,9 @@ static ModifyTable *make_modifytable(PlannerInfo *root, Plan *subplan,
 									 int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 											 GatherMergePath *best_path);
-
+static void set_plan_forbid_pre_detoast_vars_recurse(Plan *plan,
+													 List *small_tlist);
+static void set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist);
 
 /*
  * create_plan
@@ -348,6 +350,12 @@ create_plan(PlannerInfo *root, Path *best_path)
 	/* Recursively process the path tree, demanding the correct tlist result */
 	plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);
 
+	/*
+	 * After the plan tree is built completed, we start to walk for which
+	 * expressions should not used the shared-detoast feature.
+	 */
+	set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);
+
 	/*
 	 * Make sure the topmost plan node's targetlist exposes the original
 	 * column names and other decorative info.  Targetlists generated within
@@ -380,6 +388,101 @@ create_plan(PlannerInfo *root, Path *best_path)
 	return plan;
 }
 
+/*
+ * set_plan_forbid_pre_detoast_vars_recurse
+ *	 Walking the Plan tree in the top-down manner to gather the vars which
+ * should be as small as possible and record them in Plan.forbid_pre_detoast_vars
+ *
+ * plan: the plan node to walk right now.
+ * small_tlist: a list of nodes which its subplan should provide them as
+ * small as possible.
+ */
+static void
+set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)
+{
+	if (plan == NULL)
+		return;
+
+	set_plan_not_pre_detoast_vars(plan, small_tlist);
+
+	/* Recurse to its subplan.. */
+	if (IsA(plan, Sort) || IsA(plan, Memoize) || IsA(plan, WindowAgg) ||
+		IsA(plan, Hash) || IsA(plan, Material) || IsA(plan, IncrementalSort))
+	{
+		List	   *sub_small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * For the sort-like nodes, we want the output of its subplan as small
+		 * as possible, but the subplan's other expressions like Qual doesn't
+		 * have this restriction since they are not output to the upper nodes.
+		 * so we set the small_tlist to the subplan->targetlist.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, sub_small_tlist);
+	}
+	else if (IsA(plan, HashJoin) && castNode(HashJoin, plan)->left_small_tlist)
+	{
+		List	   *sub_small_tlist = get_tlist_exprs(plan->lefttree->targetlist, true);
+
+		/*
+		 * If the left_small_tlist wants a as small as possible tlist, set it
+		 * in a way like sort for the left node.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, sub_small_tlist);
+
+		/*
+		 * The righttree is a Hash node, it can be set with its own rule, so
+		 * the small_tlist provided is not important, we just need to recuse
+		 * to its subplan.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+	else
+	{
+		/*
+		 * Recurse to its children, just push down the forbid_pre_detoast_vars
+		 * to its children.
+		 */
+		set_plan_forbid_pre_detoast_vars_recurse(plan->lefttree, plan->forbid_pre_detoast_vars);
+		set_plan_forbid_pre_detoast_vars_recurse(plan->righttree, plan->forbid_pre_detoast_vars);
+	}
+}
+
+/*
+ * set_plan_not_pre_detoast_vars
+ *
+ *	Set the Plan.forbid_pre_detoast_vars according the small_tlist information.
+ *
+ * small_tlist = NIL means nothing is forbidden, or else if a Var belongs to the
+ * small_tlist, then it must not be pre-detoasted.
+ */
+static void
+set_plan_not_pre_detoast_vars(Plan *plan, List *small_tlist)
+{
+	ListCell   *lc;
+	Var		   *var;
+
+	/*
+	 * fast path, if we don't have a small_tlist, the var in targetlist is
+	 * impossible member of it. and this case might be a pretty common case.
+	 */
+	if (small_tlist == NIL)
+		return;
+
+	foreach(lc, plan->targetlist)
+	{
+		TargetEntry *te = lfirst_node(TargetEntry, lc);
+
+		if (!IsA(te->expr, Var))
+			continue;
+		var = castNode(Var, te->expr);
+		if (var->varattno <= 0)
+			continue;
+		if (list_member(small_tlist, var))
+			/* pass the recheck */
+			plan->forbid_pre_detoast_vars = lappend(plan->forbid_pre_detoast_vars, var);
+	}
+}
+
 /*
  * create_plan_recurse
  *	  Recursive guts of create_plan().
@@ -4912,6 +5015,8 @@ create_hashjoin_plan(PlannerInfo *root,
 
 	copy_generic_path_info(&join_plan->join.plan, &best_path->jpath.path);
 
+	join_plan->left_small_tlist = (best_path->num_batches > 1);
+
 	return join_plan;
 }
 
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 37abcb4701..3b609a8796 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -27,6 +27,8 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_relation.h"
 #include "tcop/utility.h"
+#include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
 
@@ -54,11 +56,47 @@ typedef struct
 	tlist_vinfo vars[FLEXIBLE_ARRAY_MEMBER];	/* has num_vars entries */
 } indexed_tlist;
 
+/*
+ * Decide which attrs are detoasted in a expressions level, this is judged
+ * at the fix_scan/join_expr stage. The recursed level is tracked when we
+ * walk to a Var, if the level is greater than 1, then it means the
+ * var needs an detoast in this expression list, there are some exceptions
+ * here, see increase_level_for_pre_detoast for details.
+ */
+typedef struct
+{
+	/* if the level is added during a certain walk. */
+	bool		level_added;
+	/* the current level during the walk. */
+	int			level;
+} intermediate_level_context;
+
+/*
+ * Context to hold the detoast attribute within a expression.
+ *
+ * XXX: this design was intent to avoid the pre-detoast-logic if the var
+ * only need to be detoasted *once*, but for now, this context is only
+ * maintained at the expression level rather than plan tree level, so it
+ * can't detect if a Var will be detoasted 2+ time at the plan level.
+ * Recording the times of a Var is detoasted in the plan tree level is
+ * complex.
+ */
+typedef struct
+{
+	/* var is accessed for the first time. */
+	Bitmapset  *existing_attrs; /* not used so far */
+	/* var is accessed for the 2+ times. */
+	Bitmapset **final_ref_attrs;
+} intermediate_var_ref_context;
+
+
 typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context scan_reference_attrs;
 } fix_scan_expr_context;
 
 typedef struct
@@ -70,6 +108,9 @@ typedef struct
 	int			rtoffset;
 	NullingRelsMatch nrm_match;
 	double		num_exec;
+	intermediate_level_context level_ctx;
+	intermediate_var_ref_context outer_reference_attrs;
+	intermediate_var_ref_context inner_reference_attrs;
 } fix_join_expr_context;
 
 typedef struct
@@ -126,8 +167,8 @@ typedef struct
 	(((con)->consttype == REGCLASSOID || (con)->consttype == OIDOID) && \
 	 !(con)->constisnull)
 
-#define fix_scan_list(root, lst, rtoffset, num_exec) \
-	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec))
+#define fix_scan_list(root, lst, rtoffset, num_exec, pre_detoast_attrs) \
+	((List *) fix_scan_expr(root, (Node *) (lst), rtoffset, num_exec, pre_detoast_attrs))
 
 static void add_rtes_to_flat_rtable(PlannerInfo *root, bool recursing);
 static void flatten_unplanned_rtes(PlannerGlobal *glob, RangeTblEntry *rte);
@@ -157,7 +198,8 @@ static Plan *set_mergeappend_references(PlannerInfo *root,
 static void set_hash_references(PlannerInfo *root, Plan *plan, int rtoffset);
 static Relids offset_relid_set(Relids relids, int rtoffset);
 static Node *fix_scan_expr(PlannerInfo *root, Node *node,
-						   int rtoffset, double num_exec);
+						   int rtoffset, double num_exec,
+						   Bitmapset **scan_reference_attrs);
 static Node *fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context);
 static bool fix_scan_expr_walker(Node *node, fix_scan_expr_context *context);
 static void set_join_references(PlannerInfo *root, Join *join, int rtoffset);
@@ -189,7 +231,10 @@ static List *fix_join_expr(PlannerInfo *root,
 						   Index acceptable_rel,
 						   int rtoffset,
 						   NullingRelsMatch nrm_match,
-						   double num_exec);
+						   double num_exec,
+						   Bitmapset **outer_reference_attrs,
+						   Bitmapset **inner_reference_attrs);
+
 static Node *fix_join_expr_mutator(Node *node,
 								   fix_join_expr_context *context);
 static Node *fix_upper_expr(PlannerInfo *root,
@@ -210,6 +255,38 @@ static List *set_windowagg_runcondition_references(PlannerInfo *root,
 												   List *runcondition,
 												   Plan *plan);
 
+/*
+ * func_use_slice_detoast
+ *   Check if a func just needs a pg_detoast_datum_slice, if so we should
+ * not pre detoast it. For now, the known function ID is hard-coded. but
+ * it'd be good that pg_proc can have a attribute like 'detoast_requirement'
+ * the value can be either of:
+ * - full
+ * - first_n (0 ~ N, the current slice method).
+ * - any. (for the incoming of partial detoast feature.
+ *
+ * I think adding this attribute to pg_proc has a stronger reason if partial
+ * detoast patch is accepted.
+ */
+static inline bool
+func_use_slice_detoast(Oid funcOid)
+{
+	/* hard code for now, and it is not used in a hot path yet. */
+	const Oid oids[] = {F_STARTS_WITH,
+
+						F_OVERLAY_BYTEA_BYTEA_INT4, F_OVERLAY_BYTEA_BYTEA_INT4_INT4,
+						F_OVERLAY_TEXT_TEXT_INT4, F_OVERLAY_TEXT_TEXT_INT4_INT4,
+
+						F_SUBSTRING_BYTEA_INT4, F_SUBSTRING_BYTEA_INT4_INT4,
+						F_SUBSTRING_TEXT_INT4, F_SUBSTRING_TEXT_INT4_INT4};
+
+	for (int i = 0; i < sizeof(oids) /sizeof(Oid); i++)
+	{
+		if (funcOid == oids[i])
+			return true;
+	}
+	return false;
+}
 
 /*****************************************************************************
  *
@@ -627,10 +704,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_SampleScan:
@@ -640,13 +723,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs
+					);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tablesample = (TableSampleClause *)
 					fix_scan_expr(root, (Node *) splan->tablesample,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexScan:
@@ -656,28 +746,40 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->indexqual =
 					fix_scan_list(root, splan->indexqual,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->indexorderby =
 					fix_scan_list(root, splan->indexorderby,
-								  rtoffset, 1);
+								  rtoffset, 1, &splan->scan.reference_attrs);
 				splan->indexorderbyorig =
 					fix_scan_list(root, splan->indexorderbyorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan), &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_IndexOnlyScan:
 			{
 				IndexOnlyScan *splan = (IndexOnlyScan *) plan;
 
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
+
 				return set_indexonlyscan_references(root, splan, rtoffset);
 			}
 			break;
@@ -690,10 +792,15 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->scan.plan.targetlist == NIL);
 				Assert(splan->scan.plan.qual == NIL);
 				splan->indexqual =
-					fix_scan_list(root, splan->indexqual, rtoffset, 1);
+					fix_scan_list(root, splan->indexqual, rtoffset, 1,
+								  &splan->scan.reference_attrs);
 				splan->indexqualorig =
 					fix_scan_list(root, splan->indexqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_BitmapHeapScan:
@@ -703,13 +810,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->bitmapqualorig =
 					fix_scan_list(root, splan->bitmapqualorig,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidScan:
@@ -719,13 +833,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidquals =
 					fix_scan_list(root, splan->tidquals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_TidRangeScan:
@@ -735,13 +856,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->tidrangequals =
 					fix_scan_list(root, splan->tidrangequals,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_SubqueryScan:
@@ -756,12 +884,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->functions =
-					fix_scan_list(root, splan->functions, rtoffset, 1);
+					fix_scan_list(root, splan->functions, rtoffset, 1,
+								  &splan->scan.reference_attrs);
+
 			}
 			break;
 		case T_TableFuncScan:
@@ -771,13 +903,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+
 				splan->tablefunc = (TableFunc *)
 					fix_scan_expr(root, (Node *) splan->tablefunc,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ValuesScan:
@@ -787,13 +923,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 				splan->values_lists =
 					fix_scan_list(root, splan->values_lists,
-								  rtoffset, 1);
+								  rtoffset, 1,
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_CteScan:
@@ -803,10 +942,16 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
+				splan->scan.plan.forbid_pre_detoast_vars =
+					fix_scan_list(root, splan->scan.plan.forbid_pre_detoast_vars,
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  NULL);
 			}
 			break;
 		case T_NamedTuplestoreScan:
@@ -816,10 +961,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_WorkTableScan:
@@ -829,10 +976,12 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				splan->scan.scanrelid += rtoffset;
 				splan->scan.plan.targetlist =
 					fix_scan_list(root, splan->scan.plan.targetlist,
-								  rtoffset, NUM_EXEC_TLIST(plan));
+								  rtoffset, NUM_EXEC_TLIST(plan),
+								  &splan->scan.reference_attrs);
 				splan->scan.plan.qual =
 					fix_scan_list(root, splan->scan.plan.qual,
-								  rtoffset, NUM_EXEC_QUAL(plan));
+								  rtoffset, NUM_EXEC_QUAL(plan),
+								  &splan->scan.reference_attrs);
 			}
 			break;
 		case T_ForeignScan:
@@ -872,7 +1021,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				mplan->param_exprs = fix_scan_list(root, mplan->param_exprs,
 												   rtoffset,
-												   NUM_EXEC_TLIST(plan));
+												   NUM_EXEC_TLIST(plan),
+												   NULL);
 				break;
 			}
 
@@ -932,9 +1082,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				Assert(splan->plan.qual == NIL);
 
 				splan->limitOffset =
-					fix_scan_expr(root, splan->limitOffset, rtoffset, 1);
+					fix_scan_expr(root, splan->limitOffset, rtoffset, 1, NULL);
 				splan->limitCount =
-					fix_scan_expr(root, splan->limitCount, rtoffset, 1);
+					fix_scan_expr(root, splan->limitCount, rtoffset, 1, NULL);
 			}
 			break;
 		case T_Agg:
@@ -987,17 +1137,17 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 				 * variable refs, so fix_scan_expr works for them.
 				 */
 				wplan->startOffset =
-					fix_scan_expr(root, wplan->startOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->startOffset, rtoffset, 1, NULL);
 				wplan->endOffset =
-					fix_scan_expr(root, wplan->endOffset, rtoffset, 1);
+					fix_scan_expr(root, wplan->endOffset, rtoffset, 1, NULL);
 				wplan->runCondition = fix_scan_list(root,
 													wplan->runCondition,
 													rtoffset,
-													NUM_EXEC_TLIST(plan));
+													NUM_EXEC_TLIST(plan), NULL);
 				wplan->runConditionOrig = fix_scan_list(root,
 														wplan->runConditionOrig,
 														rtoffset,
-														NUM_EXEC_TLIST(plan));
+														NUM_EXEC_TLIST(plan), NULL);
 			}
 			break;
 		case T_Result:
@@ -1037,14 +1187,14 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 					splan->plan.targetlist =
 						fix_scan_list(root, splan->plan.targetlist,
-									  rtoffset, NUM_EXEC_TLIST(plan));
+									  rtoffset, NUM_EXEC_TLIST(plan), NULL);
 					splan->plan.qual =
 						fix_scan_list(root, splan->plan.qual,
-									  rtoffset, NUM_EXEC_QUAL(plan));
+									  rtoffset, NUM_EXEC_QUAL(plan), NULL);
 				}
 				/* resconstantqual can't contain any subplan variable refs */
 				splan->resconstantqual =
-					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1);
+					fix_scan_expr(root, splan->resconstantqual, rtoffset, 1, NULL);
 			}
 			break;
 		case T_ProjectSet:
@@ -1060,7 +1210,7 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 
 				splan->withCheckOptionLists =
 					fix_scan_list(root, splan->withCheckOptionLists,
-								  rtoffset, 1);
+								  rtoffset, 1, NULL);
 
 				if (splan->returningLists)
 				{
@@ -1117,18 +1267,20 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 						fix_join_expr(root, splan->onConflictSet,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					splan->onConflictWhere = (Node *)
 						fix_join_expr(root, (List *) splan->onConflictWhere,
 									  NULL, itlist,
 									  linitial_int(splan->resultRelations),
-									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan));
+									  rtoffset, NRM_EQUAL, NUM_EXEC_QUAL(plan),
+									  NULL, NULL);
 
 					pfree(itlist);
 
 					splan->exclRelTlist =
-						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1);
+						fix_scan_list(root, splan->exclRelTlist, rtoffset, 1, NULL);
 				}
 
 				/*
@@ -1185,7 +1337,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 															   resultrel,
 															   rtoffset,
 															   NRM_EQUAL,
-															   NUM_EXEC_TLIST(plan));
+															   NUM_EXEC_TLIST(plan),
+															   NULL, NULL);
 
 							/* Fix quals too. */
 							action->qual = (Node *) fix_join_expr(root,
@@ -1194,7 +1347,8 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 																  resultrel,
 																  rtoffset,
 																  NRM_EQUAL,
-																  NUM_EXEC_QUAL(plan));
+																  NUM_EXEC_QUAL(plan),
+																  NULL, NULL);
 						}
 
 						/* Fix join condition too. */
@@ -1205,7 +1359,9 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
 										  resultrel,
 										  rtoffset,
 										  NRM_EQUAL,
-										  NUM_EXEC_QUAL(plan));
+										  NUM_EXEC_QUAL(plan),
+										  NULL,
+										  NULL);
 						newMJC = lappend(newMJC, mergeJoinCondition);
 					}
 					splan->mergeJoinConditions = newMJC;
@@ -1371,13 +1527,16 @@ set_indexonlyscan_references(PlannerInfo *root,
 					   NUM_EXEC_QUAL((Plan *) plan));
 	/* indexqual is already transformed to reference index columns */
 	plan->indexqual = fix_scan_list(root, plan->indexqual,
-									rtoffset, 1);
+									rtoffset, 1,
+									&plan->scan.reference_attrs);
 	/* indexorderby is already transformed to reference index columns */
 	plan->indexorderby = fix_scan_list(root, plan->indexorderby,
-									   rtoffset, 1);
+									   rtoffset, 1,
+									   &plan->scan.reference_attrs);
 	/* indextlist must NOT be transformed to reference index columns */
 	plan->indextlist = fix_scan_list(root, plan->indextlist,
-									 rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+									 rtoffset, NUM_EXEC_TLIST((Plan *) plan),
+									 &plan->scan.reference_attrs);
 
 	pfree(index_itlist);
 
@@ -1424,10 +1583,10 @@ set_subqueryscan_references(PlannerInfo *root,
 		plan->scan.scanrelid += rtoffset;
 		plan->scan.plan.targetlist =
 			fix_scan_list(root, plan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) plan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) plan), NULL);
 		plan->scan.plan.qual =
 			fix_scan_list(root, plan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) plan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) plan), NULL);
 
 		result = (Plan *) plan;
 	}
@@ -1627,7 +1786,7 @@ set_foreignscan_references(PlannerInfo *root,
 		/* fdw_scan_tlist itself just needs fix_scan_list() adjustments */
 		fscan->fdw_scan_tlist =
 			fix_scan_list(root, fscan->fdw_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 	}
 	else
 	{
@@ -1637,16 +1796,16 @@ set_foreignscan_references(PlannerInfo *root,
 		 */
 		fscan->scan.plan.targetlist =
 			fix_scan_list(root, fscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) fscan), NULL);
 		fscan->scan.plan.qual =
 			fix_scan_list(root, fscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_exprs =
 			fix_scan_list(root, fscan->fdw_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 		fscan->fdw_recheck_quals =
 			fix_scan_list(root, fscan->fdw_recheck_quals,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) fscan), NULL);
 	}
 
 	fscan->fs_relids = offset_relid_set(fscan->fs_relids, rtoffset);
@@ -1705,20 +1864,20 @@ set_customscan_references(PlannerInfo *root,
 		/* custom_scan_tlist itself just needs fix_scan_list() adjustments */
 		cscan->custom_scan_tlist =
 			fix_scan_list(root, cscan->custom_scan_tlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 	}
 	else
 	{
 		/* Adjust tlist, qual, custom_exprs in the standard way */
 		cscan->scan.plan.targetlist =
 			fix_scan_list(root, cscan->scan.plan.targetlist,
-						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan));
+						  rtoffset, NUM_EXEC_TLIST((Plan *) cscan), NULL);
 		cscan->scan.plan.qual =
 			fix_scan_list(root, cscan->scan.plan.qual,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 		cscan->custom_exprs =
 			fix_scan_list(root, cscan->custom_exprs,
-						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan));
+						  rtoffset, NUM_EXEC_QUAL((Plan *) cscan), NULL);
 	}
 
 	/* Adjust child plan-nodes recursively, if needed */
@@ -2126,6 +2285,82 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
 	return (Node *) bestplan;
 }
 
+
+static inline void
+setup_intermediate_level_ctx(intermediate_level_context *ctx)
+{
+	ctx->level = 0;
+	ctx->level_added = false;
+}
+
+static inline void
+setup_intermediate_var_ref_ctx(intermediate_var_ref_context *ctx, Bitmapset **final_ref_attrs)
+{
+	ctx->existing_attrs = NULL;
+	ctx->final_ref_attrs = final_ref_attrs;
+}
+
+/*
+ * increase_level_for_pre_detoast
+ *	Check if the given Expr could detoast a Var directly, if yes,
+ * increase the level and return true. otherwise return false;
+ */
+static inline void
+increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
+{
+	/* The following nodes is impossible to detoast a Var directly. */
+	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
+	{
+		ctx->level_added = false;
+	}
+	else if (IsA(node, FuncExpr))
+	{
+		Oid funcOid = castNode(FuncExpr, node)->funcid;
+
+		if (funcOid == F_PG_COLUMN_COMPRESSION || func_use_slice_detoast(funcOid))
+			ctx->level_added =  false;
+		else
+		{
+			ctx->level_added = true;
+			ctx->level += 1;
+		}
+	}
+	else
+	{
+		ctx->level_added = true;
+		ctx->level += 1;
+	}
+}
+
+static inline void
+decreased_level_for_pre_detoast(intermediate_level_context *ctx)
+{
+	if (ctx->level_added)
+		ctx->level -= 1;
+
+	ctx->level_added = false;
+}
+
+/*
+ * add_pre_detoast_vars
+ *		add the var's information into pre_detoast_attrs when the check is pass.
+ */
+static inline void
+add_pre_detoast_vars(intermediate_level_context *level_ctx,
+					 intermediate_var_ref_context *ctx,
+					 Var *var)
+{
+	int			attno;
+
+	if (level_ctx->level <= 1 ||
+		ctx->final_ref_attrs == NULL ||
+		var->varattno <= 0)
+		return;
+
+	attno = var->varattno - 1;
+	*ctx->final_ref_attrs = bms_add_member(*ctx->final_ref_attrs, attno);
+}
+
 /*
  * fix_scan_expr
  *		Do set_plan_references processing on a scan-level expression
@@ -2140,18 +2375,24 @@ fix_alternative_subplan(PlannerInfo *root, AlternativeSubPlan *asplan,
  * 'node': the expression to be modified
  * 'rtoffset': how much to increment varnos by
  * 'num_exec': estimated number of executions of expression
+ * 'scan_reference_attrs': gather which vars are potential to run the detoast
+ * 	on this expr, NULL means the caller doesn't have interests on this.
  *
  * The expression tree is either copied-and-modified, or modified in-place
  * if that seems safe.
  */
 static Node *
-fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
+fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset,
+			  double num_exec, Bitmapset **scan_reference_attrs)
 {
 	fix_scan_expr_context context;
 
 	context.root = root;
 	context.rtoffset = rtoffset;
 	context.num_exec = num_exec;
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.scan_reference_attrs,
+								   scan_reference_attrs);
 
 	if (rtoffset != 0 ||
 		root->multiexpr_params != NIL ||
@@ -2182,8 +2423,13 @@ fix_scan_expr(PlannerInfo *root, Node *node, int rtoffset, double num_exec)
 static Node *
 fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 {
+	Node	   *n;
+
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = copyVar((Var *) node);
@@ -2201,10 +2447,18 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 			var->varno += context->rtoffset;
 		if (var->varnosyn > 0)
 			var->varnosyn += context->rtoffset;
+
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 var);
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) var;
 	}
 	if (IsA(node, Param))
+	{
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return fix_param_node(context->root, (Param *) node);
+	}
 	if (IsA(node, Aggref))
 	{
 		Aggref	   *aggref = (Aggref *) node;
@@ -2214,8 +2468,10 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		aggparam = find_minmax_agg_replacement_param(context->root, aggref);
 		if (aggparam != NULL)
 		{
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			/* Make a copy of the Param for paranoia's sake */
 			return (Node *) copyObject(aggparam);
+
 		}
 		/* If no match, just fall through to process it normally */
 	}
@@ -2225,6 +2481,7 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 
 		Assert(!IS_SPECIAL_VARNO(cexpr->cvarno));
 		cexpr->cvarno += context->rtoffset;
+		decreased_level_for_pre_detoast(&context->level_ctx);
 		return (Node *) cexpr;
 	}
 	if (IsA(node, PlaceHolderVar))
@@ -2233,29 +2490,52 @@ fix_scan_expr_mutator(Node *node, fix_scan_expr_context *context)
 		PlaceHolderVar *phv = (PlaceHolderVar *) node;
 
 		/* XXX can we assert something about phnullingrels? */
-		return fix_scan_expr_mutator((Node *) phv->phexpr, context);
+		Node	   *n2 = fix_scan_expr_mutator((Node *) phv->phexpr, context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
 	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_scan_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		Node	   *n2 = fix_scan_expr_mutator(fix_alternative_subplan(context->root,
+																	   (AlternativeSubPlan *) node,
+																	   context->num_exec),
+											   context);
+
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return n2;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node, fix_scan_expr_mutator,
-								   (void *) context);
+	n = expression_tree_mutator(node, fix_scan_expr_mutator, (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return n;
 }
 
 static bool
 fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
 {
+	bool		ret;
+
 	if (node == NULL)
 		return false;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
+	if (IsA(node, Var))
+	{
+		add_pre_detoast_vars(&context->level_ctx,
+							 &context->scan_reference_attrs,
+							 castNode(Var, node));
+	}
 	Assert(!(IsA(node, Var) && ((Var *) node)->varno == ROWID_VAR));
 	Assert(!IsA(node, PlaceHolderVar));
 	Assert(!IsA(node, AlternativeSubPlan));
 	fix_expr_common(context->root, node);
-	return expression_tree_walker(node, fix_scan_expr_walker,
-								  (void *) context);
+	ret = expression_tree_walker(node, fix_scan_expr_walker,
+								 (void *) context);
+
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret;
 }
 
 /*
@@ -2291,7 +2571,10 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 								   (Index) 0,
 								   rtoffset,
 								   NRM_EQUAL,
-								   NUM_EXEC_QUAL((Plan *) join));
+								   NUM_EXEC_QUAL((Plan *) join),
+								   &join->outer_reference_attrs,
+								   &join->inner_reference_attrs
+		);
 
 	/* Now do join-type-specific stuff */
 	if (IsA(join, NestLoop))
@@ -2338,7 +2621,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										 (Index) 0,
 										 rtoffset,
 										 NRM_EQUAL,
-										 NUM_EXEC_QUAL((Plan *) join));
+										 NUM_EXEC_QUAL((Plan *) join),
+										 &join->outer_reference_attrs,
+										 &join->inner_reference_attrs);
 	}
 	else if (IsA(join, HashJoin))
 	{
@@ -2351,7 +2636,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										(Index) 0,
 										rtoffset,
 										NRM_EQUAL,
-										NUM_EXEC_QUAL((Plan *) join));
+										NUM_EXEC_QUAL((Plan *) join),
+										&join->outer_reference_attrs,
+										&join->inner_reference_attrs);
 
 		/*
 		 * HashJoin's hashkeys are used to look for matching tuples from its
@@ -2383,7 +2670,9 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 										  (Index) 0,
 										  rtoffset,
 										  (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-										  NUM_EXEC_TLIST((Plan *) join));
+										  NUM_EXEC_TLIST((Plan *) join),
+										  &join->outer_reference_attrs,
+										  &join->inner_reference_attrs);
 	join->plan.qual = fix_join_expr(root,
 									join->plan.qual,
 									outer_itlist,
@@ -2391,8 +2680,20 @@ set_join_references(PlannerInfo *root, Join *join, int rtoffset)
 									(Index) 0,
 									rtoffset,
 									(join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
-									NUM_EXEC_QUAL((Plan *) join));
-
+									NUM_EXEC_QUAL((Plan *) join),
+									&join->outer_reference_attrs,
+									&join->inner_reference_attrs);
+
+	join->plan.forbid_pre_detoast_vars = fix_join_expr(root,
+													   join->plan.forbid_pre_detoast_vars,
+													   outer_itlist,
+													   inner_itlist,
+													   (Index) 0,
+													   rtoffset,
+													   (join->jointype == JOIN_INNER ? NRM_EQUAL : NRM_SUPERSET),
+													   NUM_EXEC_TLIST((Plan *) join),
+													   NULL,
+													   NULL);
 	pfree(outer_itlist);
 	pfree(inner_itlist);
 }
@@ -3025,9 +3326,12 @@ fix_join_expr(PlannerInfo *root,
 			  Index acceptable_rel,
 			  int rtoffset,
 			  NullingRelsMatch nrm_match,
-			  double num_exec)
+			  double num_exec,
+			  Bitmapset **outer_reference_attrs,
+			  Bitmapset **inner_reference_attrs)
 {
 	fix_join_expr_context context;
+	List	   *ret;
 
 	context.root = root;
 	context.outer_itlist = outer_itlist;
@@ -3036,16 +3340,30 @@ fix_join_expr(PlannerInfo *root,
 	context.rtoffset = rtoffset;
 	context.nrm_match = nrm_match;
 	context.num_exec = num_exec;
-	return (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	setup_intermediate_level_ctx(&context.level_ctx);
+	setup_intermediate_var_ref_ctx(&context.outer_reference_attrs, outer_reference_attrs);
+	setup_intermediate_var_ref_ctx(&context.inner_reference_attrs, inner_reference_attrs);
+
+	ret = (List *) fix_join_expr_mutator((Node *) clauses, &context);
+
+	bms_free(context.outer_reference_attrs.existing_attrs);
+	bms_free(context.inner_reference_attrs.existing_attrs);
+
+	return ret;
 }
 
 static Node *
 fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 {
 	Var		   *newvar;
+	Node	   *ret_node;
 
 	if (node == NULL)
 		return NULL;
+
+	increase_level_for_pre_detoast(node, &context->level_ctx);
+
 	if (IsA(node, Var))
 	{
 		Var		   *var = (Var *) node;
@@ -3059,7 +3377,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* then in the inner. */
@@ -3071,7 +3395,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->rtoffset,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If it's for acceptable_rel, adjust and return it */
@@ -3081,6 +3411,9 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 			var->varno += context->rtoffset;
 			if (var->varnosyn > 0)
 				var->varnosyn += context->rtoffset;
+			/* XXX acceptable_rel?  we can ignore it for safety. */
+			decreased_level_for_pre_detoast(&context->level_ctx);
+
 			return (Node *) var;
 		}
 
@@ -3099,22 +3432,38 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  OUTER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->outer_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 		if (context->inner_itlist && context->inner_itlist->has_ph_vars)
 		{
+
 			newvar = search_indexed_tlist_for_phv(phv,
 												  context->inner_itlist,
 												  INNER_VAR,
 												  context->nrm_match);
 			if (newvar)
+			{
+				add_pre_detoast_vars(&context->level_ctx,
+									 &context->inner_reference_attrs,
+									 newvar);
+				decreased_level_for_pre_detoast(&context->level_ctx);
 				return (Node *) newvar;
+			}
 		}
 
 		/* If not supplied by input plans, evaluate the contained expr */
 		/* XXX can we assert something about phnullingrels? */
-		return fix_join_expr_mutator((Node *) phv->phexpr, context);
+		ret_node = fix_join_expr_mutator((Node *) phv->phexpr, context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
 	}
+
 	/* Try matching more complex expressions too, if tlists have any */
 	if (context->outer_itlist && context->outer_itlist->has_non_vars)
 	{
@@ -3122,7 +3471,13 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->outer_itlist,
 												  OUTER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->outer_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	if (context->inner_itlist && context->inner_itlist->has_non_vars)
 	{
@@ -3130,20 +3485,36 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
 												  context->inner_itlist,
 												  INNER_VAR);
 		if (newvar)
+		{
+			add_pre_detoast_vars(&context->level_ctx,
+								 &context->inner_reference_attrs,
+								 newvar);
+			decreased_level_for_pre_detoast(&context->level_ctx);
 			return (Node *) newvar;
+		}
 	}
 	/* Special cases (apply only AFTER failing to match to lower tlist) */
 	if (IsA(node, Param))
-		return fix_param_node(context->root, (Param *) node);
+	{
+		ret_node = fix_param_node(context->root, (Param *) node);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	if (IsA(node, AlternativeSubPlan))
-		return fix_join_expr_mutator(fix_alternative_subplan(context->root,
-															 (AlternativeSubPlan *) node,
-															 context->num_exec),
-									 context);
+	{
+		ret_node = fix_join_expr_mutator(fix_alternative_subplan(context->root,
+																 (AlternativeSubPlan *) node,
+																 context->num_exec),
+										 context);
+		decreased_level_for_pre_detoast(&context->level_ctx);
+		return ret_node;
+	}
 	fix_expr_common(context->root, node);
-	return expression_tree_mutator(node,
-								   fix_join_expr_mutator,
-								   (void *) context);
+	ret_node = expression_tree_mutator(node,
+									   fix_join_expr_mutator,
+									   (void *) context);
+	decreased_level_for_pre_detoast(&context->level_ctx);
+	return ret_node;
 }
 
 /*
@@ -3178,7 +3549,8 @@ fix_join_expr_mutator(Node *node, fix_join_expr_context *context)
  * varno = newvarno, varattno = resno of corresponding targetlist element.
  * The original tree is not modified.
  */
-static Node *
+static Node *					/* XXX: shall I care about this for shared
+								 * detoast optimization? */
 fix_upper_expr(PlannerInfo *root,
 			   Node *node,
 			   indexed_tlist *subplan_itlist,
@@ -3333,7 +3705,10 @@ set_returning_clause_references(PlannerInfo *root,
 						  resultRelation,
 						  rtoffset,
 						  NRM_EQUAL,
-						  NUM_EXEC_TLIST(topplan));
+						  NUM_EXEC_TLIST(topplan),
+						  NULL,
+						  NULL
+		);
 
 	pfree(itlist);
 
diff --git a/src/include/access/detoast.h b/src/include/access/detoast.h
index 12d8cdb356..9ddc05604e 100644
--- a/src/include/access/detoast.h
+++ b/src/include/access/detoast.h
@@ -42,6 +42,7 @@ do { \
  * ----------
  */
 extern struct varlena *detoast_external_attr(struct varlena *attr);
+extern struct varlena *detoast_external_attr_ext(struct varlena *attr, MemoryContext ctx);
 
 /* ----------
  * detoast_attr() -
@@ -51,6 +52,8 @@ extern struct varlena *detoast_external_attr(struct varlena *attr);
  * ----------
  */
 extern struct varlena *detoast_attr(struct varlena *attr);
+extern struct varlena *detoast_attr_ext(struct varlena *attr, MemoryContext ctx);
+
 
 /* ----------
  * detoast_attr_slice() -
diff --git a/src/include/access/toast_compression.h b/src/include/access/toast_compression.h
index 64d5e079fa..00ea153e4e 100644
--- a/src/include/access/toast_compression.h
+++ b/src/include/access/toast_compression.h
@@ -55,13 +55,13 @@ typedef enum ToastCompressionId
 
 /* pglz compression/decompression routines */
 extern struct varlena *pglz_compress_datum(const struct varlena *value);
-extern struct varlena *pglz_decompress_datum(const struct varlena *value);
+extern struct varlena *pglz_decompress_datum(const struct varlena *value, MemoryContext ctx);
 extern struct varlena *pglz_decompress_datum_slice(const struct varlena *value,
 												   int32 slicelength);
 
 /* lz4 compression/decompression routines */
 extern struct varlena *lz4_compress_datum(const struct varlena *value);
-extern struct varlena *lz4_decompress_datum(const struct varlena *value);
+extern struct varlena *lz4_decompress_datum(const struct varlena *value, MemoryContext ctx);
 extern struct varlena *lz4_decompress_datum_slice(const struct varlena *value,
 												  int32 slicelength);
 
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 64698202a5..9f070d784e 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,17 @@ typedef enum ExprEvalOp
 	EEOP_OUTER_VAR,
 	EEOP_SCAN_VAR,
 
+	/*
+	 * compute non-system Var value with shared-detoast-datum logic, use some
+	 * dedicated steps rather than add extra logic to existing steps is for
+	 * performance aspect, within this way, we just decide if the extra logic
+	 * is needed at ExecInitExpr stage once rather than every time of
+	 * ExecInterpExpr.
+	 */
+	EEOP_INNER_VAR_TOAST,
+	EEOP_OUTER_VAR_TOAST,
+	EEOP_SCAN_VAR_TOAST,
+
 	/* compute system Var value */
 	EEOP_INNER_SYSVAR,
 	EEOP_OUTER_SYSVAR,
@@ -855,5 +866,6 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
+extern void ExecSlotDetoastDatumExternal(TupleTableSlot *slot, int attnum);
 
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index b82655e7e5..dd5d28db17 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -19,6 +19,7 @@
 #include "access/sysattr.h"
 #include "access/tupdesc.h"
 #include "storage/buf.h"
+#include "utils/memutils.h"
 
 /*----------
  * The executor stores tuples in a "tuple table" which is a List of
@@ -126,6 +127,7 @@ typedef struct TupleTableSlot
 #define FIELDNO_TUPLETABLESLOT_ISNULL 6
 	bool	   *tts_isnull;		/* current per-attribute isnull flags */
 	MemoryContext tts_mcxt;		/* slot itself is in this context */
+	MemoryContext tts_data_mctx; /* The external content of tts_values[*] */
 	ItemPointerData tts_tid;	/* stored tuple's tid */
 	Oid			tts_tableOid;	/* table oid of tuple */
 } TupleTableSlot;
@@ -447,12 +449,24 @@ slot_is_current_xact_tuple(TupleTableSlot *slot)
 	return slot->tts_ops->is_current_xact_tuple(slot);
 }
 
+/*
+ * ExecFreePreDetoastDatum - free the memory which is allocated in tts_data_mcxt.
+ */
+static inline void
+ExecFreePreDetoastDatum(TupleTableSlot *slot)
+{
+	if (slot->tts_data_mctx)
+		MemoryContextResetOnly(slot->tts_data_mctx);
+}
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
 static inline TupleTableSlot *
 ExecClearTuple(TupleTableSlot *slot)
 {
+	ExecFreePreDetoastDatum(slot);
+
 	slot->tts_ops->clear(slot);
 
 	return slot;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8bc421e7c0..43654b2acd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1565,6 +1565,12 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the Scan nodes.
+	 */
+	Bitmapset  *scan_pre_detoast_attrs;
 } ScanState;
 
 /* ----------------
@@ -2088,6 +2094,13 @@ typedef struct JoinState
 	bool		single_match;	/* True if we should skip to next outer tuple
 								 * after finding one inner match */
 	ExprState  *joinqual;		/* JOIN quals (in addition to ps.qual) */
+
+	/*
+	 * The final attributes which should apply the pre-detoast-attrs logic on
+	 * the join nodes.
+	 */
+	Bitmapset  *outer_pre_detoast_attrs;
+	Bitmapset  *inner_pre_detoast_attrs;
 } JoinState;
 
 /* ----------------
@@ -2848,4 +2861,5 @@ typedef struct LimitState
 	TupleTableSlot *last_slot;	/* slot for evaluation of ties */
 } LimitState;
 
+extern void SetPredetoastAttrsForJoin(JoinState *joinstate);
 #endif							/* EXECNODES_H */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e025679f89..a5d6966da7 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -169,6 +169,13 @@ typedef struct Plan
 	 */
 	Bitmapset  *extParam;
 	Bitmapset  *allParam;
+
+	/*
+	 * A list of Vars which should not apply the shared-detoast-datum logic
+	 * since the upper nodes like Sort/Hash wants them as small as possible.
+	 * It's a subset of targetlist in each Plan node.
+	 */
+	List	   *forbid_pre_detoast_vars;
 } Plan;
 
 /* ----------------
@@ -387,6 +394,16 @@ typedef struct Scan
 
 	Plan		plan;
 	Index		scanrelid;		/* relid is index into the range table */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *reference_attrs;
 } Scan;
 
 /* ----------------
@@ -791,6 +808,17 @@ typedef struct Join
 	JoinType	jointype;
 	bool		inner_unique;
 	List	   *joinqual;		/* JOIN quals (in addition to plan.qual) */
+
+	/*
+	 * Records of var's varattno - 1 where the Var is accessed indirectly by
+	 * any expression, like a > 3.  However a IS [NOT] NULL is not included
+	 * since it doesn't access the tts_values[*] at all.
+	 *
+	 * This is a essential information to figure out which attrs should use
+	 * the pre-detoast-attrs logic.
+	 */
+	Bitmapset  *outer_reference_attrs;
+	Bitmapset  *inner_reference_attrs;
 } Join;
 
 /* ----------------
@@ -871,6 +899,11 @@ typedef struct HashJoin
 	 * perform lookups in the hashtable over the inner plan.
 	 */
 	List	   *hashkeys;
+
+	/*
+	 * Whether the left plan tree should use a SMALL_TLIST.
+	 */
+	bool		left_small_tlist;
 } HashJoin;
 
 /* ----------------
@@ -1590,4 +1623,24 @@ typedef enum MonotonicFunction
 	MONOTONICFUNC_BOTH = MONOTONICFUNC_INCREASING | MONOTONICFUNC_DECREASING,
 } MonotonicFunction;
 
+static inline bool
+is_join_plan(Plan *plan)
+{
+	return (plan != NULL) && (IsA(plan, NestLoop) || IsA(plan, HashJoin) || IsA(plan, MergeJoin));
+}
+
+static inline bool
+is_scan_plan(Plan *plan)
+{
+	return (plan != NULL) &&
+		(IsA(plan, SeqScan) ||
+		 IsA(plan, SampleScan) ||
+		 IsA(plan, IndexScan) ||
+		 IsA(plan, IndexOnlyScan) ||
+		 IsA(plan, BitmapIndexScan) ||
+		 IsA(plan, BitmapHeapScan) ||
+		 IsA(plan, TidScan) ||
+		 IsA(plan, SubqueryScan));
+}
+
 #endif							/* PLANNODES_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 34ec87a85e..8d18fe2472 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4106,6 +4106,8 @@ cb_cleanup_dir
 cb_options
 cb_tablespace
 cb_tablespace_mapping
+intermediate_var_ref_context
+intermediate_level_context
 manifest_data
 manifest_writer
 rfile
-- 
2.34.1

#50Andy Fan
zhihuifan1213@163.com
In reply to: Nikita Malakhov (#47)
Re: Shared detoast Datum proposal

Nikita Malakhov <hukutoc@gmail.com> writes:

Hi,

Andy, glad you've not lost interest in this work, I'm looking
forward to your improvements!

Thanks for your words, I've adjusted to the rhythm of the community and
welcome more feedback:)

--
Best Regards
Andy Fan

#51Robert Haas
robertmhaas@gmail.com
In reply to: Andy Fan (#49)
Re: Shared detoast Datum proposal

On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote:

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from tts->tts_values[*], no
matter what kind of TupleTableSlot is. So if we can put the "detoast
datum" back to the "original " tts_values, then nothing need to be
changed.

Yeah, I don't think you can do that.

- Saving the "detoast datum" version into tts_values[*] doesn't break
the above semantic and the ExprEval engine just get a detoast version
so it doesn't need to detoast it again.

I don't think this is true. If it is true, it needs to be extremely
well-justified. Even if this seems to work in simple cases, I suspect
there will be cases where it breaks badly, resulting in memory leaks
or server crashes or some other kind of horrible problem that I can't
quite imagine. Unfortunately, my knowledge of this code isn't
fantastic, so I can't say exactly what bad thing will happen, and I
can't even say with 100% certainty that anything bad will happen, but
I bet it will. It seems like it goes against everything I understand
about how TupleTableSlots are supposed to be used. (Andres would be
the best person to give a definitive answer here.)

- The keypoint is the memory management and effeiciency. for now I think
all the places where "slot->tts_nvalid" is set to 0 means the
tts_values[*] is no longer validate, then this is the place we should
release the memory for the "detoast datum". All the other places like
ExecMaterializeSlot or ExecCopySlot also need to think about the
"detoast datum" as well.

This might be a way of addressing some of the issues with this idea,
but I doubt it will be acceptable. I don't think we want to complicate
the slot API for this feature. There's too much downside to doing
that, in terms of performance and understandability.

--
Robert Haas
EDB: http://www.enterprisedb.com

#52Andy Fan
zhihuifan1213@163.com
In reply to: Robert Haas (#51)
Re: Shared detoast Datum proposal

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote:

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from tts->tts_values[*], no
matter what kind of TupleTableSlot is. So if we can put the "detoast
datum" back to the "original " tts_values, then nothing need to be
changed.

Yeah, I don't think you can do that.

- Saving the "detoast datum" version into tts_values[*] doesn't break
the above semantic and the ExprEval engine just get a detoast version
so it doesn't need to detoast it again.

I don't think this is true. If it is true, it needs to be extremely
well-justified. Even if this seems to work in simple cases, I suspect
there will be cases where it breaks badly, resulting in memory leaks
or server crashes or some other kind of horrible problem that I can't
quite imagine. Unfortunately, my knowledge of this code isn't
fantastic, so I can't say exactly what bad thing will happen, and I
can't even say with 100% certainty that anything bad will happen, but
I bet it will. It seems like it goes against everything I understand
about how TupleTableSlots are supposed to be used. (Andres would be
the best person to give a definitive answer here.)

- The keypoint is the memory management and effeiciency. for now I think
all the places where "slot->tts_nvalid" is set to 0 means the
tts_values[*] is no longer validate, then this is the place we should
release the memory for the "detoast datum". All the other places like
ExecMaterializeSlot or ExecCopySlot also need to think about the
"detoast datum" as well.

This might be a way of addressing some of the issues with this idea,
but I doubt it will be acceptable. I don't think we want to complicate
the slot API for this feature. There's too much downside to doing
that, in terms of performance and understandability.

Thanks for the doubt! Let's see if Andres has time to look at this. I
feel great about the current state.

--
Best Regards
Andy Fan

#53Andy Fan
zhihuifan1213@163.com
In reply to: Andy Fan (#52)
Re: Shared detoast Datum proposal

Hi,

Let's me clarify the current state of this patch and what kind of help
is needed. You can check [1]/messages/by-id/87il4jrk1l.fsf@163.com for what kind of problem it try to solve
and what is the current approach.

The current blocker is if the approach is safe (potential to memory leak
crash etc.). And I want to have some agreement on this step at least.

"""
When we access [some desired] toast column in EEOP_{SCAN|INNER_OUTER}_VAR
steps, we can detoast it immediately and save it back to
slot->tts_values. so the later ExprState can use the same detoast version
all the time. this should be more the most effective and simplest way.
"""

IMO, it doesn't break any design of TupleTableSlot and should be
safe. Here is my understanding of TupleTableSlot, please correct me if
anything I missed.

(a). TupleTableSlot define tts_values[*] / tts_isnulls[*] at the high level.
When the upper layer want a attnum, it just access tts_values/nulls[n].

(b). TupleTableSlot has many different variants, depends on how to get
tts_values[*] from them, like VirtualTupleTableSlot, HeapTupleTableSlot,
how to manage its resource (mainly memory) when the life cycle is
done, for example BufferHeapTupleTableSlot.

(c). TupleTableSlot defines a lot operation against on that, like
getsomeattrs, copy, clear and so on, for the different variants.

the impacts of my proposal on the above factor:

(a). it doesn't break it clearly,
(b). it adds a 'detoast' step for step (b) when fill the tts_values[*],
(this is done conditionally, see [1]/messages/by-id/87il4jrk1l.fsf@163.com for the overall design, we
currently focus the safety). What's more, I only added the step in
EEOP_{SCAN|INNER_OUTER}_VAR_TOAST step, it reduce the impact in another
step.
(c). a special MemoryContext in TupleTableSlot is introduced for the
detoast datum, it is reset whenever the TupleTableSlot.tts_nvalid
= 0. and I also double check every operation defined in
TupleTableSlotOps, looks nothing should be specially handled.

Robert has expressed his concern as below and wanted Andres to have a
look at, but unluckly Andres doesn't have such time so far:(

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, May 22, 2024 at 9:46 PM Andy Fan <zhihuifan1213@163.com> wrote:

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from tts->tts_values[*], no
matter what kind of TupleTableSlot is. So if we can put the "detoast
datum" back to the "original " tts_values, then nothing need to be
changed.

Yeah, I don't think you can do that.

- Saving the "detoast datum" version into tts_values[*] doesn't break
the above semantic and the ExprEval engine just get a detoast version
so it doesn't need to detoast it again.

I don't think this is true. If it is true, it needs to be extremely
well-justified. Even if this seems to work in simple cases, I suspect
there will be cases where it breaks badly, resulting in memory leaks
or server crashes or some other kind of horrible problem that I can't
quite imagine. Unfortunately, my knowledge of this code isn't
fantastic, so I can't say exactly what bad thing will happen, and I
can't even say with 100% certainty that anything bad will happen, but
I bet it will. It seems like it goes against everything I understand
about how TupleTableSlots are supposed to be used. (Andres would be
the best person to give a definitive answer here.)

I think having a discussion of the design of TupleTableSlots would be
helpful for the progress since in my knowldege it doesn't against
anything of the TupleTaleSlots :(, as I stated above. I'm sure I'm
pretty open on this.

- The keypoint is the memory management and effeiciency. for now I think
all the places where "slot->tts_nvalid" is set to 0 means the
tts_values[*] is no longer validate, then this is the place we should
release the memory for the "detoast datum". All the other places like
ExecMaterializeSlot or ExecCopySlot also need to think about the
"detoast datum" as well.

This might be a way of addressing some of the issues with this idea,
but I doubt it will be acceptable. I don't think we want to complicate
the slot API for this feature. There's too much downside to doing
that, in terms of performance and understandability.

Complexity: I double check the code, it has very limitted changes on the
TupleTupleSlot APIs (less than 10 rows). see tuptable.h.

Performance: by design, it just move the chance of detoast action
*conditionly* and put it back to tts_values[*], there is no extra
overhead to find out the detoast version for sharing.

Understandability: based on how do we understand TupleTableSlot:)

[1]: /messages/by-id/87il4jrk1l.fsf@163.com

--
Best Regards
Andy Fan