Lazy JIT IR code generation to increase JIT speed with partitions
Hi,
I would like to propose a small patch to the JIT machinery which makes
the IR code generation lazy. The reason for postponing the generation of
the IR code is that with partitions we get an explosion in the number of
JIT functions generated as many child tables are involved, each with
their own JITted functions, especially when e.g. partition-aware
joins/aggregates are enabled. However, only a fraction of those
functions is actually executed because the Parallel Append node
distributes the workers among its child nodes. With the attached patch the
IR code is generated lazily, so this is no longer a problem.
In benchmarks of TPC-H and TPC-DS like queries with partitioning by hash
I have seen query runtimes increase by 20+ seconds even for the simpler
queries. I also created a small benchmark to reproduce the case easily
(see the attached sql file):
without patch, using 7 launched workers:
- without jit: ~220ms
- with jit: ~1880ms
without patch, using 50 launched workers:
- without jit: ~280ms
- with jit: ~3400ms
with patch, using 7 launched workers:
- without jit: ~220ms
- with jit: ~590ms
with patch, using 50 launched workers:
- without jit: ~280ms
- with jit: ~530ms
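For reference, a minimal sketch of the kind of setup behind those numbers
(not the actual attached sql file; table name, partition count and row
count are illustrative):

CREATE TABLE t (a int, b int) PARTITION BY HASH (a);

DO $$
BEGIN
    FOR i IN 0..63 LOOP
        EXECUTE format('CREATE TABLE t_%s PARTITION OF t FOR VALUES WITH (MODULUS 64, REMAINDER %s)', i, i);
    END LOOP;
END $$;

INSERT INTO t SELECT g, g % 100 FROM generate_series(1, 1000000) g;
ANALYZE t;

SET jit = on;
SET jit_above_cost = 0;                   -- JIT every plan regardless of cost
SET enable_partitionwise_aggregate = on;  -- per-partition aggregation, i.e. more JITted functions
SET max_parallel_workers_per_gather = 7;  -- 50 for the second run (raise max_parallel_workers
                                          -- and max_worker_processes accordingly)

EXPLAIN (ANALYZE) SELECT b, count(*) FROM t GROUP BY b;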
Thoughts?
With Regards,
Luc Vlaming
Swarm64
Attachments:
v1-0001-generate-JIT-IR-code-lazily.patch (text/x-patch)
From 5dd5df7e29bb3262b8f7f02c4fd3896bb34c3133 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <luc@swarm64.com>
Date: Mon, 28 Dec 2020 09:01:32 +0100
Subject: [PATCH v1] generate JIT IR code lazily
---
src/backend/jit/llvm/llvmjit_expr.c | 98 +++++++++++++++++------------
1 file changed, 59 insertions(+), 39 deletions(-)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 3aa08a9743..2ac79b7571 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -52,6 +52,7 @@ typedef struct CompiledExprState
} CompiledExprState;
+static Datum ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static Datum ExecRunCompiledExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static LLVMValueRef BuildV1Call(LLVMJitContext *context, LLVMBuilderRef b,
@@ -70,18 +71,66 @@ static LLVMValueRef create_LifetimeEnd(LLVMModuleRef mod);
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
((LLVMValueRef[]){__VA_ARGS__}))
-
/*
- * JIT compile expression.
+ * Prepare the JIT compile expression.
*/
bool
llvm_compile_expr(ExprState *state)
{
PlanState *parent = state->parent;
- char *funcname;
-
LLVMJitContext *context = NULL;
+
+ /*
+ * Right now we don't support compiling expressions without a parent, as
+ * we need access to the EState.
+ */
+ Assert(parent);
+
+ llvm_enter_fatal_on_oom();
+
+ /* get or create JIT context */
+ if (parent->state->es_jit)
+ context = (LLVMJitContext *) parent->state->es_jit;
+ else
+ {
+ context = llvm_create_context(parent->state->es_jit_flags);
+ parent->state->es_jit = &context->base;
+ }
+
+ /*
+ * Don't immediately emit nor actually generate the function.
+ * instead do so the first time the expression is actually evaluated.
+ * That allows to emit a lot of functions together, avoiding a lot of
+ * repeated llvm and memory remapping overhead. It also helps with not
+ * compiling functions that will never be evaluated, as can be the case
+ * if e.g. a parallel append node is distributing workers between its
+ * child nodes.
+ */
+ {
+
+ CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
+
+ cstate->context = context;
+
+ state->evalfunc = ExecCompileExpr;
+ state->evalfunc_private = cstate;
+ }
+
+ llvm_leave_fatal_on_oom();
+
+ return true;
+}
+
+/*
+ * JIT compile expression.
+ */
+static Datum
+ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull)
+{
+ CompiledExprState *cstate = state->evalfunc_private;
+ LLVMJitContext *context = cstate->context;
+
LLVMBuilderRef b;
LLVMModuleRef mod;
LLVMValueRef eval_fn;
@@ -125,31 +174,16 @@ llvm_compile_expr(ExprState *state)
llvm_enter_fatal_on_oom();
- /*
- * Right now we don't support compiling expressions without a parent, as
- * we need access to the EState.
- */
- Assert(parent);
-
- /* get or create JIT context */
- if (parent->state->es_jit)
- context = (LLVMJitContext *) parent->state->es_jit;
- else
- {
- context = llvm_create_context(parent->state->es_jit_flags);
- parent->state->es_jit = &context->base;
- }
-
INSTR_TIME_SET_CURRENT(starttime);
mod = llvm_mutable_module(context);
b = LLVMCreateBuilder();
- funcname = llvm_expand_funcname(context, "evalexpr");
+ cstate->funcname = llvm_expand_funcname(context, "evalexpr");
/* create function */
- eval_fn = LLVMAddFunction(mod, funcname,
+ eval_fn = LLVMAddFunction(mod, cstate->funcname,
llvm_pg_var_func_type("TypeExprStateEvalFunc"));
LLVMSetLinkage(eval_fn, LLVMExternalLinkage);
LLVMSetVisibility(eval_fn, LLVMDefaultVisibility);
@@ -2350,30 +2384,16 @@ llvm_compile_expr(ExprState *state)
LLVMDisposeBuilder(b);
- /*
- * Don't immediately emit function, instead do so the first time the
- * expression is actually evaluated. That allows to emit a lot of
- * functions together, avoiding a lot of repeated llvm and memory
- * remapping overhead.
- */
- {
-
- CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
-
- cstate->context = context;
- cstate->funcname = funcname;
-
- state->evalfunc = ExecRunCompiledExpr;
- state->evalfunc_private = cstate;
- }
-
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(context->base.instr.generation_counter,
endtime, starttime);
- return true;
+ /* remove indirection via this function for future calls */
+ state->evalfunc = ExecRunCompiledExpr;
+
+ return ExecRunCompiledExpr(state, econtext, isNull);
}
/*
--
2.25.1
Hi,
Great to see work in this area!
On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
I would like to propose a small patch to the JIT machinery which makes the
IR code generation lazy. The reason for postponing the generation of the IR
code is that with partitions we get an explosion in the number of JIT
functions generated as many child tables are involved, each with their own
JITted functions, especially when e.g. partition-aware joins/aggregates are
enabled. However, only a fraction of those functions is actually executed
because the Parallel Append node distributes the workers among the nodes.
With the attached patch we get a lazy generation which makes that this is no
longer a problem.
I unfortunately don't think this is quite good enough, because it'll
lead to emitting all functions separately, which can also lead to very
substantial increases of the required time (as emitting code is an
expensive step). Obviously that is only relevant in the cases where the
generated functions actually end up being used - which isn't the case in
your example.
If you e.g. look at a query like
SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY blub;
on a table without indexes, you would end up with functions for
- WHERE clause (including deforming)
- projection (including deforming)
- grouping key
- aggregate transition
- aggregate result projection
with your patch each of these would be emitted separately, instead of in
one go. Which IIRC increases the required time by a significant amount,
especially if inlining is done (where each separate code generation ends
up with copies of the inlined code).
As far as I can see you've basically falsified the second part of this
comment (which you moved):
+ /*
+ * Don't immediately emit nor actually generate the function.
+ * instead do so the first time the expression is actually evaluated.
+ * That allows to emit a lot of functions together, avoiding a lot of
+ * repeated llvm and memory remapping overhead. It also helps with not
+ * compiling functions that will never be evaluated, as can be the case
+ * if e.g. a parallel append node is distributing workers between its
+ * child nodes.
+ */
- /*
- * Don't immediately emit function, instead do so the first time the
- * expression is actually evaluated. That allows to emit a lot of
- * functions together, avoiding a lot of repeated llvm and memory
- * remapping overhead.
- */
Greetings,
Andres Freund
On 30-12-2020 02:57, Andres Freund wrote:
Hi,
Happy to help out, and thanks for the info and suggestions.
Also, I should have first searched pgsql-hackers and the like, as I just
found out there are already discussions about this in [1] and [2].
However I think the approach I took can be taken independently and then
other solutions could be added on top.
Assuming I understood all suggestions correctly, the ideas so far are:
1. add a LLVMAddMergeFunctionsPass so that duplicate code is removed and
not optimized several times (see [1]). Requires all code to be emitted
in the same module.
2. JIT only parts of the plan, based on cost (see [2]).
3. Cache compilation results to avoid recompilation. This would either
need a shm-capable optimized IR cache or would not work with parallel
workers.
4. Lazy JITting (this patch)
An idea that might not have been presented in the mailing list yet(?):
5. Only JIT in nodes that process a certain number of rows, assuming
there is a constant overhead for JITting and the goal is to win back more
runtime than that overhead costs.
Going forward I would first try to see if my current approach can work
out. The only idea that would be counterproductive to my solution would
be solution 1. Afterwards I'd like to continue with either solution 2,
5, or 3 in the hopes that we can reduce JIT overhead to a minimum and
can therefore apply it more broadly.
To test out why and where the JIT performance decreased with my solution
I improved the test script and added various queries to model some of
the cases I think we should care about. I have not (yet) done large-scale
benchmarks, as these queries already seemed to show enough problems for
now. There are now 4 queries which test JITting with/without partitions
and with varying numbers of workers and row counts. I hope these are
indeed a somewhat representative set of queries.
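Roughly speaking (the script itself is not inlined here, so this is my
shorthand), a "parameters" line like "jit=on workers=50 jit-inline=1e+06
jit-optimize=0" in the results below corresponds to settings along these
lines:

SET jit = on;
SET jit_above_cost = 0;                    -- make sure the test queries are JITted at all
SET jit_inline_above_cost = 1000000;       -- "jit-inline": 0 = always inline, 1e+06 = effectively never
SET jit_optimize_above_cost = 0;           -- "jit-optimize": 0 = always run the expensive optimization passes
SET max_parallel_workers_per_gather = 50;  -- "workers" (with max_parallel_workers/max_worker_processes raised)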
As pointed out the current patch does create a degradation in
performance wrt queries that are not partitioned (basically q3 and q4).
After looking into those queries I noticed two things:
- q3 is very noisy wrt JIT timings. This seems to be caused by the
parallel workers starting up the JITting, which creates a lot of noise
(e.g. inlining timings varying between 3.8s and 6.2s).
- q4 seems very stable with JIT timings (after the first run).
I'm wondering if this could mean that with parallel workers quite a lot
of time is spent on startup of the llvm machinery and this gets noisy
because of OS interaction and the like?
Either way I took q4 to try to fix the regression and noticed something
interesting, given the comment from Andres: the generation and inlining
time actually decreased, but the optimization and emission time
increased. After trying out various things in the llvm_optimize_module
function and googling a bit, it seems that
LLVMPassManagerBuilderUseInlinerWithThreshold adds some very expensive
passes. I tried to construct some queries where this would actually gain
us but couldn't (yet).
For v2 of the patch set the first patch slightly changes how we optimize
the code, which removes the aforementioned degradations in the queries.
The second patch then makes partitioned queries work a lot better;
interestingly, q4 now also gets a lot faster, but somehow q3 does not.
Because these findings contradict the suggestions/findings from Andres
I'm wondering what I'm missing. I would continue and do some TPC-H like
tests on top, but apart from that I'm not entirely sure where we are
supposed to gain most from the call to
LLVMPassManagerBuilderUseInlinerWithThreshold(). The reason is that in the
scenarios I tested so far it seems that the pain is actually in the code
optimization and possibly rather specific passes and not necessarily in
how many modules are emitted.
If there are more / better queries / datasets / statistics I can run and
gather I would be glad to do so :) To me, however, the current results
seem fairly promising.
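As a pointer for gathering the same kind of statistics: with jit enabled,
EXPLAIN (ANALYZE) ends its output with a "JIT:" section that reports,
among other things, the function count (and, with the first patch applied,
the new module count) plus the generation/inlining/optimization/emission
timings. Illustrative example (any JITted query will do):

SET jit = on;
SET jit_above_cost = 0;
EXPLAIN (ANALYZE, COSTS OFF) SELECT count(*) FROM pg_class;
-- the trailing "JIT:" block shows Functions (plus Modules with patch 0001)
-- and the Generation / Inlining / Optimization / Emission timings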
Looking forward to your thoughts & suggestions.
With regards,
Luc
Swarm64
===================================
Results from the test script on my machine:
parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
query1: HEAD - 08.088901 #runs=5 #JIT=12014
query1: HEAD+01 - 06.369646 #runs=5 #JIT=12014
query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
query2: HEAD - 17.628126 #runs=5 #JIT=24074
query2: HEAD+01 - 10.786114 #runs=5 #JIT=24074
query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
query3: HEAD - 00.220141 #runs=5 #JIT=29
query3: HEAD+01 - 00.210917 #runs=5 #JIT=29
query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
query4: HEAD - 00.052305 #runs=100 #JIT=10
query4: HEAD+01 - 00.038319 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3
parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
query1: HEAD - 14.922044 #runs=5 #JIT=102104
query1: HEAD+01 - 11.356347 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
query2: HEAD - 18.477133 #runs=5 #JIT=40122
query2: HEAD+01 - 11.028579 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
query3: HEAD - 00.235587 #runs=5 #JIT=209
query3: HEAD+01 - 00.219597 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
query4: HEAD - 00.052534 #runs=100 #JIT=10
query4: HEAD+01 - 00.038881 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3
parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
query1: HEAD - 12.696588 #runs=5 #JIT=102104
query1: HEAD+01 - 12.279387 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
query2: HEAD - 12.091824 #runs=5 #JIT=40122
query2: HEAD+01 - 11.543042 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
query3: HEAD - 00.122208 #runs=5 #JIT=209
query3: HEAD+01 - 00.114153 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
query4: HEAD - 00.033125 #runs=100 #JIT=10
query4: HEAD+01 - 00.029818 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3
parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
query1: HEAD - 02.760343 #runs=5 #JIT=102104
query1: HEAD+01 - 02.742944 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
query2: HEAD - 02.396965 #runs=5 #JIT=40122
query2: HEAD+01 - 02.394724 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
query3: HEAD - 00.186633 #runs=5 #JIT=209
query3: HEAD+01 - 00.189623 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
query4: HEAD - 00.013277 #runs=100 #JIT=10
query4: HEAD+01 - 00.012078 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3
parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
query1: HEAD - 02.339973 #runs=5 #JIT=102104
query1: HEAD+01 - 02.333525 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
query2: HEAD - 02.268987 #runs=5 #JIT=40122
query2: HEAD+01 - 02.248729 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
query3: HEAD - 00.084531 #runs=5 #JIT=209
query3: HEAD+01 - 00.091616 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.08668 #runs=5 #JIT=127
query4: HEAD - 00.005371 #runs=100 #JIT=10
query4: HEAD+01 - 00.0053 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3
===================================
[1] /messages/by-id/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793@elevated-dev.com
[2] /messages/by-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C+ksKFpSdZg=q6sTbtQ-v=aw@mail.gmail.com
Attachments:
v2-0001-improve-jitting-performance-somewhat.patch (text/x-patch)
From 59ce1e834a9adb1cd15484df6fd81015e8e8fe80 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <luc@swarm64.com>
Date: Wed, 30 Dec 2020 11:35:12 +0100
Subject: [PATCH v2 1/2] improve jitting performance somewhat
---
src/backend/commands/explain.c | 1 +
src/backend/jit/llvm/llvmjit.c | 17 ++++++-----------
src/include/jit/jit.h | 3 +++
3 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 43f9b01e83..4a6a9dfb67 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -858,6 +858,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
appendStringInfoString(es->str, "JIT:\n");
es->indent++;
+ ExplainPropertyInteger("Modules", NULL, ji->created_modules, es);
ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);
ExplainIndentText(es);
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 9c4fc75f65..dd23a09f40 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -241,6 +241,8 @@ llvm_mutable_module(LLVMJitContext *context)
context->module = LLVMModuleCreateWithName("pg");
LLVMSetTarget(context->module, llvm_triple);
LLVMSetDataLayout(context->module, llvm_layout);
+
+ context->base.instr.created_modules++;
}
return context->module;
@@ -578,12 +580,7 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderSetOptLevel(llvm_pmb, compile_optlevel);
llvm_fpm = LLVMCreateFunctionPassManagerForModule(module);
- if (context->base.flags & PGJIT_OPT3)
- {
- /* TODO: Unscientifically determined threshold */
- LLVMPassManagerBuilderUseInlinerWithThreshold(llvm_pmb, 512);
- }
- else
+ if (!(context->base.flags & PGJIT_OPT3))
{
/* we rely on mem2reg heavily, so emit even in the O0 case */
LLVMAddPromoteMemoryToRegisterPass(llvm_fpm);
@@ -611,11 +608,9 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderPopulateModulePassManager(llvm_pmb,
llvm_mpm);
/* always use always-inliner pass */
- if (!(context->base.flags & PGJIT_OPT3))
- LLVMAddAlwaysInlinerPass(llvm_mpm);
- /* if doing inlining, but no expensive optimization, add inlining pass */
- if (context->base.flags & PGJIT_INLINE
- && !(context->base.flags & PGJIT_OPT3))
+ LLVMAddAlwaysInlinerPass(llvm_mpm);
+ /* if doing inlining, add inlining pass */
+ if (context->base.flags & PGJIT_INLINE)
LLVMAddFunctionInliningPass(llvm_mpm);
LLVMRunPassManager(llvm_mpm, context->module);
LLVMDisposePassManager(llvm_mpm);
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index 5a77f7d4a0..a0c29a76c8 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -26,6 +26,9 @@
typedef struct JitInstrumentation
{
+ /* number of emitted modules */
+ size_t created_modules;
+
/* number of emitted functions */
size_t created_functions;
--
2.25.1
v2-0002-generate-JIT-IR-code-lazily.patch (text/x-patch)
From e9ce1a0f066fd7d1bd9e5a66863dcb79e67c552f Mon Sep 17 00:00:00 2001
From: Luc Vlaming <luc@swarm64.com>
Date: Mon, 28 Dec 2020 09:01:32 +0100
Subject: [PATCH v2 2/2] generate JIT IR code lazily
---
src/backend/jit/llvm/llvmjit_expr.c | 96 +++++++++++++++++------------
1 file changed, 57 insertions(+), 39 deletions(-)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 3aa08a9743..7483803d4a 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -52,6 +52,7 @@ typedef struct CompiledExprState
} CompiledExprState;
+static Datum ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static Datum ExecRunCompiledExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static LLVMValueRef BuildV1Call(LLVMJitContext *context, LLVMBuilderRef b,
@@ -70,18 +71,64 @@ static LLVMValueRef create_LifetimeEnd(LLVMModuleRef mod);
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
((LLVMValueRef[]){__VA_ARGS__}))
-
/*
- * JIT compile expression.
+ * Prepare the JIT compile expression.
*/
bool
llvm_compile_expr(ExprState *state)
{
PlanState *parent = state->parent;
- char *funcname;
-
LLVMJitContext *context = NULL;
+
+ /*
+ * Right now we don't support compiling expressions without a parent, as
+ * we need access to the EState.
+ */
+ Assert(parent);
+
+ llvm_enter_fatal_on_oom();
+
+ /* get or create JIT context */
+ if (parent->state->es_jit)
+ context = (LLVMJitContext *) parent->state->es_jit;
+ else
+ {
+ context = llvm_create_context(parent->state->es_jit_flags);
+ parent->state->es_jit = &context->base;
+ }
+
+ /*
+ * Don't immediately emit nor actually generate the function.
+ * Instead do so the first time the expression is actually evaluated.
+ * This helps with not compiling functions that will never be evaluated,
+ * as can be the case if e.g. a parallel append node is distributing
+ * workers between its child nodes.
+ */
+ {
+
+ CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
+
+ cstate->context = context;
+
+ state->evalfunc = ExecCompileExpr;
+ state->evalfunc_private = cstate;
+ }
+
+ llvm_leave_fatal_on_oom();
+
+ return true;
+}
+
+/*
+ * JIT compile expression.
+ */
+static Datum
+ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull)
+{
+ CompiledExprState *cstate = state->evalfunc_private;
+ LLVMJitContext *context = cstate->context;
+
LLVMBuilderRef b;
LLVMModuleRef mod;
LLVMValueRef eval_fn;
@@ -125,31 +172,16 @@ llvm_compile_expr(ExprState *state)
llvm_enter_fatal_on_oom();
- /*
- * Right now we don't support compiling expressions without a parent, as
- * we need access to the EState.
- */
- Assert(parent);
-
- /* get or create JIT context */
- if (parent->state->es_jit)
- context = (LLVMJitContext *) parent->state->es_jit;
- else
- {
- context = llvm_create_context(parent->state->es_jit_flags);
- parent->state->es_jit = &context->base;
- }
-
INSTR_TIME_SET_CURRENT(starttime);
mod = llvm_mutable_module(context);
b = LLVMCreateBuilder();
- funcname = llvm_expand_funcname(context, "evalexpr");
+ cstate->funcname = llvm_expand_funcname(context, "evalexpr");
/* create function */
- eval_fn = LLVMAddFunction(mod, funcname,
+ eval_fn = LLVMAddFunction(mod, cstate->funcname,
llvm_pg_var_func_type("TypeExprStateEvalFunc"));
LLVMSetLinkage(eval_fn, LLVMExternalLinkage);
LLVMSetVisibility(eval_fn, LLVMDefaultVisibility);
@@ -2350,30 +2382,16 @@ llvm_compile_expr(ExprState *state)
LLVMDisposeBuilder(b);
- /*
- * Don't immediately emit function, instead do so the first time the
- * expression is actually evaluated. That allows to emit a lot of
- * functions together, avoiding a lot of repeated llvm and memory
- * remapping overhead.
- */
- {
-
- CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
-
- cstate->context = context;
- cstate->funcname = funcname;
-
- state->evalfunc = ExecRunCompiledExpr;
- state->evalfunc_private = cstate;
- }
-
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(context->base.instr.generation_counter,
endtime, starttime);
- return true;
+ /* remove indirection via this function for future calls */
+ state->evalfunc = ExecRunCompiledExpr;
+
+ return ExecRunCompiledExpr(state, econtext, isNull);
}
/*
--
2.25.1
On 30-12-2020 14:23, Luc Vlaming wrote:
Hi,
Did some TPC-H testing today on a 100 GB data set to check for regressions there.
Results (query / HEAD / patched / speedup):
1 9.49 9.25 1.03
3 11.87 11.65 1.02
4 23.74 21.24 1.12
5 11.66 11.07 1.05
6 7.82 7.72 1.01
7 12.1 11.23 1.08
8 12.99 11.2 1.16
9 71.2 68.05 1.05
10 17.72 17.31 1.02
11 4.75 4.16 1.14
12 10.47 10.27 1.02
13 38.23 38.71 0.99
14 8.69 8.5 1.02
15 12.63 12.6 1.00
19 8.56 8.37 1.02
22 10.34 9.25 1.12
Cheers,
Luc
Hi everyone, Andres,
I would like this topic to somehow progress and was wondering what other
benchmarks / tests would be needed to make progress here. I've so far
provided benchmarks for small(ish) queries and some TPC-H numbers,
assuming those would be enough.
Kind regards,
Luc
On 18-01-2021 08:47, Luc Vlaming wrote:
Hi everyone, Andres,
On 03-01-2021 11:05, Luc Vlaming wrote:
On 30-12-2020 14:23, Luc Vlaming wrote:
On 30-12-2020 02:57, Andres Freund wrote:
Hi,
Great to see work in this area!
I would like this topic to somehow progress and was wondering what other
benchmarks / tests would be needed to have some progress? I've so far
provided benchmarks for small(ish) queries and some tpch numbers,
assuming those would be enough.On 2020-12-28 09:44:26 +0100, Luc Vlaming wrote:
I would like to propose a small patch to the JIT machinery which
makes the
IR code generation lazy. The reason for postponing the generation
of the IR
code is that with partitions we get an explosion in the number of JIT
functions generated as many child tables are involved, each with
their own
JITted functions, especially when e.g. partition-aware
joins/aggregates are
enabled. However, only a fraction of those functions is actually
executed
because the Parallel Append node distributes the workers among the
nodes.
With the attached patch we get a lazy generation which makes that
this is no
longer a problem.I unfortunately don't think this is quite good enough, because it'll
lead to emitting all functions separately, which can also lead to very
substantial increases of the required time (as emitting code is an
expensive step). Obviously that is only relevant in the cases where the
generated functions actually end up being used - which isn't the
case in
your example.If you e.g. look at a query like
SELECT blub, count(*),sum(zap) FROM foo WHERE blarg = 3 GROUP BY
blub;
on a table without indexes, you would end up with functions for- WHERE clause (including deforming)
- projection (including deforming)
- grouping key
- aggregate transition
- aggregate result projectionwith your patch each of these would be emitted separately, instead of
one go. Which IIRC increases the required time by a significant amount,
especially if inlining is done (where each separate code generation
ends
up with copies of the inlined code).As far as I can see you've basically falsified the second part of this
comment (which you moved):+ + /* + * Don't immediately emit nor actually generate the function. + * instead do so the first time the expression is actually evaluated. + * That allows to emit a lot of functions together, avoiding a lot of + * repeated llvm and memory remapping overhead. It also helps with not + * compiling functions that will never be evaluated, as can be the case + * if e.g. a parallel append node is distributing workers between its + * child nodes. + */- /*
- * Don't immediately emit function, instead do so the first
time the
- * expression is actually evaluated. That allows to emit a lot of
- * functions together, avoiding a lot of repeated llvm and memory
- * remapping overhead.
- */Greetings,
Andres Freund
Hi,
Happy to help out, and thanks for the info and suggestions.
Also, I should have first searched psql-hackers and the like, as I
just found out there is already discussions about this in [1] and [2].
However I think the approach I took can be taken independently and
then other solutions could be added on top.Assuming I understood all suggestions correctly, the ideas so far are:
1. add a LLVMAddMergeFunctionsPass so that duplicate code is removed
and not optimized several times (see [1]). Requires all code to be
emitted in the same module.
2. JIT only parts of the plan, based on cost (see [2]).
3. Cache compilation results to avoid recompilation. this would
either need a shm capable optimized IR cache or would not work with
parallel workers.
4. Lazily jitting (this patch)An idea that might not have been presented in the mailing list yet(?):
5. Only JIT in nodes that process a certain amount of rows. Assuming
there is a constant overhead for JITting and the goal is to gain
runtime.Going forward I would first try to see if my current approach can
work out. The only idea that would be counterproductive to my
solution would be solution 1. Afterwards I'd like to continue with
either solution 2, 5, or 3 in the hopes that we can reduce JIT
overhead to a minimum and can therefore apply it more broadly.To test out why and where the JIT performance decreased with my
solution I improved the test script and added various queries to
model some of the cases I think we should care about. I have not
(yet) done big scale benchmarks as these queries seemed to already
show enough problems for now. Now there are 4 queries which test
JITting with/without partitions, and with varying amounts of workers
and rowcounts. I hope these are indeed a somewhat representative set
of queries.As pointed out the current patch does create a degradation in
performance wrt queries that are not partitioned (basically q3 and
q4). After looking into those queries I noticed two things:
- q3 is very noisy wrt JIT timings. This seems to be the result of
something wrt parallel workers starting up the JITting and creating
very high amounts of noise (e.g. inlining timings varying between
3.8s and 6.2s)
- q4 seems very stable with JIT timings (after the first run).
I'm wondering if this could mean that with parallel workers quite a
lot of time is spent on startup of the llvm machinery and this gets
noisy because of OS interaction and the like?Either way I took q4 to try and fix the regression and noticed
something interesting, given the comment from Andres: the generation
and inlining time actually decreased, but the optimization and
emission time increased. After trying out various things in the
llvm_optimize_module function and googling a bit it seems that the
LLVMPassManagerBuilderUseInlinerWithThreshold adds some very
expensive passes. I tried to construct some queries where this would
actually gain us but couldnt (yet).For v2 of the patch-set the first patch slightly changes how we
optimize the code, which removes the aforementioned degradations in
the queries. The second patch then makes that partitions work a lot
better, but interestingly now also q4 gets a lot faster but somehow
q3 does not.Because these findings contradict the suggestions/findings from
Andres I'm wondering what I'm missing. I would continue and do some
TPC-H like tests on top, but apart from that I'm not entirely sure
where we are supposed to gain most from the call to
LLVMPassManagerBuilderUseInlinerWithThreshold(). Reason is that from
the scenarios I now tested it seems that the pain is actually in the
code optimization and possibly rather specific passes and not
necessarily in how many modules are emitted.If there are more / better queries / datasets / statistics I can run
and gather I would be glad to do so :) To me the current results seem
however fairly promising.Looking forward to your thoughts & suggestions.
With regards,
Luc
Swarm64===================================
Results from the test script on my machine:parameters: jit=on workers=5 jit-inline=0 jit-optimize=0
query1: HEAD - 08.088901 #runs=5 #JIT=12014
query1: HEAD+01 - 06.369646 #runs=5 #JIT=12014
query1: HEAD+01+02 - 01.248596 #runs=5 #JIT=1044
query2: HEAD - 17.628126 #runs=5 #JIT=24074
query2: HEAD+01 - 10.786114 #runs=5 #JIT=24074
query2: HEAD+01+02 - 01.262084 #runs=5 #JIT=1083
query3: HEAD - 00.220141 #runs=5 #JIT=29
query3: HEAD+01 - 00.210917 #runs=5 #JIT=29
query3: HEAD+01+02 - 00.229575 #runs=5 #JIT=25
query4: HEAD - 00.052305 #runs=100 #JIT=10
query4: HEAD+01 - 00.038319 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.018533 #runs=100 #JIT=3parameters: jit=on workers=50 jit-inline=0 jit-optimize=0
query1: HEAD - 14.922044 #runs=5 #JIT=102104
query1: HEAD+01 - 11.356347 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.641409 #runs=5 #JIT=1241
query2: HEAD - 18.477133 #runs=5 #JIT=40122
query2: HEAD+01 - 11.028579 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.872588 #runs=5 #JIT=1087
query3: HEAD - 00.235587 #runs=5 #JIT=209
query3: HEAD+01 - 00.219597 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.233975 #runs=5 #JIT=127
query4: HEAD - 00.052534 #runs=100 #JIT=10
query4: HEAD+01 - 00.038881 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.018268 #runs=100 #JIT=3parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=0
query1: HEAD - 12.696588 #runs=5 #JIT=102104
query1: HEAD+01 - 12.279387 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.512643 #runs=5 #JIT=1211
query2: HEAD - 12.091824 #runs=5 #JIT=40122
query2: HEAD+01 - 11.543042 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.774382 #runs=5 #JIT=1088
query3: HEAD - 00.122208 #runs=5 #JIT=209
query3: HEAD+01 - 00.114153 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.139906 #runs=5 #JIT=131
query4: HEAD - 00.033125 #runs=100 #JIT=10
query4: HEAD+01 - 00.029818 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.015099 #runs=100 #JIT=3parameters: jit=on workers=50 jit-inline=0 jit-optimize=1e+06
query1: HEAD - 02.760343 #runs=5 #JIT=102104
query1: HEAD+01 - 02.742944 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.460169 #runs=5 #JIT=1292
query2: HEAD - 02.396965 #runs=5 #JIT=40122
query2: HEAD+01 - 02.394724 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.425303 #runs=5 #JIT=1089
query3: HEAD - 00.186633 #runs=5 #JIT=209
query3: HEAD+01 - 00.189623 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.193272 #runs=5 #JIT=125
query4: HEAD - 00.013277 #runs=100 #JIT=10
query4: HEAD+01 - 00.012078 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.004846 #runs=100 #JIT=3parameters: jit=on workers=50 jit-inline=1e+06 jit-optimize=1e+06
query1: HEAD - 02.339973 #runs=5 #JIT=102104
query1: HEAD+01 - 02.333525 #runs=5 #JIT=102104
query1: HEAD+01+02 - 00.342824 #runs=5 #JIT=1243
query2: HEAD - 02.268987 #runs=5 #JIT=40122
query2: HEAD+01 - 02.248729 #runs=5 #JIT=40122
query2: HEAD+01+02 - 00.306829 #runs=5 #JIT=1088
query3: HEAD - 00.084531 #runs=5 #JIT=209
query3: HEAD+01 - 00.091616 #runs=5 #JIT=209
query3: HEAD+01+02 - 00.08668 #runs=5 #JIT=127
query4: HEAD - 00.005371 #runs=100 #JIT=10
query4: HEAD+01 - 00.0053 #runs=100 #JIT=10
query4: HEAD+01+02 - 00.002422 #runs=100 #JIT=3===================================
[1]
/messages/by-id/7736C40E-6DB5-4E7A-8FE3-4B2AB8E22793@elevated-dev.com[2]
/messages/by-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C+ksKFpSdZg=q6sTbtQ-v=aw@mail.gmail.comHi,
Did some TPCH testing today on a TPCH 100G to see regressions there.
Results (query/HEAD/patched/speedup)
1 9.49 9.25 1.03
3 11.87 11.65 1.02
4 23.74 21.24 1.12
5 11.66 11.07 1.05
6 7.82 7.72 1.01
7 12.1 11.23 1.08
8 12.99 11.2 1.16
9 71.2 68.05 1.05
10 17.72 17.31 1.02
11 4.75 4.16 1.14
12 10.47 10.27 1.02
13 38.23 38.71 0.99
14 8.69 8.5 1.02
15 12.63 12.6 1.00
19 8.56 8.37 1.02
22 10.34 9.25 1.12

Cheers,
Luc

Kind regards,
Luc
Providing a new set of rebased patches with a better description in the
hopes this helps reviewability. Also registering this to the next CF to
increase visibility.
Regards,
Luc
Attachments:
v3-0002-Do-not-generate-the-IR-code-ahead-of-time-but-laz.patchtext/x-patch; charset=UTF-8; name=v3-0002-Do-not-generate-the-IR-code-ahead-of-time-but-laz.patchDownload
From 98b793bf4c80802e8da79626febc3c4dc567a844 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <luc@swarm64.com>
Date: Mon, 12 Apr 2021 11:10:58 +0200
Subject: [PATCH v3 2/2] Do not generate the IR code ahead of time, but lazy
upon first call.
Now we combine the IR generation, optimization and emission passes
all in the same time, which seems to gain us in almost all cases,
also (very) small queries with JIT enabled.
For plans where functions might actually not be called at all we actually
gain several factors runtime. This happens e.g. in plans with parallel append
nodes as there the workers get distributed and are unlikely to participate
in all subplans. Most common case for this is usage of queries
on partitioned tables combined with JIT.
---
src/backend/jit/llvm/llvmjit_expr.c | 96 +++++++++++++++++------------
1 file changed, 57 insertions(+), 39 deletions(-)
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0f9cc790c7..74c31051cd 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -52,6 +52,7 @@ typedef struct CompiledExprState
} CompiledExprState;
+static Datum ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static Datum ExecRunCompiledExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static LLVMValueRef BuildV1Call(LLVMJitContext *context, LLVMBuilderRef b,
@@ -70,18 +71,64 @@ static LLVMValueRef create_LifetimeEnd(LLVMModuleRef mod);
lengthof(((LLVMValueRef[]){__VA_ARGS__})), \
((LLVMValueRef[]){__VA_ARGS__}))
-
/*
- * JIT compile expression.
+ * Prepare the JIT compile expression.
*/
bool
llvm_compile_expr(ExprState *state)
{
PlanState *parent = state->parent;
- char *funcname;
-
LLVMJitContext *context = NULL;
+
+ /*
+ * Right now we don't support compiling expressions without a parent, as
+ * we need access to the EState.
+ */
+ Assert(parent);
+
+ llvm_enter_fatal_on_oom();
+
+ /* get or create JIT context */
+ if (parent->state->es_jit)
+ context = (LLVMJitContext *) parent->state->es_jit;
+ else
+ {
+ context = llvm_create_context(parent->state->es_jit_flags);
+ parent->state->es_jit = &context->base;
+ }
+
+ /*
+ * Don't immediately emit nor actually generate the function.
+ * Instead do so the first time the expression is actually evaluated.
+ * This helps with not compiling functions that will never be evaluated,
+ * as can be the case if e.g. a parallel append node is distributing
+ * workers between its child nodes.
+ */
+ {
+
+ CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
+
+ cstate->context = context;
+
+ state->evalfunc = ExecCompileExpr;
+ state->evalfunc_private = cstate;
+ }
+
+ llvm_leave_fatal_on_oom();
+
+ return true;
+}
+
+/*
+ * JIT compile expression.
+ */
+static Datum
+ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull)
+{
+ CompiledExprState *cstate = state->evalfunc_private;
+ LLVMJitContext *context = cstate->context;
+
LLVMBuilderRef b;
LLVMModuleRef mod;
LLVMValueRef eval_fn;
@@ -125,31 +172,16 @@ llvm_compile_expr(ExprState *state)
llvm_enter_fatal_on_oom();
- /*
- * Right now we don't support compiling expressions without a parent, as
- * we need access to the EState.
- */
- Assert(parent);
-
- /* get or create JIT context */
- if (parent->state->es_jit)
- context = (LLVMJitContext *) parent->state->es_jit;
- else
- {
- context = llvm_create_context(parent->state->es_jit_flags);
- parent->state->es_jit = &context->base;
- }
-
INSTR_TIME_SET_CURRENT(starttime);
mod = llvm_mutable_module(context);
b = LLVMCreateBuilder();
- funcname = llvm_expand_funcname(context, "evalexpr");
+ cstate->funcname = llvm_expand_funcname(context, "evalexpr");
/* create function */
- eval_fn = LLVMAddFunction(mod, funcname,
+ eval_fn = LLVMAddFunction(mod, cstate->funcname,
llvm_pg_var_func_type("TypeExprStateEvalFunc"));
LLVMSetLinkage(eval_fn, LLVMExternalLinkage);
LLVMSetVisibility(eval_fn, LLVMDefaultVisibility);
@@ -2356,30 +2388,16 @@ llvm_compile_expr(ExprState *state)
LLVMDisposeBuilder(b);
- /*
- * Don't immediately emit function, instead do so the first time the
- * expression is actually evaluated. That allows to emit a lot of
- * functions together, avoiding a lot of repeated llvm and memory
- * remapping overhead.
- */
- {
-
- CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
-
- cstate->context = context;
- cstate->funcname = funcname;
-
- state->evalfunc = ExecRunCompiledExpr;
- state->evalfunc_private = cstate;
- }
-
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(context->base.instr.generation_counter,
endtime, starttime);
- return true;
+ /* remove indirection via this function for future calls */
+ state->evalfunc = ExecRunCompiledExpr;
+
+ return ExecRunCompiledExpr(state, econtext, isNull);
}
/*
--
2.25.1
v3-0001-Improve-jitting-performance-by-not-emitting-the.patchtext/x-patch; charset=UTF-8; name=v3-0001-Improve-jitting-performance-by-not-emitting-the.patchDownload
From 73eba2697046eeb3984f144555bb0b32e13162b7 Mon Sep 17 00:00:00 2001
From: Luc Vlaming <luc@swarm64.com>
Date: Mon, 12 Apr 2021 11:05:05 +0200
Subject: [PATCH v3 1/2] Improve jitting performance by not emitting the
LLVMPassManagerBuilderUseInlinerWithThreshold pass.
This pass contains some very expensive parts which in experiments so
far do not gain us much compared to the runtime we gain afterwards
when running queries. Instead now emit the simpler inliner passes
also for PGJIT_OPT3.
To monitor how many modules are now created an extra statistic is
added so we can test effectively how much performance is
gained / lost because of this.
---
src/backend/commands/explain.c | 1 +
src/backend/jit/llvm/llvmjit.c | 17 ++++++-----------
src/include/jit/jit.h | 3 +++
3 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b62a76e7e5..ac97e9b44b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -894,6 +894,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
appendStringInfoString(es->str, "JIT:\n");
es->indent++;
+ ExplainPropertyInteger("Modules", NULL, ji->created_modules, es);
ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);
ExplainIndentText(es);
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 98a27f08bf..18797e696b 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -241,6 +241,8 @@ llvm_mutable_module(LLVMJitContext *context)
context->module = LLVMModuleCreateWithName("pg");
LLVMSetTarget(context->module, llvm_triple);
LLVMSetDataLayout(context->module, llvm_layout);
+
+ context->base.instr.created_modules++;
}
return context->module;
@@ -578,12 +580,7 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderSetOptLevel(llvm_pmb, compile_optlevel);
llvm_fpm = LLVMCreateFunctionPassManagerForModule(module);
- if (context->base.flags & PGJIT_OPT3)
- {
- /* TODO: Unscientifically determined threshold */
- LLVMPassManagerBuilderUseInlinerWithThreshold(llvm_pmb, 512);
- }
- else
+ if (!(context->base.flags & PGJIT_OPT3))
{
/* we rely on mem2reg heavily, so emit even in the O0 case */
LLVMAddPromoteMemoryToRegisterPass(llvm_fpm);
@@ -611,11 +608,9 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderPopulateModulePassManager(llvm_pmb,
llvm_mpm);
/* always use always-inliner pass */
- if (!(context->base.flags & PGJIT_OPT3))
- LLVMAddAlwaysInlinerPass(llvm_mpm);
- /* if doing inlining, but no expensive optimization, add inlining pass */
- if (context->base.flags & PGJIT_INLINE
- && !(context->base.flags & PGJIT_OPT3))
+ LLVMAddAlwaysInlinerPass(llvm_mpm);
+ /* if doing inlining, add inlining pass */
+ if (context->base.flags & PGJIT_INLINE)
LLVMAddFunctionInliningPass(llvm_mpm);
LLVMRunPassManager(llvm_mpm, context->module);
LLVMDisposePassManager(llvm_mpm);
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index b634df30b9..7a080074c0 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -26,6 +26,9 @@
typedef struct JitInstrumentation
{
+ /* number of emitted modules */
+ size_t created_modules;
+
/* number of emitted functions */
size_t created_functions;
--
2.25.1
Hi hackers,
I picked this up and had a closer look in which way the total JIT time
depends on the number of modules to be jitted.
Indeed, the total JIT time increases the more modules are used. The reason
for this to happen is that the inlining pass loads and deserializes all to
be inlined modules (.bc files) from disk prior to inlining them via
llvm::IRMover. There's already a cache for such modules in the code, but it
is currently unused. This is because llvm::IRMover takes the module to be
inlined as std::unique_ptr<llvm::Module>. The by-value argument requires
the source module to be moved, which means it cannot be reused afterwards.
The code is accounting for that by erasing the module from the cache after
inlining it, which in turn requires reloading the module next time a
reference to it is encountered.
Instead of each time loading and deserializing all to be inlined modules
from disk, they can reside in the cache and instead be cloned via
llvm::CloneModule() before they get inlined. Key to calling
llvm::CloneModule() is fully deserializing the module upfront, instead of
loading the module lazily. That is why I changed the call from
LLVMGetBitcodeModuleInContext2() (which lazily loads the module via
llvm::getOwningLazyBitcodeModule()) to LLVMParseBitcodeInContext2() (which
fully loads the module via llvm::parseBitcodeFile()). Beyond that, it seems
that prior to LLVM 13 cloning modules could fail with an assertion (not sure
though if that would cause problems in a release build without assertions).
Andres reported this problem back in the days here [1]. In the meanwhile the
issue got discussed in [2] and finally fixed for LLVM 13, see [3].
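To make the cache-then-clone idea easier to follow, here is a rough sketch
written against the LLVM C API (the actual patch works on llvm::Module and
llvm::IRMover inside llvmjit_inline.cpp, so this is only an illustration of
the caching idea); the toy single-entry cache, the function name
get_module_for_inlining() and the omitted buffer/error handling are all
made up for the sketch:

#include <string.h>
#include <llvm-c/Core.h>
#include <llvm-c/BitReader.h>

/* toy single-entry cache: bitcode path -> fully parsed module */
static char cached_path[1024];
static LLVMModuleRef cached_mod;

static LLVMModuleRef
get_module_for_inlining(const char *path)
{
	if (cached_mod == NULL || strcmp(cached_path, path) != 0)
	{
		LLVMMemoryBufferRef buf;
		char	   *msg;

		/* NB: buffer disposal, error reporting and eviction of a previously
		 * cached module are all elided in this sketch */
		if (LLVMCreateMemoryBufferWithContentsOfFile(path, &buf, &msg))
			return NULL;

		/*
		 * Parse the bitcode eagerly (LLVMParseBitcodeInContext2) instead of
		 * lazily (LLVMGetBitcodeModuleInContext2), so that the resulting
		 * module can be cloned later on.
		 */
		if (LLVMParseBitcodeInContext2(LLVMGetGlobalContext(), buf, &cached_mod))
			return NULL;

		strncpy(cached_path, path, sizeof(cached_path) - 1);
	}

	/* hand a copy to the inliner; the cached original stays reusable */
	return LLVMCloneModule(cached_mod);
}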
Lazily jitting, typically, results in the generation of half as many
modules as there are functions. The IR for tuple deforming and expression
evaluation is always created together in ExecCompileExpr(), before the
jitted program gets used for the first time. However, if e.g. no rows get
processed or a plan node is never executed, with the new version the
corresponding JIT program will never be created and the function / module
count can be lower.
In the following you can see some performance numbers and the reproduction
steps. Everything is based on REL_14_2. The query I built is based on
Andres' initial suggestion and results in 53 functions. With the patch,
instead of a single module, 27 modules are created. Arbitrarily more
modules and functions can be generated by adding more UNIONs to the query
below. I also left out the first invocation in every test, because during
the first invocation the .bc files to be inlined still need to be loaded,
which is (1) slower now because the bitcode file is parsed completely and
(2) needs to be done only once as it now gets cached.
SET jit_above_cost = 0;
SET max_parallel_workers_per_gather = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
CREATE TABLE foo (blub INT, zap INT, blarg INT);
INSERT INTO foo SELECT generate_series(1, 1000000), 1, 2;
ANALYZE foo;
EXPLAIN (ANALYZE, VERBOSE)
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg)
UNION
(SELECT blarg, count(*), sum(zap) FROM foo WHERE blub = 2 GROUP BY blarg);
Baseline (1 module)
-------------------
Timing: Generation 0.432 ms, Inlining 4.705 ms, Optimization 29.662 ms,
Emission 21.300 ms, Total 56.098 ms (1 module, 9 functions)
Timing: Generation 2.042 ms, Inlining 4.014 ms, Optimization 164.745 ms,
Emission 120.334 ms, Total 291.135 ms (1 module, 59 functions)
CloneModule()
-------------
Timing: Generation 0.454 ms, Inlining 0.077 ms, Optimization 14.441 ms,
Emission 12.234 ms, Total 27.206 ms (1 module, 9 functions)
Timing: Generation 2.061 ms, Inlining 0.193 ms, Optimization 95.627 ms,
Emission 77.652 ms, Total 175.533 ms (1 module, 59 functions)
CloneModule(), lazy jitting (> 1 modules)
-----------------------------------------
Timing: Generation 0.457 ms, Inlining 0.154 ms, Optimization 15.972 ms,
Emission 13.152 ms, Total 29.735 ms ( 4 modules, 8 functions)
Timing: Generation 2.715 ms, Inlining 0.974 ms, Optimization 105.122 ms,
Emission 86.776 ms, Total 195.587 ms (27 modules, 53 functions)
CloneModule(), lazy jitting, optimization settings
--------------------------------------------------
Timing: Generation 0.758 ms, Inlining 0.153 ms, Optimization 13.137 ms,
Emission 13.805 ms, Total 27.854 ms ( 4 modules, 8 functions)
Timing: Generation 2.712 ms, Inlining 1.133 ms, Optimization 92.661 ms,
Emission 84.538 ms, Total 181.043 ms (27 modules, 53 functions)
Cloning the module instead of reloading it brings down inlining time, even
for a single module. Obviously, it pays off the more modules are used.
However, curiously the time spent on optimizing is also reduced (95ms
instead of 164ms). Could this be because some of the applied optimizations
are ending up in the cached module? With more modules the total time spent
on jitting seems to increase slightly (175ms vs 195ms). Almost all of that
time is spent on optimization and emission. Time for inlining only
increases from ~0.2ms to ~1.1ms. With improved optimization settings the
final time spent is pretty much on par with the single module variant
(175ms vs 181ms). So overall it looks like using many modules this way has
competitive performance in the worst case (all functions used), but much
better performance in cases where a significant fraction of the functions
remains unused.
@Andres: could you provide me with the queries that caused the assertion
failure in LLVM? Have you ever observed a segfault with a
non-assert-enabled build? I just want to make sure this is truly fixed in
LLVM 13. Running 'make check-world' all tests passed.
The attached patch currently does not yet have a case distinction for LLVM
13. But that would be straightforward to add.
Thanks for your consideration.
--
David Geier
(ServiceNow)
[1]: https://lists.llvm.org/pipermail/llvm-dev/2018-March/122111.html
[2]: https://reviews.llvm.org/D96531
[3]: https://reviews.llvm.org/rG22a52dfddcefad4f275eb8ad1cc0e200074c2d8a
On Fri, Jun 24, 2022 at 2:13 PM Luc Vlaming <luc@swarm64.com> wrote:
Attachments:
0001-Lazily-JIT.patchtext/x-patch; charset=US-ASCII; name=0001-Lazily-JIT.patchDownload
From 3735861eb120587105676c4df5818d16bac2ad2e Mon Sep 17 00:00:00 2001
From: David Geier <david.geier@servicenow.com>
Date: Mon, 27 Jun 2022 12:28:29 +0200
Subject: [PATCH] Lazily JIT
---
src/backend/commands/explain.c | 1 +
src/backend/jit/jit.c | 1 +
src/backend/jit/llvm/llvmjit.c | 16 ++---
src/backend/jit/llvm/llvmjit_expr.c | 95 +++++++++++++++----------
src/backend/jit/llvm/llvmjit_inline.cpp | 14 ++--
src/include/jit/jit.h | 3 +
6 files changed, 74 insertions(+), 56 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 81a227d8b8..2cb201c093 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -890,6 +890,7 @@ ExplainPrintJIT(ExplainState *es, int jit_flags, JitInstrumentation *ji)
appendStringInfoString(es->str, "JIT:\n");
es->indent++;
+ ExplainPropertyInteger("Modules", NULL, ji->created_modules, es);
ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);
ExplainIndentText(es);
diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index 2da300e000..b574357125 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -183,6 +183,7 @@ jit_compile_expr(struct ExprState *state)
void
InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add)
{
+ dst->created_modules += add->created_modules;
dst->created_functions += add->created_functions;
INSTR_TIME_ADD(dst->generation_counter, add->generation_counter);
INSTR_TIME_ADD(dst->inlining_counter, add->inlining_counter);
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index fb29449573..3e3340992d 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -236,6 +236,7 @@ llvm_mutable_module(LLVMJitContext *context)
*/
if (!context->module)
{
+ context->base.instr.created_modules++;
context->compiled = false;
context->module_generation = llvm_generation++;
context->module = LLVMModuleCreateWithName("pg");
@@ -578,12 +579,7 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderSetOptLevel(llvm_pmb, compile_optlevel);
llvm_fpm = LLVMCreateFunctionPassManagerForModule(module);
- if (context->base.flags & PGJIT_OPT3)
- {
- /* TODO: Unscientifically determined threshold */
- LLVMPassManagerBuilderUseInlinerWithThreshold(llvm_pmb, 512);
- }
- else
+ if (!(context->base.flags & PGJIT_OPT3))
{
/* we rely on mem2reg heavily, so emit even in the O0 case */
LLVMAddPromoteMemoryToRegisterPass(llvm_fpm);
@@ -611,11 +607,9 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderPopulateModulePassManager(llvm_pmb,
llvm_mpm);
/* always use always-inliner pass */
- if (!(context->base.flags & PGJIT_OPT3))
- LLVMAddAlwaysInlinerPass(llvm_mpm);
- /* if doing inlining, but no expensive optimization, add inlining pass */
- if (context->base.flags & PGJIT_INLINE
- && !(context->base.flags & PGJIT_OPT3))
+ LLVMAddAlwaysInlinerPass(llvm_mpm);
+ /* if doing inlining, add inlining pass */
+ if (context->base.flags & PGJIT_INLINE)
LLVMAddFunctionInliningPass(llvm_mpm);
LLVMRunPassManager(llvm_mpm, context->module);
LLVMDisposePassManager(llvm_mpm);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 8a4075bdaf..df27c0dd4b 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -52,6 +52,7 @@ typedef struct CompiledExprState
} CompiledExprState;
+static Datum ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static Datum ExecRunCompiledExpr(ExprState *state, ExprContext *econtext, bool *isNull);
static LLVMValueRef BuildV1Call(LLVMJitContext *context, LLVMBuilderRef b,
@@ -72,16 +73,63 @@ static LLVMValueRef create_LifetimeEnd(LLVMModuleRef mod);
/*
- * JIT compile expression.
+ * Prepare the JIT compile expression.
*/
bool
llvm_compile_expr(ExprState *state)
{
PlanState *parent = state->parent;
- char *funcname;
-
LLVMJitContext *context = NULL;
+
+ /*
+ * Right now we don't support compiling expressions without a parent, as
+ * we need access to the EState.
+ */
+ Assert(parent);
+
+ llvm_enter_fatal_on_oom();
+
+ /* get or create JIT context */
+ if (parent->state->es_jit)
+ context = (LLVMJitContext *) parent->state->es_jit;
+ else
+ {
+ context = llvm_create_context(parent->state->es_jit_flags);
+ parent->state->es_jit = &context->base;
+ }
+
+ /*
+ * Don't immediately emit nor actually generate the function.
+ * Instead do so the first time the expression is actually evaluated.
+ * This helps with not compiling functions that will never be evaluated,
+ * as can be the case if e.g. a parallel append node is distributing
+ * workers between its child nodes.
+ */
+ {
+
+ CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
+
+ cstate->context = context;
+
+ state->evalfunc = ExecCompileExpr;
+ state->evalfunc_private = cstate;
+ }
+
+ llvm_leave_fatal_on_oom();
+
+ return true;
+}
+
+/*
+ * JIT compile expression.
+ */
+static Datum
+ExecCompileExpr(ExprState *state, ExprContext *econtext, bool *isNull)
+{
+ CompiledExprState *cstate = state->evalfunc_private;
+ LLVMJitContext *context = cstate->context;
+
LLVMBuilderRef b;
LLVMModuleRef mod;
LLVMValueRef eval_fn;
@@ -125,31 +173,16 @@ llvm_compile_expr(ExprState *state)
llvm_enter_fatal_on_oom();
- /*
- * Right now we don't support compiling expressions without a parent, as
- * we need access to the EState.
- */
- Assert(parent);
-
- /* get or create JIT context */
- if (parent->state->es_jit)
- context = (LLVMJitContext *) parent->state->es_jit;
- else
- {
- context = llvm_create_context(parent->state->es_jit_flags);
- parent->state->es_jit = &context->base;
- }
-
INSTR_TIME_SET_CURRENT(starttime);
mod = llvm_mutable_module(context);
b = LLVMCreateBuilder();
- funcname = llvm_expand_funcname(context, "evalexpr");
+ cstate->funcname = llvm_expand_funcname(context, "evalexpr");
/* create function */
- eval_fn = LLVMAddFunction(mod, funcname,
+ eval_fn = LLVMAddFunction(mod, cstate->funcname,
llvm_pg_var_func_type("TypeExprStateEvalFunc"));
LLVMSetLinkage(eval_fn, LLVMExternalLinkage);
LLVMSetVisibility(eval_fn, LLVMDefaultVisibility);
@@ -2356,30 +2389,16 @@ llvm_compile_expr(ExprState *state)
LLVMDisposeBuilder(b);
- /*
- * Don't immediately emit function, instead do so the first time the
- * expression is actually evaluated. That allows to emit a lot of
- * functions together, avoiding a lot of repeated llvm and memory
- * remapping overhead.
- */
- {
-
- CompiledExprState *cstate = palloc0(sizeof(CompiledExprState));
-
- cstate->context = context;
- cstate->funcname = funcname;
-
- state->evalfunc = ExecRunCompiledExpr;
- state->evalfunc_private = cstate;
- }
-
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(context->base.instr.generation_counter,
endtime, starttime);
- return true;
+ /* remove indirection via this function for future calls */
+ state->evalfunc = ExecRunCompiledExpr;
+
+ return ExecRunCompiledExpr(state, econtext, isNull);
}
/*
diff --git a/src/backend/jit/llvm/llvmjit_inline.cpp b/src/backend/jit/llvm/llvmjit_inline.cpp
index 6f03595db5..43c3f7ddd3 100644
--- a/src/backend/jit/llvm/llvmjit_inline.cpp
+++ b/src/backend/jit/llvm/llvmjit_inline.cpp
@@ -62,6 +62,7 @@ extern "C"
#include <llvm/IR/ModuleSummaryIndex.h>
#include <llvm/Linker/IRMover.h>
#include <llvm/Support/ManagedStatic.h>
+#include <llvm/Transforms/Utils/Cloning.h>
/*
@@ -287,7 +288,6 @@ llvm_build_inline_plan(llvm::Module *mod)
elog(FATAL, "failed to materialize metadata");
Assert(!funcDef->isDeclaration());
- Assert(funcDef->hasExternalLinkage());
llvm::StringSet<> importVars;
llvm::SmallPtrSet<const llvm::Function *, 8> visitedFunctions;
@@ -377,13 +377,13 @@ llvm_execute_inline_plan(llvm::Module *mod, ImportMapTy *globalsToInline)
const llvm::StringSet<>& modGlobalsToInline = toInline.second;
llvm::SetVector<llvm::GlobalValue *> GlobalsToImport;
- Assert(module_cache->count(modPath));
- std::unique_ptr<llvm::Module> importMod(std::move((*module_cache)[modPath]));
- module_cache->erase(modPath);
-
if (modGlobalsToInline.empty())
continue;
+ auto iter = module_cache->find(modPath);
+ Assert(iter != module_cache->end());
+ auto &importMod = iter->second;
+
for (auto &glob: modGlobalsToInline)
{
llvm::StringRef SymbolName = glob.first();
@@ -451,7 +451,7 @@ llvm_execute_inline_plan(llvm::Module *mod, ImportMapTy *globalsToInline)
#else
#define IRMOVE_PARAMS
#endif
- if (Mover.move(std::move(importMod), GlobalsToImport.getArrayRef(),
+ if (Mover.move(llvm::CloneModule(*importMod), GlobalsToImport.getArrayRef(),
[](llvm::GlobalValue &, llvm::IRMover::ValueAdder) {}
IRMOVE_PARAMS))
elog(FATAL, "function import failed with linker error");
@@ -490,7 +490,7 @@ load_module(llvm::StringRef Identifier)
if (LLVMCreateMemoryBufferWithContentsOfFile(path, &buf, &msg))
elog(FATAL, "failed to open bitcode file \"%s\": %s",
path, msg);
- if (LLVMGetBitcodeModuleInContext2(LLVMGetGlobalContext(), buf, &mod))
+ if (LLVMParseBitcodeInContext2(LLVMGetGlobalContext(), buf, &mod))
elog(FATAL, "failed to parse bitcode in file \"%s\"", path);
/*
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index b634df30b9..7a080074c0 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -26,6 +26,9 @@
typedef struct JitInstrumentation
{
+ /* number of emitted modules */
+ size_t created_modules;
+
/* number of emitted functions */
size_t created_functions;
--
2.32.0
On 2021-Jan-18, Luc Vlaming wrote:
I would like this topic to somehow progress and was wondering what other
benchmarks / tests would be needed to have some progress? I've so far
provided benchmarks for small(ish) queries and some tpch numbers, assuming
those would be enough.
Hi, some time ago I reported a case [1] where our JIT implementation does
a very poor job and perhaps the changes that you're making could explain
what is going on, and maybe even fix it:
[1]: /messages/by-id/202111141706.wqq7xoyigwa2@alvherre.pgsql
The query for which I investigated the problem involved some pg_logical
metadata tables, so I didn't post it anywhere public; but the blog post
I found later contains a link to a query that shows the same symptoms,
and which is luckily still available online:
https://gist.github.com/saicitus/251ba20b211e9e73285af35e61b19580
I attach it here in case it goes missing sometime.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Attachments:
Hi Alvaro,
That's a very interesting case and might indeed be fixed or at least
improved by this patch. I tried to reproduce this, but at least when
running a simple, serial query with increasing numbers of functions, the
time spent per function is linear or even slightly sub-linear (same as Tom
observed in [1]).
I also couldn't reproduce the JIT runtimes you shared, when running the
attached catalog query. The catalog query ran serially for me with the
following JIT stats:
JIT:
Functions: 169
Options: Inlining true, Optimization true, Expressions true, Deforming
true
Timing: Generation 12.223 ms, Inlining 17.323 ms, Optimization 388.491
ms, Emission 283.464 ms, Total 701.501 ms
Is it possible that the query ran in parallel for you? For parallel
queries, every worker JITs all of the functions it uses. Even though the
workers might JIT the functions in parallel, the time reported in the
EXPLAIN ANALYZE output is the sum of the time spent by all workers. With
this patch applied, the JIT time drops significantly, as many of the
generated functions remain unused.
JIT:
Modules: 15
Functions: 26
Options: Inlining true, Optimization true, Expressions true, Deforming
true
Timing: Generation 1.931 ms, Inlining 0.722 ms, Optimization 67.195 ms,
Emission 70.347 ms, Total 140.195 ms
Of course, this does not prove that the nonlinearity that you observed went
away. Could you share with me how you ran the query so that I can reproduce
your numbers on master to then compare them with the patched version? Also,
which LLVM version did you run with? I'm currently running with LLVM 13.
Thanks!
--
David Geier
(ServiceNow)
Hi Alvaro, hi David,
Thanks for reviewing this and the interesting examples!
Wanted to give a bit of extra insight as to why I'd love to have a system that can lazily emit JIT code and hence creates roughly a module per function:
In the end I'm hoping that we can migrate to a system where we only JIT after a configurable cost has been exceeded for this node, as well as a configurable amount of rows has actually been processed.
Reason is that this would safeguard against some problematic planning issues wrt JIT (node not being executed, row count being massively off).
It would also allow for more fine-grained control, with a cost system similar to most other planning costs, as they are also per node and not globally, and would potentially allow us to only JIT things where we expect to truly gain any runtime compared to the costs of doing it.
If this means we have to invest more in making it cheap(er) to emit modules, I'm all for that. Kudos to David for fixing the caching in that sense :) @Andres if there are any other things we ought to fix to make this cheap (enough) compared to the previous code I'd love to know your thoughts.
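To make that direction a bit more concrete, here is a rough, purely
illustrative sketch of what a row-count-triggered compile step could look
like, reusing the evalfunc indirection from the patch. DeferredJitState,
the threshold field (think of a hypothetical jit_min_evals-like setting)
and the interpreted fallback pointer are made up, and the glue that hands
evalfunc_private back to the compile step is simplified:

/* sketch only; assumes the usual llvmjit_expr.c includes and the
 * ExecCompileExpr() added by the patch */
typedef struct DeferredJitState
{
	uint64		nevals;			/* evaluations seen so far */
	uint64		threshold;		/* e.g. a hypothetical jit_min_evals GUC */
	ExprStateEvalFunc interp;	/* interpreted evaluator, prepared earlier */
	void	   *compile_private;	/* state expected by ExecCompileExpr() */
} DeferredJitState;

static Datum
ExecEvalDeferred(ExprState *state, ExprContext *econtext, bool *isNull)
{
	DeferredJitState *dstate = (DeferredJitState *) state->evalfunc_private;

	/* keep interpreting until the expression has proven itself hot */
	if (++dstate->nevals < dstate->threshold)
		return dstate->interp(state, econtext, isNull);

	/*
	 * Threshold reached: hand over to the compile step from the patch,
	 * which generates, optimizes and emits the function and rewires
	 * state->evalfunc so later calls go straight to the JITted code.
	 */
	state->evalfunc_private = dstate->compile_private;
	return ExecCompileExpr(state, econtext, isNull);
}

The same trampoline could in principle also check a per-node, cost-derived
flag before compiling, which would be the other half of the idea above.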
Best,
Luc Vlaming
(ServiceNow)
Hi,
On 2022-07-04 06:43:00 +0000, Luc Vlaming Hummel wrote:
Thanks for reviewing this and the interesting examples!
Wanted to give a bit of extra insight as to why I'd love to have a system that can lazily emit JIT code and hence creates roughly a module per function:
In the end I'm hoping that we can migrate to a system where we only JIT after a configurable cost has been exceeded for this node, as well as a configurable amount of rows has actually been processed.
Reason is that this would safeguard against some problematic planning issues
wrt JIT (node not being executed, row count being massively off).
I still don't see how it's viable to move to always doing function-by-function
emission overhead wise.
I also want to go to do JIT in the background and triggered by actual
usage. But to me it seems a dead end to require moving to
one-function-per-module model for that.
If this means we have to invest more in making it cheap(er) to emit modules,
I'm all for that.
I think that's just inherently more expensive and thus a no-go.
@Andres if there's any other things we ought to fix to make this cheap
(enough) compared to the previous code I'd love to know your thoughts.
I'm not seeing it.
Greetings,
Andres Freund
Hi,
On 2022-06-27 16:55:55 +0200, David Geier wrote:
Indeed, the total JIT time increases the more modules are used. The reason
for this to happen is that the inlining pass loads and deserializes all to
be inlined modules (.bc files) from disk prior to inlining them via
llvm::IRMover. There's already a cache for such modules in the code, but it
is currently unused. This is because llvm::IRMover takes the module to be
inlined as std::unique_ptr<llvm::Module>. The by-value argument requires
the source module to be moved, which means it cannot be reused afterwards.
The code is accounting for that by erasing the module from the cache after
inlining it, which in turns requires reloading the module next time a
reference to it is encountered.
Instead of each time loading and deserializing all to be inlined modules
from disk, they can reside in the cache and instead be cloned via
llvm::CloneModule() before they get inlined. Key to calling
llvm::CloneModule() is fully deserializing the module upfront, instead of
loading the module lazily. That is why I changed the call from
LLVMGetBitcodeModuleInContext2() (which lazily loads the module via
llvm::getOwningLazyBitcodeModule()) to LLVMParseBitCodeInContext2() (which
fully loads the module via llvm::parseBitcodeFile()). Beyond that it seems
like that prior to LLVM 13, cloning modules could fail with an assertion
(not sure though if that would cause problems in a release build without
assertions). Andres reported this problem back in the days here [1]. In the
meanwhile the issue got discussed in [2] and finally fixed for LLVM 13, see
[3].
Unfortunately that doesn't work right now - that's where I had started. The
problem is that IRMover renames types. Which, in the case of cloned modules
unfortunately means that types used in cloned modules are also renamed in the
"origin" module. Which then causes problems down the line, because parts of
the LLVM code match types by type names.
That can then have the effect of drastically decreasing code generation
quality over time, because e.g. inlining never manages to find signatures
compatible.
However, curiously the time spent on optimizing is also reduced (95ms
instead of 164ms). Could this be because some of the applied optimizations
are ending up in the cached module?
I suspect it's more that optimization stops being able to do a lot, due to the
type renaming issue.
@Andres: could you provide me with the queries that caused the assertion
failure in LLVM?
I don't think I have the concrete query. What I tend to do is to run the whole
regression tests with forced JITing. I'm fairly certain this triggered the bug
at the time.
Have you ever observed a segfault with a non-assert-enabled build?
I think I observed bogus code generation that then could lead to segfaults or
such.
I just want to make sure this is truly fixed in LLVM 13. Running 'make
check-world' all tests passed.
With jit-ing forced for everything?
One more thing to try is to jit-compile twice and ensure the code is the
same. It certainly wasn't in the past due to the above issue.
Greetings,
Andres Freund
On Mon, Jul 4, 2022 at 10:32 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-27 16:55:55 +0200, David Geier wrote:
Indeed, the total JIT time increases the more modules are used. The
reason
for this to happen is that the inlining pass loads and deserializes all
to
be inlined modules (.bc files) from disk prior to inlining them via
llvm::IRMover. There's already a cache for such modules in the code, but it
is currently unused. This is because llvm::IRMover takes the module to be
inlined as std::unique_ptr<llvm::Module>. The by-value argument requires
the source module to be moved, which means it cannot be reused afterwards.
The code is accounting for that by erasing the module from the cache
after
inlining it, which in turns requires reloading the module next time a
reference to it is encountered.
Instead of each time loading and deserializing all to be inlined modules
from disk, they can reside in the cache and instead be cloned via
llvm::CloneModule() before they get inlined. Key to calling
llvm::CloneModule() is fully deserializing the module upfront, instead of
loading the module lazily. That is why I changed the call from
LLVMGetBitcodeModuleInContext2() (which lazily loads the module via
llvm::getOwningLazyBitcodeModule()) to LLVMParseBitCodeInContext2()(which
fully loads the module via llvm::parseBitcodeFile()). Beyond that it
seems
like that prior to LLVM 13, cloning modules could fail with an assertion
(not sure though if that would cause problems in a release build without
assertions). Andres reported this problem back in the days here [1]. Inthe
meanwhile the issue got discussed in [2] and finally fixed for LLVM 13,
see
[3].
Unfortunately that doesn't work right now - that's where I had started. The
problem is that IRMover renames types. Which, in the case of cloned modules
unfortunately means that types used cloned modules are also renamed in the
"origin" module. Which then causes problems down the line, because parts of
the LLVM code match types by type names.
That can then have the effect of drastically decreasing code generation
quality over time, because e.g. inlining never manages to find signatures
compatible.Looking at the LLVM patches these issues seem to be gone.
The same suggests dumping the bitcode and running all tests with JIT
force-enabled.
See below.
However, curiously the time spent on optimizing is also reduced (95ms
instead of 164ms). Could this be because some of the applied optimizations
are ending up in the cached module?
I suspect it's more that optimization stops being able to do a lot, due to
the
type renaming issue.
@Andres: could you provide me with the queries that caused the assertion
failure in LLVM?
I don't think I have the concrete query. What I tend to do is to run the
whole
regression tests with forced JITing. I'm fairly certain this triggered the
bug
at the time.
Have you ever observed a segfault with a non-assert-enabled build?
I think I observed bogus code generation that then could lead to segfaults
or
such.
OK, then using a non-assert-enabled LLVM build seems good enough.
Especially, given the fixes they merged in [3].
I just want to make sure this is truly fixed in LLVM 13. Running 'make
check-world' all tests passed.
With jit-ing forced for everything?
One more thing to try is to jit-compile twice and ensure the code is the
same. It certainly wasn't in the past due to the above issue.
Yes, with jitting force-enabled everything passes (I removed all checks
for the various 'jit_above' costs from planner.c).
I also dumped the bitcode via 'SET jit_dump_bitcode = ON' and compared the
bitcode output after inlining and after optimization for two successive
executions of the very same query. The .bc files after inlining only
differ in two bytes. The .bc files after optimization differ in a lot more
bytes. However, the same is true for the master branch. How do you exactly
conclude the identity of compilation output?
--
David Geier
(ServiceNow)
Can you elaborate a bit more on how you conclude that?
Looking at the numbers I measured in one of my previous e-mails, it
looks to me like the overhead of using multiple modules is fairly low
and only measurable in queries with dozens of modules. Given that JIT is
most useful in queries that process a fair amount of rows, having to
spend marginally more time on creating the JIT program while being able
to use JIT much more fine grained seems desirable. For example, the time
you lose for handling more modules, you save right away because not the
whole plan gets JIT compiled.
It is a trade-off between optimizing for the best case where everything
in the plan can truly benefit from jitting and hence a single module
that has it all is best, vs the worst-case where almost nothing truly
profits from jitting and hence only a small fraction of the plan should
actually be jitted. The penalty for the best case seems low though,
because (1) the overhead is low in absolute terms, and (2) also if the
entire plan truly benefits from jitting, spending sub-ms more per node
seems negligible because there is anyway going to be significant time
spent.
--
David Geier
(ServiceNow)
Hi,
Thanks for the patch, looks quite interesting!
First to summarize things a bit: from what I understand there are two
suggestions on the table, one is about caching modules when doing
inlining, the second is about actual lazy jitting. Are those too tightly
coupled together, or could they be split into two separate patches? It
would make it a bit easier to review and test.
I was playing with the caching part of the patch (still have to think
about the lazy jitting), which generally seems to be a good idea. From
what I see the main concern here is a chance that IRMover will rename
types, degrading performance of the generated code in the long run. I have
to admit, I'm not fully sure the mentioned LLVM 13 fix [1] completely
eliminates those concerns; somehow its description is formulated in a not
very committal way ("don't know if this patch fixes ..., but it does
fix a few soundness issues that have crept in."). But I haven't found
any crashes or false asserts coming from the LLVM side (using v13),
running several rounds of regression tests with forced jitting, so that's a
point in favor of the fix.
I would be curious to learn how Andres was troubleshooting type renaming
issues? Using LLVM 13 from packages, jitting the same query twice and
dumping the bitcode out showed some difference in types a-la
"%struct.ExprState.142" vs "%struct.ExprState.430" (from what I
understood, the whole thing is an identifier, including the number) with
the patch, but the very same changes are happening on the main branch as
well. Of course, I was inspecting bitcode only for certain queries; that
doesn't exclude that some other examples actually feature type renaming.
In general, it would be interesting to know how to do some sort of "smoke
tests" for the generated code, e.g. in case LLVM has fixed this
particular issue, but it might reappear in the future?
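(One conceivable starting point for such a smoke test, purely as a sketch and nothing that exists in the tree: scan the module's identified struct types for renamed copies of a well-known type. As noted above, suffixed names also show up on master, so this alone would not be conclusive; a real check would need a sharper criterion.)

#include <llvm/IR/DerivedTypes.h>
#include <llvm/IR/Module.h>
#include <string>

/*
 * Sketch only: report whether a struct type named "<base>.<suffix>" exists
 * in the module, e.g. "struct.ExprState.430" for base "struct.ExprState".
 */
static bool
has_renamed_copy(const llvm::Module &mod, const std::string &base)
{
    for (llvm::StructType *ty : mod.getIdentifiedStructTypes())
    {
        if (!ty->hasName())
            continue;

        std::string n = ty->getName().str();

        if (n.size() > base.size() + 1 &&
            n.compare(0, base.size(), base) == 0 &&
            n[base.size()] == '.')
            return true;
    }
    return false;
}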
I did a few performance tests and got numbers similar to those posted in
the thread, inlining time being impressively reduced (~10 times) as well as
(suspiciously) optimization time (~1.5 times). The approach was a bit
different though: I ran the sequence of example queries from the thread
using pgbench and checked JIT counters from pgss (pg_stat_statements).
A few small notes:

If caching of modules is safe from LLVM >= 13 on, I guess it should be
wrapped into a corresponding condition on LLVM_VERSION_MAJOR, right?

Why was the assert about hasExternalLinkage removed from
llvmjit_inline.cpp?

For so much discussion about such a small change there is definitely not
enough commentary in the code about the dangers of cloning modules.

Also, maybe a stupid question, but how big should this cache be? From
what I understand it will be cleared on llvm_shutdown; does it mean it
can grow unbounded if e.g. a single session keeps firing different
queries at the db?
[1]: https://reviews.llvm.org/rG22a52dfddcefad4f275eb8ad1cc0e200074c2d8a
Hi,
Thanks for taking a look!
On 12/1/22 22:12, Dmitry Dolgov wrote:
First to summarize things a bit: from what I understand there are two
suggestions on the table, one is about caching modules when doing
inlining, the second is about actual lazy jitting. Are those too tightly
coupled together, or could they be split into two separate patches? It
would make it a bit easier to review and test.
Yes.
I was playing with the caching part of the patch (still have to think
about the lazy jitting), which generally seems to be a good idea. From
what I see the main concern here is a chance that IRMover will rename
types, degrading performance of the generated code in the long run. I have
to admit, I'm not fully sure the mentioned LLVM 13 fix [1] completely
eliminates those concerns; somehow its description is formulated in a not
very committal way ("don't know if this patch fixes ..., but it does
fix a few soundness issues that have crept in."). But I haven't found
any crashes or false asserts coming from the LLVM side (using v13),
running several rounds of regression tests with forced jitting, so that's a
point in favor of the fix.

I would be curious to learn how Andres was troubleshooting type renaming
issues? Using LLVM 13 from packages, jitting the same query twice and
dumping the bitcode out showed some difference in types a-la
"%struct.ExprState.142" vs "%struct.ExprState.430" (from what I
understood, the whole thing is an identifier, including the number) with
the patch, but the very same changes are happening on the main branch as
well. Of course, I was inspecting bitcode only for certain queries; that
doesn't exclude that some other examples actually feature type renaming.

In general, it would be interesting to know how to do some sort of "smoke
tests" for the generated code, e.g. in case LLVM has fixed this
particular issue, but it might reappear in the future?
+1, especially with my findings described below.
I did a few performance tests and got numbers similar to those posted in
the thread, inlining time being impressively reduced (~10 times) as well as
(suspiciously) optimization time (~1.5 times). The approach was a bit
different though: I ran the sequence of example queries from the thread
using pgbench and checked JIT counters from pgss (pg_stat_statements).

A few small notes:

If caching of modules is safe from LLVM >= 13 on, I guess it should be
wrapped into a corresponding condition on LLVM_VERSION_MAJOR, right?

Why was the assert about hasExternalLinkage removed from
llvmjit_inline.cpp?

For so much discussion about such a small change there is definitely not
enough commentary in the code about the dangers of cloning modules.

Also, maybe a stupid question, but how big should this cache be? From
what I understand it will be cleared on llvm_shutdown; does it mean it
can grow unbounded if e.g. a single session keeps firing different
queries at the db?
The cache doesn't contain the modules generated for a query but the
bitcode files with the to-be-inlined functions stored on disk. So the
cache would stop growing as soon as a connection process has loaded all
of them. Nevertheless, if a workload uses many connections and truly all,
or at least most, of the bitcode files, that could amount to quite some
memory.
Beyond that I made an unfortunate discovery: the inlining time goes down
by so much because no inlining is happening anymore :-/. This is because
I cloned the module only when passing it to the IRMover, not right after
taking it from the cache. However, after the module is taken from the
cache it's still modified. llvm_execute_inline_plan() executes
if (valueToImport->hasExternalLinkage())
{
    valueToImport->setLinkage(LinkageTypes::AvailableExternallyLinkage);
}
But when checking if a function is inlinable in function_inlinable() we
check
if (F.hasAvailableExternallyLinkage())
    return false;
which makes the code bail out for any function to be inlined.
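(A condensed illustration of that ordering issue, again only a sketch on top of the hypothetical cache sketched earlier and not the attached patch, which only adjusts the linkage of the individual values being imported:)

#include <llvm/IR/Function.h>
#include <llvm/IR/GlobalValue.h>
#include <llvm/IR/Module.h>
#include <llvm/Transforms/Utils/Cloning.h>
#include <memory>

/*
 * If the linkage were flipped on the cached module itself, every later use
 * of that cache entry would already see available_externally linkage and
 * function_inlinable() would refuse to inline anything from it.  Cloning
 * first keeps the cached module pristine.
 */
static std::unique_ptr<llvm::Module>
prepare_import_module(const llvm::Module &cached)
{
    /* clone first, so the cached module is never modified ... */
    std::unique_ptr<llvm::Module> importMod = llvm::CloneModule(cached);

    /* ... then adjust linkage only on the clone handed to the IRMover */
    for (llvm::Function &F : *importMod)
        if (F.hasExternalLinkage())
            F.setLinkage(llvm::GlobalValue::AvailableExternallyLinkage);

    return importMod;
}

(In the attached patch the linkage change stays where it is in llvm_execute_inline_plan(); only the clone is taken earlier, right after the cache lookup.)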
It's very curious as to why we didn't really see that when dumping the
bitcode. It seems like the bitcode is always different enough to not
spot that. So +1 on your point on having a smoke test in place to verify
things still work.
If I change the code to clone the module right after taking it from the
cache and before making the changes to it, the JIT time remains high and
seems even higher than when re-reading the file from disk. Profiling
showed that llvm::CloneModule() is just super slow, especially for big
bitcode files like numeric.bc. I haven't investigated why that is and if
we can do something about it. I also don't plan to do so for the time
being.
For reference, I attached the patch which only contains the caching
code. It's the variant that clones early.
--
David Geier
(ServiceNow)
Attachments:
0001-Cache-modules-in-JIT-inlining.patchtext/x-patch; charset=UTF-8; name=0001-Cache-modules-in-JIT-inlining.patchDownload
From 6a13803de83aaa46d34880e1b7c44bf93eb87f35 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Mon, 27 Jun 2022 12:28:29 +0200
Subject: [PATCH] Cache modules in JIT inlining
---
src/backend/jit/llvm/llvmjit_inline.cpp | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/src/backend/jit/llvm/llvmjit_inline.cpp b/src/backend/jit/llvm/llvmjit_inline.cpp
index dc35e002f5..b6bc599e60 100644
--- a/src/backend/jit/llvm/llvmjit_inline.cpp
+++ b/src/backend/jit/llvm/llvmjit_inline.cpp
@@ -63,6 +63,9 @@ extern "C"
#include <llvm/Linker/IRMover.h>
#include <llvm/Support/ManagedStatic.h>
#include <llvm/Support/MemoryBuffer.h>
+#if LLVM_VERSION_MAJOR >= 13
+#include <llvm/Transforms/Utils/Cloning.h>
+#endif
/*
@@ -378,13 +381,19 @@ llvm_execute_inline_plan(llvm::Module *mod, ImportMapTy *globalsToInline)
const llvm::StringSet<>& modGlobalsToInline = toInline.second;
llvm::SetVector<llvm::GlobalValue *> GlobalsToImport;
- Assert(module_cache->count(modPath));
- std::unique_ptr<llvm::Module> importMod(std::move((*module_cache)[modPath]));
- module_cache->erase(modPath);
-
if (modGlobalsToInline.empty())
continue;
+ auto iter = module_cache->find(modPath);
+ Assert(iter != module_cache->end());
+
+#if LLVM_VERSION_MAJOR >= 13
+ auto importMod = llvm::CloneModule(*iter->second);
+#else
+ auto importMod = std::move(iter->second);
+ module_cache->erase(modPath);
+#endif
+
for (auto &glob: modGlobalsToInline)
{
llvm::StringRef SymbolName = glob.first();
@@ -491,7 +500,11 @@ load_module(llvm::StringRef Identifier)
if (LLVMCreateMemoryBufferWithContentsOfFile(path, &buf, &msg))
elog(FATAL, "failed to open bitcode file \"%s\": %s",
path, msg);
+#if LLVM_VERSION_MAJOR >= 13
+ if (LLVMParseBitcodeInContext2(LLVMGetGlobalContext(), buf, &mod))
+#else
if (LLVMGetBitcodeModuleInContext2(LLVMGetGlobalContext(), buf, &mod))
+#endif
elog(FATAL, "failed to parse bitcode in file \"%s\"", path);
/*
--
2.34.1
On Fri, Jan 27, 2023 at 10:02:32AM +0100, David Geier wrote:
It's very curious as to why we didn't really see that when dumping the
bitcode. It seems like the bitcode is always different enough to not spot
that.
As I've noted off-list, there was a noticeable difference in the dumped
bitcode, which I hadn't noticed since we were talking mostly about
differences between executions of the same query.
Hi Dmitry,
On 1/27/23 16:18, Dmitry Dolgov wrote:
As I've noted off-list, there was a noticeable difference in the dumped
bitcode, which I hadn't noticed since we were talking mostly about
differences between executions of the same query.
Thanks for the clarification and also thanks for helping with this effort.
--
David Geier
(ServiceNow)