Query is over 2x slower with jit=on
Hi all.
I don't know whether this is expected or not but I'm experiencing over 2x
slowdown on a large query in PG-11 with JIT=on.
(query is a prepared statement executed with "explain analyze execute
myprepared(arg1, arg2, ..., argn)")
After 10 executions these are the results (the first 5 executed in > 16s, then
the plan changed)
With jit=on:
https://explain.depesz.com/s/vYB
Planning Time: 0.336 ms
JIT:
Functions: 716
Generation Time: 78.404 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 43.916 ms
Emission Time: 600.031 ms
Execution Time: 2035.150 ms
(385 rows)
With jit=off:
https://explain.depesz.com/s/X6mA
Planning Time: 0.371 ms
Execution Time: 833.941 ms
(377 rows)
Both are master as of 55d26ff638f063fbccf57843f2c27f9795895a5c
The query largely consists of CTEs with aggregation which are FULL OUTER
JOIN'ed.
On v10 the query executes in:
Execution time: 1159.628 ms
So v11 (with jit=off) is about 25% faster (due to parallel hash-join I think),
which is nice!
What's the deal with jit making it slower?
--
Andreas Joseph Krogh
On 2018-04-18 17:35:31 +0200, Andreas Joseph Krogh wrote:
With jit=on:
https://explain.depesz.com/s/vYB
Planning Time: 0.336 ms
JIT:
Functions: 716
Generation Time: 78.404 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 43.916 ms
Emission Time: 600.031 ms
Any chance this is a debug LLVM build?
What's the deal with jit making it slower?
JIT has cost, and sometimes it's not beneficial. Here our heuristics
when to JIT appear to be a bit off. In the parallel world it's worse
because the JITing is duplicated for parallel workers atm.
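For context, those heuristics are driven by a handful of session-settable cost
thresholds; a minimal sketch of the knobs involved (the values shown are the
v11 defaults, not a recommendation):

SET jit = on;                          -- master switch
SET jit_above_cost = 100000;           -- JIT once the plan's total cost exceeds this
SET jit_inline_above_cost = 500000;    -- additionally inline functions/operators
SET jit_optimize_above_cost = 500000;  -- additionally run expensive LLVM optimization

Setting jit = off (or jit_above_cost = -1) turns JIT off for the session.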
Greetings,
Andres Freund
On Wednesday 18 April 2018 at 17:50:55, Andres Freund <andres@anarazel.de> wrote:
On 2018-04-18 17:35:31 +0200, Andreas Joseph Krogh wrote:
With jit=on:
https://explain.depesz.com/s/vYB
Planning Time: 0.336 ms
JIT:
Functions: 716
Generation Time: 78.404 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 43.916 ms
Emission Time: 600.031 ms
Any chance this is a debug LLVM build?
What's the deal with jit making it slower?
JIT has cost, and sometimes it's not beneficial. Here our heuristics
when to JIT appear to be a bit off. In the parallel world it's worse
because the JITing is duplicated for parallel workers atm.
PostgreSQL is built with "--enable-debug --with-llvm". LLVM is the one which
comes with Ubuntu-17.10.
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com>
On 18 April 2018 at 16:50, Andres Freund <andres@anarazel.de> wrote:
On 2018-04-18 17:35:31 +0200, Andreas Joseph Krogh wrote:
With jit=on:
https://explain.depesz.com/s/vYB
Planning Time: 0.336 ms
JIT:
Functions: 716
Generation Time: 78.404 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 43.916 ms
Emission Time: 600.031 ms
Any chance this is a debug LLVM build?
What's the deal with jit making it slower?
JIT has cost, and sometimes it's not beneficial. Here our heuristics
when to JIT appear to be a bit off. In the parallel world it's worse
because the JITing is duplicated for parallel workers atm.
Please change the name of the "JIT" parameter to something meaningful
to humans before this gets too far into the wild.
SSL is somewhat understandable because it's not a Postgres-private term.
geqo is regrettable and we really don't want any more overly short or
abbreviated parameter names.
Think of our EOU if every GUC was a TLA.
Thanks
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/18/2018 12:27 PM, Simon Riggs wrote:
Please change the name of the "JIT" parameter to something meaningful
to humans before this gets too far into the wild.
SSL is somewhat understandable because it's not a Postgres-private term.
JIT is hardly a Postgres-private term. It's a familiar term in a
widespread community, though I might cede the point that the
community in question is more compiler wonks than DBAs.
-Chap
On April 18, 2018 9:50:48 AM PDT, Chapman Flack <chap@anastigmatix.net> wrote:
On 04/18/2018 12:27 PM, Simon Riggs wrote:
Please change the name of the "JIT" parameter to something meaningful
to humans before this gets too far into the wild.
SSL is somewhat understandable because it's not a Postgres-private term.
JIT is hardly a Postgres-private term. It's a familiar term in a
widespread community, though I might cede the point that the
community in question is more compiler wonks than DBAs.
You're right. There's another thread where we discussed this, also due to a concern of Simon's. Can we not do so separately here?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Apr 18, 2018 at 11:50 AM, Andres Freund <andres@anarazel.de> wrote:
JIT has cost, and sometimes it's not beneficial. Here our heuristics
when to JIT appear to be a bit off. In the parallel world it's worse
because the JITing is duplicated for parallel workers atm.
It seems like you're describing it as if the JIT just didn't produce
gains sufficient to make up for the cost of doing it, but that's not
really the issue here AFAICS. Here the JIT actually made code that
ran slower than the un-JIT-ted code. That seems like a different sort
of problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On April 18, 2018 12:16:35 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 18, 2018 at 11:50 AM, Andres Freund <andres@anarazel.de>
wrote:
JIT has cost, and sometimes it's not beneficial. Here our heuristics
when to JIT appear to be a bit off. In the parallel world it's worse
because the JITing is duplicated for parallel workers atm.
It seems like you're describing it as if the JIT just didn't produce
gains sufficient to make up for the cost of doing it, but that's not
really the issue here AFAICS. Here the JIT actually made code that
ran slower than the un-JIT-ted code. That seems like a different sort
of problem.
Not convinced that that is true - the issue is more likely that JIT work in workers is counted as execute time... Gotta add that somehow, not sure what the best way would be.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Apr 18, 2018 at 3:29 PM, Andres Freund <andres@anarazel.de> wrote:
Not convinced that that is true - the issue is more likely that JIT work in workers is counted as execute time... Gotta add that somehow, not sure what the best way would be.
Oh, that does seem like something that should be fixed. If that's
what is happening here, it's bound to confuse a lot of people.
Probably you need to add some code to
ExecParallelRetrieveInstrumentation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wednesday 18 April 2018 at 18:26:03, Andreas Joseph Krogh <andreas@visena.com> wrote:
On Wednesday 18 April 2018 at 17:50:55, Andres Freund <andres@anarazel.de> wrote:
On 2018-04-18 17:35:31 +0200, Andreas Joseph Krogh wrote:
With jit=on:
https://explain.depesz.com/s/vYB
Planning Time: 0.336 ms
JIT:
Functions: 716
Generation Time: 78.404 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 43.916 ms
Emission Time: 600.031 ms
Any chance this is a debug LLVM build?
What's the deal with jit making it slower?
JIT has cost, and sometimes it's not beneficial. Here our heuristics
when to JIT appear to be a bit off. In the parallel world it's worse
because the JITing is duplicated for parallel workers atm.
PostgreSQL is built with "--enable-debug --with-llvm". LLVM is the one which
comes with Ubuntu-17.10.
Some more info;
Without --enable-debug (seems this didn't impact performance much):
First 5 executions (prepared-statement issued from psql)
Planning Time: 47.634 ms
JIT:
Functions: 725
Generation Time: 74.748 ms
Inlining: true
Inlining Time: 90.763 ms
Optimization: true
Optimization Time: 5822.516 ms
Emission Time: 3089.127 ms
Execution Time: 16375.996 ms
After 5 executions (prepared-statement issued from psql)
Planning Time: 0.385 ms
JIT:
Functions: 716
Generation Time: 76.382 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 41.709 ms
Emission Time: 613.074 ms
Execution Time: 2031.830 ms
jit=off:
Planning Time: 0.171 ms
Execution Time: 832.489 ms
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com <mailto:andreas@visena.com>
www.visena.com <https://www.visena.com>
Hi,
On 2018-04-18 18:37:30 -0400, Robert Haas wrote:
On Wed, Apr 18, 2018 at 3:29 PM, Andres Freund <andres@anarazel.de> wrote:
Not convinced that that is true - the issue is more likely that JIT work in workers is counted as execute time... Gotta add that somehow, not sure what the best way would be.
Oh, that does seem like something that should be fixed. If that's
what is happening here, it's bound to confuse a lot of people.
Probably you need to add some code to
ExecParallelRetrieveInstrumentation.
I had lost track of this, and we unfortunately hadn't added an open item
back then. I think we should add it now?
RMT (with me recused), do you think we should accept the new code fixing
this would entail? And thus that this should be an open item? It's
arguably a new feature, although I don't find that a terribly convincing
position.
Greetings,
Andres Freund
On Aug 22, 2018, at 12:12 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-04-18 18:37:30 -0400, Robert Haas wrote:
On Wed, Apr 18, 2018 at 3:29 PM, Andres Freund <andres@anarazel.de> wrote:
Not convinced that that is true - the issue is more likely that JIT work in workers is counted as execute time... Gotta add that somehow, not sure what the best way would be.
Oh, that does seem like something that should be fixed. If that's
what is happening here, it's bound to confuse a lot of people.
Probably you need to add some code to
ExecParallelRetrieveInstrumentation.
I had lost track of this, and we unfortunately hadn't added an open item
back then. I think we should add it now?
RMT (with me recused), do you think we should accept the new code fixing
this would entail? And thus that this should be an open item? It's
arguably a new feature, although I don't find that a terribly convincing
position.
Reviewed the earlier discussion. I can argue this both ways.
As it stands right now it seems like we will be defaulting to “jit = off”, which
means there will only be so many people who will be using jit in PG11,
and as such we could provide the patch for 12.
However, for the people who do enable JIT in PG11, this is a scenario
that would cause the query to perform in an unexpected way that is no
fault of their own, which one would argue is a bug. I have not tried to
reproduce it myself, but from looking at the explain plans from Andreas,
I’m confident I could craft a moderately complex query that could
demonstrate the performance degradation with jit = on.
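Purely for illustration (table and column names here are invented, not taken
from Andreas' report), the shape described earlier in the thread, CTEs with
aggregation combined via FULL OUTER JOIN, is roughly:

WITH a AS (SELECT k, count(*) AS cnt FROM t1 GROUP BY k),
     b AS (SELECT k, sum(v) AS total FROM t2 GROUP BY k)
SELECT coalesce(a.k, b.k) AS k, a.cnt, b.total
FROM a FULL OUTER JOIN b ON a.k = b.k;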
While adding the code may constitute a new feature, it is a feature
that would prevent user frustration on something that we are highlighting as a
“major feature” of PG11, even if it’s not enabled by default.
What I would ask you, Andres, is how complex the patch will be and how much time
it will take. As such I would +1 it for open items right now unless it seems like
this could take a significant amount of time to come up with a proper patch, in which
case I would reconsider, but at least with it on open items we would be able
to continuously review it.
Thanks,
Jonathan
On Wednesday 22 August 2018 at 18:12:41, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-04-18 18:37:30 -0400, Robert Haas wrote:
On Wed, Apr 18, 2018 at 3:29 PM, Andres Freund <andres@anarazel.de> wrote:
Not convinced that that is true - the issue is more likely that JIT work
in workers is counted as execute time... Gotta add that somehow, not sure what
the best way would be.
Oh, that does seem like something that should be fixed. If that's
what is happening here, it's bound to confuse a lot of people.
Probably you need to add some code to
ExecParallelRetrieveInstrumentation.
I had lost track of this, and we unfortunately hadn't added an open item
back then. I think we should add it now?
RMT (with me recused), do you think we should accept the new code fixing
this would entail? And thus that this should be an open item? It's
arguably a new feature, although I don't find that a terribly convincing
position.
Greetings,
Andres Freund
Just to be clear; The query really runs slower (wall-clock time), it's not
just the timing.
--
Andreas Joseph Krogh
On 2018-08-22 18:39:18 +0200, Andreas Joseph Krogh wrote:
Just to be clear; The query really runs slower (wall-clock time), it's not
just the timing.
I bet it's not actually running slower, it "just" takes longer to start
up due to the JITing in each worker. I suspect what we should do is to
multiply the cost limits by the number of workers, to model that. But
without the fixed instrumentation that's harder to see...
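In the meantime the effect can be approximated by hand by scaling the
thresholds with the planned worker count; a rough sketch, assuming four
workers and the v11 default of 100000:

SET max_parallel_workers_per_gather = 4;
SET jit_above_cost = 500000;   -- roughly the default (100000) * (workers + 1)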
Greetings,
Andres Freund
On Wednesday 22 August 2018 at 18:51:55, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-22 18:39:18 +0200, Andreas Joseph Krogh wrote:
Just to be clear; The query really runs slower (wall-clock time), it's not
just the timing.
I bet it's not actually running slower, it "just" takes longer to start
up due to the JITing in each worker. I suspect what we should do is to
multiply the cost limits by the number of workers, to model that. But
without the fixed instrumentation that's harder to see...
Well, yes, that might be. By "runs" I meant from me hitting ENTER in psql to
the time the query finishes...
I thought JITing of prepared queries happened once (in "prepare") so it
didn't have to do the JITing every time the query is executed. Isn't
the previously generated bytecode usable for subsequent queries?
--
Andreas Joseph Krogh
On Wednesday, August 22, 2018 6:51:55 PM CEST Andres Freund wrote:
On 2018-08-22 18:39:18 +0200, Andreas Joseph Krogh wrote:
Just to be clear; The query really runs slower (wall-clock time), it's not
just the timing.
I bet it's not actually running slower, it "just" takes longer to start
up due to the JITing in each worker. I suspect what we should do is to
multiply the cost limits by the number of workers, to model that. But
without the fixed instrumentation that's harder to see...
It depends on the query. It has been shown in other threads that a query can
indeed take longer to run because of JITing: if the cost is too low to fire
the LLVM optimizer, the generated code can be so bad it will be slower than the
non-JIT executor.
Cf for instance a previous discussion here : http://www.postgresql-archive.org/PATCH-LLVM-tuple-deforming-improvements-td6029385.html
I think it would be interesting to try the query from this thread with a patch
forcing the LLVM codegen to O1 (I found no PassManager there to play with, it
seems to be an off/on/extreme switch; patch 0001-LLVM-Use-the-O1-CodeGen-level.patch
from the thread mentioned above).
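Short of patching the codegen level, one way to test that hypothesis on this
query is to force the expensive optimizations (and inlining) regardless of
cost, e.g.:

SET jit = on;
SET jit_above_cost = 0;            -- JIT everything
SET jit_optimize_above_cost = 0;   -- always run the LLVM optimizer when JITing
SET jit_inline_above_cost = 0;     -- likewise always inline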
On 2018-08-22 19:51:12 +0200, Andreas Joseph Krogh wrote:
I thought JITing of prepared queries happened once (in "prepare")
No, it happens when the first JITed function is executed.
so it didn't have to do the JITing every time the query is
executed. Isn't the previously generated bytecode usable for
subsequent queries?
No, not currently. There's some reasons preventing that (primarily that
we currently rely on addresses of certain things not to change during
execution). There's ongoing work to change that, but that's certainly
not going to be ready for v11.
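A quick way to see this with a toy prepared statement (a hypothetical example,
not the query from this thread) is that the JIT section reappears, with fresh
timings, on every execution:

SET jit_above_cost = 0;                -- force JIT so the effect is visible
PREPARE p(int) AS SELECT sum(g) FROM generate_series(1, $1) g;
EXPLAIN ANALYZE EXECUTE p(1000000);    -- shows a "JIT:" section
EXPLAIN ANALYZE EXECUTE p(1000000);    -- shows it again; nothing is reused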
Greetings,
Andres Freund
On Wednesday 22 August 2018 at 20:52:05, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-22 19:51:12 +0200, Andreas Joseph Krogh wrote:
I thought JITing of prepared queries happened once (in "prepare")
No, it happens when the first JITed function is executed.
so it didn't have to do the JITing every time the query is
executed. Isn't the previously generated bytecode usable for
subsequent queries?
No, not currently. There's some reasons preventing that (primarily that
we currently rely on addresses of certain things not to change during
execution). There's ongoing work to change that, but that's certainly
not going to be ready for v11.
Greetings,
Andres Freund
Ok, thanks for clarifying.
--
Andreas Joseph Krogh
On 2018-08-22 20:48:48 +0200, Pierre Ducroquet wrote:
On Wednesday, August 22, 2018 6:51:55 PM CEST Andres Freund wrote:
On 2018-08-22 18:39:18 +0200, Andreas Joseph Krogh wrote:
Just to be clear; The query really runs slower (wall-clock time), it's not
just the timing.
I bet it's not actually running slower, it "just" takes longer to start
up due to the JITing in each worker. I suspect what we should do is to
multiply the cost limits by the number of workers, to model that. But
without the fixed instrumentation that's harder to see...
It depends on the query. It has been shown in other threads that a query can
indeed take longer to run because of JITing: if the cost is too low to fire
the LLVM optimizer, the generated code can be so bad it will be slower than the
non-JIT executor.
This largely seems to be orthogonal to what I'm talking about.
Cf for instance a previous discussion here : http://www.postgresql-archive.org/PATCH-LLVM-tuple-deforming-improvements-td6029385.html
I wish people would stop using www.postgresql-archive.org. It's *NOT*
maintained by postgresql.org; in fact I do not know who maintains it. It does show
ads when downloading links, which I'm personally not ok with.
Greetings,
Andres Freund
On Aug 22, 2018, at 2:58 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
På onsdag 22. august 2018 kl. 20:52:05, skrev Andres Freund <andres@anarazel.de <mailto:andres@anarazel.de>>:
On 2018-08-22 19:51:12 +0200, Andreas Joseph Krogh wrote:
I thought JITing of prepared queries happened once (in "prepare")
No, it happens when the first JITed function is executed.
so it didn't have to do the JITing every time the query is
executed. Isn't the previously generated bytecode usable for
subsequent queries?
No, not currently. There's some reasons preventing that (primarily that
we currently rely on addresses of certain things not to change during
execution). There's ongoing work to change that, but that's certainly
not going to be ready for v11.
Greetings,
Andres Freund
Ok, thanks for clarifying.
Per earlier note[1] I was able to reproduce this issue with similar results to
Andreas while running 11 Beta 3.
jit = on
https://explain.depesz.com/s/vgzD
Planning Time: 0.921 ms
JIT:
Functions: 193
Generation Time: 121.595 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 58.045 ms
Emission Time: 1201.100 ms
Execution Time: 1628.017 ms
jit = off
https://explain.depesz.com/s/AvZM
Planning Time: 1.398 ms
Execution Time: 256.473 ms
I increased the search range I used in the query by 3x, and got these numbers:
jit=on
Planning Time: 0.931 ms
JIT:
Functions: 184
Generation Time: 126.587 ms
Inlining: true
Inlining Time: 98.865 ms
Optimization: true
Optimization Time: 20518.982 ms
Emission Time: 7270.963 ms
Execution Time: 28772.576 ms
jit=off
Planning Time: 1.527 ms
Execution Time: 959.160 ms
So, I would +1 this for open items.
Jonathan
[1]: /messages/by-id/7F838324-064B-4A24-952C-2800CFBD39D6@postgresql.org
On 2018-08-22 18:29:58 -0400, Jonathan S. Katz wrote:
On Aug 22, 2018, at 2:58 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
On Wednesday 22 August 2018 at 20:52:05, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-22 19:51:12 +0200, Andreas Joseph Krogh wrote:
I thought JITing of prepared queries happened once (in "prepare")
No, it happens when the first JITed function is executed.
so it didn't have to do the JITing every time the query is
executed. Isn't the previously generated bytecode usable for
subsequent queries?
No, not currently. There's some reasons preventing that (primarily that
we currently rely on addresses of certain things not to change during
execution). There's ongoing work to change that, but that's certainly
not going to be ready for v11.
Greetings,
Andres Freund
Ok, thanks for clarifying.
Per earlier note[1] I was able to reproduce this issue with similar results to
Andreas while running 11 Beta 3.
jit = on
https://explain.depesz.com/s/vgzD
Planning Time: 0.921 ms
JIT:
Functions: 193
Generation Time: 121.595 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 58.045 ms
Emission Time: 1201.100 ms
Execution Time: 1628.017 ms
jit = off
https://explain.depesz.com/s/AvZM
Planning Time: 1.398 ms
Execution Time: 256.473 ms
I increased the search range I used in the query by 3x, and got these numbers:
jit=on
Planning Time: 0.931 ms
JIT:
Functions: 184
Generation Time: 126.587 ms
Inlining: true
Inlining Time: 98.865 ms
Optimization: true
Optimization Time: 20518.982 ms
Emission Time: 7270.963 ms
Execution Time: 28772.576 ms
jit=off
Planning Time: 1.527 ms
Execution Time: 959.160 ms
For the archives sake: This likely largely is the consequence of
building with LLVM's expensive assertions enabled, as confirmed by
Jonathan over IM.
Greetings,
Andres Freund
On Aug 22, 2018, at 7:13 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-22 18:29:58 -0400, Jonathan S. Katz wrote:
On Aug 22, 2018, at 2:58 PM, Andreas Joseph Krogh <andreas@visena.com> wrote:
On Wednesday 22 August 2018 at 20:52:05, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-22 19:51:12 +0200, Andreas Joseph Krogh wrote:
I thought JITing of prepared queries happened once (in "prepare")
No, it happens when the first JITed function is executed.
so it didn't have to do the JITing every time the query is
executed. Isn't the previously generated bytecode usable for
subsequent queries?
No, not currently. There's some reasons preventing that (primarily that
we currently rely on addresses of certain things not to change during
execution). There's ongoing work to change that, but that's certainly
not going to be ready for v11.
Greetings,
Andres Freund
Ok, thanks for clarifying.
Per earlier note[1] I was able to reproduce this issue with similar results to
Andreas while running 11 Beta 3.
jit = on
https://explain.depesz.com/s/vgzD
Planning Time: 0.921 ms
JIT:
Functions: 193
Generation Time: 121.595 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 58.045 ms
Emission Time: 1201.100 ms
Execution Time: 1628.017 ms
jit = off
https://explain.depesz.com/s/AvZM
Planning Time: 1.398 ms
Execution Time: 256.473 ms
I increased the search range I used in the query by 3x, and got these numbers:
jit=on
Planning Time: 0.931 ms
JIT:
Functions: 184
Generation Time: 126.587 ms
Inlining: true
Inlining Time: 98.865 ms
Optimization: true
Optimization Time: 20518.982 ms
Emission Time: 7270.963 ms
Execution Time: 28772.576 ms
jit=off
Planning Time: 1.527 ms
Execution Time: 959.160 ms
For the archives sake: This likely largely is the consequence of
building with LLVM's expensive assertions enabled, as confirmed by
Jonathan over IM.
I recompiled with the release version of LLVM. jit=on was still slower,
but the discrepancy was not as bad as the previously reported result:
jit = off
Planning Time: 0.938 ms
Execution Time: 935.599 ms
jit = on
Planning Time: 0.951 ms
JIT:
Functions: 184
Generation Time: 17.605 ms
Inlining: true
Inlining Time: 20.522 ms
Optimization: true
Optimization Time: 1001.034 ms
Emission Time: 665.319 ms
Execution Time: 2491.560 ms
However, it was still 2x+ slower, so still +1ing for open items.
Jonathan
On Thursday 23 August 2018 at 03:00:42, Jonathan S. Katz <jkatz@postgresql.org> wrote:
On Aug 22, 2018, at 7:13 PM, Andres Freund <andres@anarazel.de> wrote:
[snip]
For the archives sake: This likely largely is the consequence of
building with LLVM's expensive assertions enabled, as confirmed by
Jonathan over IM.
I recompiled with the release version of LLVM. jit=on was still slower,
but the discrepancy was not as bad as the previously reported result:
jit = off
Planning Time: 0.938 ms
Execution Time: 935.599 ms
jit = on
Planning Time: 0.951 ms
JIT:
Functions: 184
Generation Time: 17.605 ms
Inlining: true
Inlining Time: 20.522 ms
Optimization: true
Optimization Time: 1001.034 ms
Emission Time: 665.319 ms
Execution Time: 2491.560 ms
However, it was still 2x+ slower, so still +1ing for open items.
I compiled with whatever switches the LLVM that comes with Ubuntu 18.04 is built
with, and without debugging or assertions.
--
Andreas Joseph Krogh
On Thursday 23 August 2018 at 09:14:56, Andreas Joseph Krogh <andreas@visena.com> wrote:
On Thursday 23 August 2018 at 03:00:42, Jonathan S. Katz <jkatz@postgresql.org> wrote:
On Aug 22, 2018, at 7:13 PM, Andres Freund <andres@anarazel.de> wrote:
[snip]
For the archives sake: This likely largely is the consequence of
building with LLVM's expensive assertions enabled, as confirmed by
Jonathan over IM.
I recompiled with the release version of LLVM. jit=on was still slower,
but the discrepancy was not as bad as the previously reported result:
jit = off
Planning Time: 0.938 ms
Execution Time: 935.599 ms
jit = on
Planning Time: 0.951 ms
JIT:
Functions: 184
Generation Time: 17.605 ms
Inlining: true
Inlining Time: 20.522 ms
Optimization: true
Optimization Time: 1001.034 ms
Emission Time: 665.319 ms
Execution Time: 2491.560 ms
However, it was still 2x+ slower, so still +1ing for open items.
I compiled with whatever switches the LLVM that comes with Ubuntu 18.04 is built
with, and without debugging or assertions.
With 11b3 as of 825f10fbda7a5d8a48d187b8193160e5e44e4011 I'm repeatedly
getting these results with jit=on, after 10 runs:
Planning Time: 0.266 ms
JIT:
Functions: 686
Generation Time: 71.895 ms
Inlining: false
Inlining Time: 0.000 ms
Optimization: false
Optimization Time: 39.906 ms
Emission Time: 589.944 ms
Execution Time: 2198.928 ms
Turning jit=off gives this:
Planning Time: 0.180 ms
Execution Time: 938.451 ms
I can provide dataset offlist if anyone wants to look into this.
--
Andreas Joseph Krogh
On 22 August 2018 at 22:21, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-22 18:39:18 +0200, Andreas Joseph Krogh wrote:
Just to be clear; The query really runs slower (wall-clock time), it's not
just the timing.
I bet it's not actually running slower, it "just" takes longer to start
up due to the JITing in each worker. I suspect what we should do is to
multiply the cost limits by the number of workers, to model that. But
without the fixed instrumentation that's harder to see...
Attached is a patch that accumulates the per-worker jit counters into
the leader process.
Similar to SharedExecutorInstrumentation, I added another shared
memory segment SharedJitInstrumentation, which is used to accumulate
each of the workers' jit instrumentation into the leader's.
Earlier I was thinking we could have a common
SharedExecutorInstrumentation structure that would hold both
instrumentations, the usual instrumentation and the jit
instrumentation. But then I thought that fields in
SharedExecutorInstrumentation are mostly not relevant to
JitInstrumentation, so I kept a separate segment in shared memory.
It may happen that the jit context does not get created at the backend
because there was no jit compile done, but at the workers it got
created. In that case, before retrieving the jit data from the
workers, the estate->es_jit needs to be allocated at the backend. For
that, I had to have a new function jit_create_context(), and for that
a new jit provider callback function create_context(). I used the
existing llvm_create_context() for this callback function.
JitContext now has those counters in a separate JitInstrumentation
structure, so that the same structure can be shared by
SharedJitInstrumentation and JitContext.
-----------------
I think we better show per-worker jit info also. The current patch
does not show that. I think it would be easy to continue on the patch
to show per-worker info also. Under the Gather node, we can show
per-worker jit counters. I think this would be useful too, besides the
cumulative figures in the leader process. Comments ?
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
jit_instr_account_workers.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0..8959a2b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -564,7 +564,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
* at a later stage.
*/
if (queryDesc->estate->es_jit && es->costs &&
- queryDesc->estate->es_jit->created_functions > 0)
+ queryDesc->estate->es_jit->jit_instr.created_functions > 0)
ExplainPrintJIT(es, queryDesc);
/*
@@ -705,28 +705,28 @@ ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc)
appendStringInfo(es->str, "JIT:\n");
}
- ExplainPropertyInteger("Functions", NULL, jc->created_functions, es);
+ ExplainPropertyInteger("Functions", NULL, jc->jit_instr.created_functions, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Generation Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->generation_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jc->jit_instr.generation_counter),
3, es);
ExplainPropertyBool("Inlining", jc->flags & PGJIT_INLINE, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Inlining Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->inlining_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jc->jit_instr.inlining_counter),
3, es);
ExplainPropertyBool("Optimization", jc->flags & PGJIT_OPT3, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Optimization Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->optimization_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jc->jit_instr.optimization_counter),
3, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Emission Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->emission_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jc->jit_instr.emission_counter),
3, es);
ExplainCloseGroup("JIT", "JIT", true, es);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ee0f07a..1d8182b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -37,6 +37,7 @@
#include "executor/nodeSeqscan.h"
#include "executor/nodeSort.h"
#include "executor/tqueue.h"
+#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/planmain.h"
#include "optimizer/planner.h"
@@ -62,6 +63,7 @@
#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
+#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -105,6 +107,18 @@ struct SharedExecutorInstrumentation
(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
(Instrumentation *) (((char *) sei) + sei->instrument_offset))
+/*
+ * DSM structure for accumulating jit instrumentation of all workers.
+ *
+ * num_workers: Number of workers.
+ */
+struct SharedJitInstrumentation
+{
+ int num_workers;
+ JitInstrumentation jit_instr[FLEXIBLE_ARRAY_MEMBER];
+ /* array of JitInstrumentation objects * num_workers follows */
+};
+
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
{
@@ -573,9 +587,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *paramlistinfo_space;
BufferUsage *bufusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
+ SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
int paramlistinfo_len;
int instrumentation_len = 0;
+ int jit_instrumentation_len;
int instrument_offset = 0;
Size dsa_minsize = dsa_minimum_size();
char *query_string;
@@ -661,6 +677,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
}
+ /* Estimate space for JIT instrumentation, if required. */
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation_len =
+ offsetof(SharedJitInstrumentation, jit_instr) +
+ sizeof(JitInstrumentation) * nworkers;
+ jit_instrumentation_len = MAXALIGN(jit_instrumentation_len);
+ shm_toc_estimate_chunk(&pcxt->estimator, jit_instrumentation_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
/* Estimate space for DSA area. */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -734,6 +761,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
pei->instrumentation = instrumentation;
}
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation = shm_toc_allocate(pcxt->toc, jit_instrumentation_len);
+ jit_instrumentation->num_workers = nworkers;
+ memset(jit_instrumentation->jit_instr, 0,
+ sizeof(JitInstrumentation) * nworkers);
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ jit_instrumentation);
+ pei->jit_instrumentation = jit_instrumentation;
+ }
+
/*
* Create a DSA area that can be used by the leader and all workers.
* (However, if we failed to create a DSM and are using private memory
@@ -990,6 +1028,35 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
}
/*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+ SharedJitInstrumentation *shared_jit)
+{
+ int n;
+ JitContext *jit = planstate->state->es_jit;
+
+ /* If the leader hasn't yet created a jit context, allocate one now. */
+ if (!jit)
+ {
+ planstate->state->es_jit = jit =
+ jit_create_context(planstate->state->es_jit_flags);
+ }
+
+ /*
+ * It may happen that we couldn't create a context, possibly because
+ * the jit provider could not be loaded. In that case, don't bother about
+ * retrieving from the workers.
+ */
+ if (jit != NULL)
+ {
+ for (n = 0; n < shared_jit->num_workers; ++n)
+ InstrJitAgg(&jit->jit_instr, &shared_jit->jit_instr[n]);
+ }
+}
+
+/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer usage.
*/
@@ -1054,6 +1121,11 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ /* Accumulate JIT instrumentation, if any. */
+ if (pei->jit_instrumentation)
+ ExecParallelRetrieveJitInstrumentation(pei->planstate,
+ pei->jit_instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
@@ -1260,6 +1332,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
+ SharedJitInstrumentation *jit_instrumentation;
int instrument_options = 0;
void *area_space;
dsa_area *area;
@@ -1273,6 +1346,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
+ jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ true);
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
/* Setting debug_query_string for individual workers */
@@ -1336,6 +1411,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecParallelReportInstrumentation(queryDesc->planstate,
instrumentation);
+ /* Report JIT instrumentation data if any */
+ if (queryDesc->estate->es_jit && jit_instrumentation != NULL)
+ {
+ Assert(ParallelWorkerNumber < jit_instrumentation->num_workers);
+ jit_instrumentation->jit_instr[ParallelWorkerNumber] =
+ queryDesc->estate->es_jit->jit_instr;
+ }
+
/* Must do this after capturing instrumentation. */
ExecutorEnd(queryDesc);
diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index c170309..8513d86 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -182,6 +182,34 @@ jit_compile_expr(struct ExprState *state)
return false;
}
+/*
+ * Ask provider to create a JIT context.
+ *
+ * Returns the jit context if successful, NULL if not.
+ */
+JitContext *
+jit_create_context(int flags)
+{
+ /* this also takes !jit_enabled into account */
+ if (provider_init())
+ return provider.create_context(flags);
+ else
+ return NULL;
+}
+
+/* Aggregate JIT instrumentation information */
+void
+InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add)
+{
+ dst->created_functions = add_size(dst->created_functions,
+ add->created_functions);
+
+ INSTR_TIME_ADD(dst->generation_counter, add->generation_counter);
+ INSTR_TIME_ADD(dst->inlining_counter, add->inlining_counter);
+ INSTR_TIME_ADD(dst->optimization_counter, add->optimization_counter);
+ INSTR_TIME_ADD(dst->emission_counter, add->emission_counter);
+}
+
static bool
file_exists(const char *name)
{
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 640c27f..5a1c67d 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -122,6 +122,7 @@ void
_PG_jit_provider_init(JitProviderCallbacks *cb)
{
cb->reset_after_error = llvm_reset_after_error;
+ cb->create_context = llvm_create_context;
cb->release_context = llvm_release_context;
cb->compile_expr = llvm_compile_expr;
}
@@ -133,7 +134,7 @@ _PG_jit_provider_init(JitProviderCallbacks *cb)
* the context is explicitly released, or when the lifetime of
* CurrentResourceOwner ends (usually the end of the current [sub]xact).
*/
-LLVMJitContext *
+JitContext *
llvm_create_context(int jitFlags)
{
LLVMJitContext *context;
@@ -152,7 +153,7 @@ llvm_create_context(int jitFlags)
context->base.resowner = CurrentResourceOwner;
ResourceOwnerRememberJIT(CurrentResourceOwner, PointerGetDatum(context));
- return context;
+ return (JitContext *) context;
}
/*
@@ -224,7 +225,7 @@ llvm_expand_funcname(struct LLVMJitContext *context, const char *basename)
{
Assert(context->module != NULL);
- context->base.created_functions++;
+ context->base.jit_instr.created_functions++;
/*
* Previously we used dots to separate, but turns out some tools, e.g.
@@ -504,7 +505,7 @@ llvm_compile_module(LLVMJitContext *context)
INSTR_TIME_SET_CURRENT(starttime);
llvm_inline(context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.inlining_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.inlining_counter,
endtime, starttime);
}
@@ -524,7 +525,7 @@ llvm_compile_module(LLVMJitContext *context)
INSTR_TIME_SET_CURRENT(starttime);
llvm_optimize_module(context, context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.optimization_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.optimization_counter,
endtime, starttime);
if (jit_dump_bitcode)
@@ -575,7 +576,7 @@ llvm_compile_module(LLVMJitContext *context)
}
#endif
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.emission_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.emission_counter,
endtime, starttime);
context->module = NULL;
@@ -596,9 +597,9 @@ llvm_compile_module(LLVMJitContext *context)
ereport(DEBUG1,
(errmsg("time to inline: %.3fs, opt: %.3fs, emit: %.3fs",
- INSTR_TIME_GET_DOUBLE(context->base.inlining_counter),
- INSTR_TIME_GET_DOUBLE(context->base.optimization_counter),
- INSTR_TIME_GET_DOUBLE(context->base.emission_counter)),
+ INSTR_TIME_GET_DOUBLE(context->base.jit_instr.inlining_counter),
+ INSTR_TIME_GET_DOUBLE(context->base.jit_instr.optimization_counter),
+ INSTR_TIME_GET_DOUBLE(context->base.jit_instr.emission_counter)),
errhidestmt(true),
errhidecontext(true)));
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0f31093..96f4a9d 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -129,7 +129,8 @@ llvm_compile_expr(ExprState *state)
}
else
{
- context = llvm_create_context(parent->state->es_jit_flags);
+ context = (LLVMJitContext *)
+ llvm_create_context(parent->state->es_jit_flags);
if (parent)
{
@@ -2557,7 +2558,7 @@ llvm_compile_expr(ExprState *state)
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.generation_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.generation_counter,
endtime, starttime);
return true;
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 626a66c..35dad58 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -20,6 +20,7 @@
#include "utils/dsa.h"
typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
+typedef struct SharedJitInstrumentation SharedJitInstrumentation;
typedef struct ParallelExecutorInfo
{
@@ -27,6 +28,7 @@ typedef struct ParallelExecutorInfo
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
+ SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
bool finished; /* set true by ExecParallelFinish */
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index b451f40..1a6c1b2 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -23,14 +23,8 @@
#define PGJIT_EXPR (1 << 3)
#define PGJIT_DEFORM (1 << 4)
-
-typedef struct JitContext
+typedef struct JitInstrumentation
{
- /* see PGJIT_* above */
- int flags;
-
- ResourceOwner resowner;
-
/* number of emitted functions */
size_t created_functions;
@@ -45,6 +39,14 @@ typedef struct JitContext
/* accumulated time for code emission */
instr_time emission_counter;
+} JitInstrumentation;
+
+typedef struct JitContext
+{
+ /* see PGJIT_* above */
+ int flags;
+ ResourceOwner resowner;
+ JitInstrumentation jit_instr;
} JitContext;
typedef struct JitProviderCallbacks JitProviderCallbacks;
@@ -52,6 +54,7 @@ typedef struct JitProviderCallbacks JitProviderCallbacks;
extern void _PG_jit_provider_init(JitProviderCallbacks *cb);
typedef void (*JitProviderInit) (JitProviderCallbacks *cb);
typedef void (*JitProviderResetAfterErrorCB) (void);
+typedef JitContext * (*JitProviderCreateContextCB) (int flags);
typedef void (*JitProviderReleaseContextCB) (JitContext *context);
struct ExprState;
typedef bool (*JitProviderCompileExprCB) (struct ExprState *state);
@@ -59,6 +62,7 @@ typedef bool (*JitProviderCompileExprCB) (struct ExprState *state);
struct JitProviderCallbacks
{
JitProviderResetAfterErrorCB reset_after_error;
+ JitProviderCreateContextCB create_context;
JitProviderReleaseContextCB release_context;
JitProviderCompileExprCB compile_expr;
};
@@ -85,6 +89,8 @@ extern void jit_release_context(JitContext *context);
* not be able to perform JIT (i.e. return false).
*/
extern bool jit_compile_expr(struct ExprState *state);
+extern JitContext *jit_create_context(int flags);
+extern void InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add);
#endif /* JIT_H */
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index c81cff8..be1c7ce 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -91,7 +91,7 @@ extern void llvm_leave_fatal_on_oom(void);
extern void llvm_reset_after_error(void);
extern void llvm_assert_in_fatal_section(void);
-extern LLVMJitContext *llvm_create_context(int jitFlags);
+extern JitContext * llvm_create_context(int jitFlags);
extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
extern char *llvm_expand_funcname(LLVMJitContext *context, const char *basename);
extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
Hi,
On 2018-09-10 15:42:55 +0530, Amit Khandekar wrote:
Attached is a patch that accumulates the per-worker jit counters into
the leader process.
Thanks!
I think we better show per-worker jit info also. The current patch
does not show that. I think it would be easy to continue on the patch
to show per-worker info also. Under the Gather node, we can show
per-worker jit counters. I think this would be useful too, besides the
cumulative figures in the leader process. Comments ?
Yes, I think that'd be good. I think we either should print the stats at
the top level as $leader_value, $combined_worker_value, $total_value or
just have the $combined_worker_value at the level where we print other
stats from the worker, too.
/*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+ SharedJitInstrumentation *shared_jit)
+{
+ int n;
+ JitContext *jit = planstate->state->es_jit;
+
+ /* If the leader hasn't yet created a jit context, allocate one now. */
+ if (!jit)
+ {
+ planstate->state->es_jit = jit =
+ jit_create_context(planstate->state->es_jit_flags);
+ }
Wouldn't it be better to move the jit instrumentation to outside of the
context, to avoid having to do this? Or just cope with not having
instrumentation for the leader in this case? We'd kinda need to deal
with failure to create one anyway?
Greetings,
Andres Freund
On 10 September 2018 at 21:39, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-09-10 15:42:55 +0530, Amit Khandekar wrote:
Attached is a patch that accumulates the per-worker jit counters into
the leader process.
Thanks!
I think we better show per-worker jit info also. The current patch
does not show that. I think it would be easy to continue on the patch
to show per-worker info also. Under the Gather node, we can show
per-worker jit counters. I think this would be useful too, besides the
cumulative figures in the leader process. Comments ?
Yes, I think that'd be good.
Ok. Will continue on the patch.
I think we either should print the stats at
the top level as $leader_value, $combined_worker_value, $total_value or
just have the $combined_worker_value at the level where we print other
stats from the worker, too.
Yes, I think we can follow and be consistent with whatever way in
which the other worker stats are printed. Will check.
Note: Since there can be multiple separate Gather plans under a plan
tree, I think we can show this info for each Gather plan.
/*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+ SharedJitInstrumentation *shared_jit)
+{
+ int n;
+ JitContext *jit = planstate->state->es_jit;
+
+ /* If the leader hasn't yet created a jit context, allocate one now. */
+ if (!jit)
+ {
+ planstate->state->es_jit = jit =
+ jit_create_context(planstate->state->es_jit_flags);
+ }
Wouldn't it be better to move the jit instrumentation to outside of the
context, to avoid having to do this? Or just cope with not having
instrumentation for the leader in this case? We'd kinda need to deal
with failure to create one anyway?
Yeah, I think taking the instrumentation out of the context looks
better. Will work on that.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 11 September 2018 at 14:50, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 10 September 2018 at 21:39, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2018-09-10 15:42:55 +0530, Amit Khandekar wrote:
Attached is a patch that accumulates the per-worker jit counters into
the leader process.
Thanks!
I think we better show per-worker jit info also. The current patch
does not show that. I think it would be easy to continue on the patch
to show per-worker info also. Under the Gather node, we can show
per-worker jit counters. I think this would be useful too, besides the
cumulative figures in the leader process. Comments ?
Yes, I think that'd be good.
Ok. Will continue on the patch.
I think we either should print the stats at
the top level as $leader_value, $combined_worker_value, $total_value or
just have the $combined_worker_value at the level where we print other
stats from the worker, too.
Yes, I think we can follow and be consistent with whatever way in
which the other worker stats are printed. Will check.
Note: Since there can be multiple separate Gather plans under a plan
tree, I think we can show this info for each Gather plan.
The attached patch shows per-worker information, besides the
cumulative figure at the end of the plan. Attached is the complete output
of an aggregate parallel query on the tenk1 table (created using the
regression tests). You can see that the per-worker figures are under
each of the Gather plans.
We can show combined values of all the workers under a Gather plan as
one single value, but I guess if we just show per-worker values, we
don't have to additionally show a combined value. Comments ?
Also, I have kept the per-worker info only for verbose=true.
The Gather plan's leader value is not shown. We also don't show
leader-specific values when we show the usual per-worker
instrumentation values. So I think we are consistent. Actually it is a
bit tedious to collect only leader-specific values. And I don't think
it is worth trying for it. For workers, I am not collecting
per-plan-node info, because we want to show total worker figures, and
not per-plan-node figures. For the normal instrumentations, we want to
show per-node info. For leader-specific instrumentation, we would need
to store per-(Gather-)plan info.
Need to finalize the indentation and the other explain formats. One
option is to keep the per-worker JIT info in a single line, like how
we print the normal per-worker instrumentation:
Worker 1: actual time=20.971..35.975 rows=39991 loops=1
Below is a snippet :
Aggregate (cost=18656.00..18656.01 rows=1 width=8) (actual
time=593.917..593.917 rows=1 loops=1)
Output: count(*)
-> Hash Join (cost=10718.00..18456.00 rows=80000 width=0) (actual
time=496.650..593.883 rows=16 loops=1)
Inner Unique: true
Hash Cond: ((a.unique1 = b.unique1) AND (a.two =
(row_number() OVER (?))))
-> Gather (cost=0.00..5918.00 rows=160000 width=8) (actual
time=192.319..223.384 rows=160000 loops=1)
Output: a.unique1, a.two
Workers Planned: 4
Workers Launched: 4
Jit for Worker : 0
Functions: 2
Generation Time: 0.273 ms
Inlining: true
Inlining Time: 43.686 ms
Optimization: true
Optimization Time: 10.788 ms
Emission Time: 8.438 ms
Jit for Worker : 1
Functions: 2
Generation Time: 0.293 ms
Inlining: true
Inlining Time: 72.587 ms
Optimization: true
Optimization Time: 10.386 ms
Emission Time: 8.115 ms
..........
..........
..........
Planning Time: 0.548 ms
Jit:
Functions: 40
Generation Time: 3.892 ms
Inlining: true
Inlining Time: 409.397 ms
Optimization: true
Optimization Time: 174.708 ms
Emission Time: 91.785 ms
Execution Time: 610.262 ms
(98 rows)
/*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+ SharedJitInstrumentation *shared_jit)
+{
+ int n;
+ JitContext *jit = planstate->state->es_jit;
+
+ /* If the leader hasn't yet created a jit context, allocate one now. */
+ if (!jit)
+ {
+ planstate->state->es_jit = jit =
+ jit_create_context(planstate->state->es_jit_flags);
+ }
Wouldn't it be better to move the jit instrumentation to outside of the
context, to avoid having to do this? Or just cope with not having
instrumentation for the leader in this case? We'd kinda need to deal
with failure to create one anyway?
Yeah, I think taking the instrumentation out of the context looks
better. Will work on that.
I have not done this yet. Will do now.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
jit_instr_account_workers_v2.patch (application/octet-stream)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0..f998062 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -564,8 +564,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
* at a later stage.
*/
if (queryDesc->estate->es_jit && es->costs &&
- queryDesc->estate->es_jit->created_functions > 0)
- ExplainPrintJIT(es, queryDesc);
+ queryDesc->estate->es_jit->jit_instr.created_functions > 0)
+ ExplainPrintJIT(es, queryDesc->estate->es_jit_flags,
+ &queryDesc->estate->es_jit->jit_instr, -1);
/*
* Close down the query and free resources. Include time for this in the
@@ -690,43 +691,54 @@ ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc)
/*
* ExplainPrintJIT -
- * Append information about JITing to es->str.
+ * Append information about JITing to es->str. Can be used to print
+ * the JIT instrumentation of the backend or that of a specific worker.
*/
void
-ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc)
+ExplainPrintJIT(ExplainState *es, int jit_flags,
+ JitInstrumentation *jit_instr, int worker_i)
{
- JitContext *jc = queryDesc->estate->es_jit;
+ bool for_workers = (worker_i >= 0);
+ char *jit_group_label = (for_workers ? "Workers JIT" : "JIT");
- ExplainOpenGroup("JIT", "JIT", true, es);
+ ExplainOpenGroup(jit_group_label, jit_group_label, true, es);
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (for_workers)
{
- es->indent += 1;
- appendStringInfo(es->str, "JIT:\n");
+ ExplainPropertyInteger("Jit for Worker ", NULL, worker_i, es);
+
+ /* Jump to the specific worker */
+ jit_instr = jit_instr + worker_i;
}
+ else
+ ExplainPropertyText("Jit", "", es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent += 1;
+
+ ExplainPropertyInteger("Functions", NULL, jit_instr->created_functions, es);
- ExplainPropertyInteger("Functions", NULL, jc->created_functions, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Generation Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->generation_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->generation_counter),
3, es);
- ExplainPropertyBool("Inlining", jc->flags & PGJIT_INLINE, es);
+ ExplainPropertyBool("Inlining", jit_flags & PGJIT_INLINE, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Inlining Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->inlining_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->inlining_counter),
3, es);
- ExplainPropertyBool("Optimization", jc->flags & PGJIT_OPT3, es);
+ ExplainPropertyBool("Optimization", jit_flags & PGJIT_OPT3, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Optimization Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->optimization_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->optimization_counter),
3, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Emission Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->emission_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->emission_counter),
3, es);
ExplainCloseGroup("JIT", "JIT", true, es);
@@ -1515,6 +1527,26 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyInteger("Workers Launched", NULL,
nworkers, es);
}
+
+ /* Print per-worker Jit instrumentation. */
+
+ if (es->analyze && es->verbose &&
+ outerPlanState(planstate)->worker_jit_instrument)
+ {
+ PlanState *child = outerPlanState(planstate);
+ int n;
+ SharedJitInstrumentation *w = child->worker_jit_instrument;
+
+ /* TODO: Consider es->costs before printing ? */
+ for (n = 0; n < w->num_workers; ++n)
+ {
+ /* TODO: Just skip for created_functions <= 0 ?? */
+ if (w->jit_instr[n].created_functions > 0)
+ ExplainPrintJIT(es, child->state->es_jit_flags,
+ w->jit_instr, n);
+ }
+ }
+
if (gather->single_copy || es->format != EXPLAIN_FORMAT_TEXT)
ExplainPropertyBool("Single Copy", gather->single_copy, es);
}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ee0f07a..754d858 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -37,6 +37,7 @@
#include "executor/nodeSeqscan.h"
#include "executor/nodeSort.h"
#include "executor/tqueue.h"
+#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/planmain.h"
#include "optimizer/planner.h"
@@ -62,6 +63,7 @@
#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
+#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -573,9 +575,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *paramlistinfo_space;
BufferUsage *bufusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
+ SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
int paramlistinfo_len;
int instrumentation_len = 0;
+ int jit_instrumentation_len;
int instrument_offset = 0;
Size dsa_minsize = dsa_minimum_size();
char *query_string;
@@ -661,6 +665,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
}
+ /* Estimate space for JIT instrumentation, if required. */
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation_len =
+ offsetof(SharedJitInstrumentation, jit_instr) +
+ sizeof(JitInstrumentation) * nworkers;
+ jit_instrumentation_len = MAXALIGN(jit_instrumentation_len);
+ shm_toc_estimate_chunk(&pcxt->estimator, jit_instrumentation_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
/* Estimate space for DSA area. */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -734,6 +749,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
pei->instrumentation = instrumentation;
}
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation = shm_toc_allocate(pcxt->toc, jit_instrumentation_len);
+ jit_instrumentation->num_workers = nworkers;
+ memset(jit_instrumentation->jit_instr, 0,
+ sizeof(JitInstrumentation) * nworkers);
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ jit_instrumentation);
+ pei->jit_instrumentation = jit_instrumentation;
+ }
+
/*
* Create a DSA area that can be used by the leader and all workers.
* (However, if we failed to create a DSM and are using private memory
@@ -990,6 +1016,55 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
}
/*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+ SharedJitInstrumentation *shared_jit)
+{
+ int n;
+ JitContext *jit = planstate->state->es_jit;
+ MemoryContext oldcontext;
+ int ibytes;
+
+ /* If the leader hasn't yet created a jit context, allocate one now. */
+ if (!jit)
+ {
+ planstate->state->es_jit = jit =
+ jit_create_context(planstate->state->es_jit_flags);
+ }
+
+ /*
+ * It may happen that we couldn't create a context, possibly because
+ * the jit provider could not be loaded. In that case, don't bother about
+ * retrieving from the workers.
+ */
+ if (jit != NULL)
+ {
+ for (n = 0; n < shared_jit->num_workers; ++n)
+ InstrJitAgg(&jit->jit_instr, &shared_jit->jit_instr[n]);
+ }
+
+ /*
+ * Also store the per-worker detail.
+ *
+ * Similar to ExecParallelRetrieveInstrumentation(), allocate the
+ * instrumentation in per-query context.
+ */
+
+ oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
+ ibytes = offsetof(SharedJitInstrumentation, jit_instr)
+ + mul_size(shared_jit->num_workers, sizeof(JitInstrumentation));
+ ibytes = MAXALIGN(ibytes);
+ elog(LOG, "ibytes : %d; num_workers: %d; sizeof(JitInstrumentation): %ld",
+ ibytes, shared_jit->num_workers, sizeof(JitInstrumentation));
+ planstate->worker_jit_instrument = palloc(ibytes);
+ MemoryContextSwitchTo(oldcontext);
+
+ memcpy(planstate->worker_jit_instrument, shared_jit, ibytes);
+}
+
+/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer usage.
*/
@@ -1054,6 +1129,11 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ /* Accumulate JIT instrumentation, if any. */
+ if (pei->jit_instrumentation)
+ ExecParallelRetrieveJitInstrumentation(pei->planstate,
+ pei->jit_instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
@@ -1260,6 +1340,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
+ SharedJitInstrumentation *jit_instrumentation;
int instrument_options = 0;
void *area_space;
dsa_area *area;
@@ -1273,6 +1354,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
+ jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ true);
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
/* Setting debug_query_string for individual workers */
@@ -1336,6 +1419,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecParallelReportInstrumentation(queryDesc->planstate,
instrumentation);
+ /* Report JIT instrumentation data if any */
+ if (queryDesc->estate->es_jit && jit_instrumentation != NULL)
+ {
+ Assert(ParallelWorkerNumber < jit_instrumentation->num_workers);
+ jit_instrumentation->jit_instr[ParallelWorkerNumber] =
+ queryDesc->estate->es_jit->jit_instr;
+ }
+
/* Must do this after capturing instrumentation. */
ExecutorEnd(queryDesc);
diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index c170309..8513d86 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -182,6 +182,34 @@ jit_compile_expr(struct ExprState *state)
return false;
}
+/*
+ * Ask provider to create a JIT context.
+ *
+ * Returns the jit context if successful, NULL if not.
+ */
+JitContext *
+jit_create_context(int flags)
+{
+ /* this also takes !jit_enabled into account */
+ if (provider_init())
+ return provider.create_context(flags);
+ else
+ return NULL;
+}
+
+/* Aggregate JIT instrumentation information */
+void
+InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add)
+{
+ dst->created_functions = add_size(dst->created_functions,
+ add->created_functions);
+
+ INSTR_TIME_ADD(dst->generation_counter, add->generation_counter);
+ INSTR_TIME_ADD(dst->inlining_counter, add->inlining_counter);
+ INSTR_TIME_ADD(dst->optimization_counter, add->optimization_counter);
+ INSTR_TIME_ADD(dst->emission_counter, add->emission_counter);
+}
+
static bool
file_exists(const char *name)
{
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 640c27f..5a1c67d 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -122,6 +122,7 @@ void
_PG_jit_provider_init(JitProviderCallbacks *cb)
{
cb->reset_after_error = llvm_reset_after_error;
+ cb->create_context = llvm_create_context;
cb->release_context = llvm_release_context;
cb->compile_expr = llvm_compile_expr;
}
@@ -133,7 +134,7 @@ _PG_jit_provider_init(JitProviderCallbacks *cb)
* the context is explicitly released, or when the lifetime of
* CurrentResourceOwner ends (usually the end of the current [sub]xact).
*/
-LLVMJitContext *
+JitContext *
llvm_create_context(int jitFlags)
{
LLVMJitContext *context;
@@ -152,7 +153,7 @@ llvm_create_context(int jitFlags)
context->base.resowner = CurrentResourceOwner;
ResourceOwnerRememberJIT(CurrentResourceOwner, PointerGetDatum(context));
- return context;
+ return (JitContext *) context;
}
/*
@@ -224,7 +225,7 @@ llvm_expand_funcname(struct LLVMJitContext *context, const char *basename)
{
Assert(context->module != NULL);
- context->base.created_functions++;
+ context->base.jit_instr.created_functions++;
/*
* Previously we used dots to separate, but turns out some tools, e.g.
@@ -504,7 +505,7 @@ llvm_compile_module(LLVMJitContext *context)
INSTR_TIME_SET_CURRENT(starttime);
llvm_inline(context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.inlining_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.inlining_counter,
endtime, starttime);
}
@@ -524,7 +525,7 @@ llvm_compile_module(LLVMJitContext *context)
INSTR_TIME_SET_CURRENT(starttime);
llvm_optimize_module(context, context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.optimization_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.optimization_counter,
endtime, starttime);
if (jit_dump_bitcode)
@@ -575,7 +576,7 @@ llvm_compile_module(LLVMJitContext *context)
}
#endif
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.emission_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.emission_counter,
endtime, starttime);
context->module = NULL;
@@ -596,9 +597,9 @@ llvm_compile_module(LLVMJitContext *context)
ereport(DEBUG1,
(errmsg("time to inline: %.3fs, opt: %.3fs, emit: %.3fs",
- INSTR_TIME_GET_DOUBLE(context->base.inlining_counter),
- INSTR_TIME_GET_DOUBLE(context->base.optimization_counter),
- INSTR_TIME_GET_DOUBLE(context->base.emission_counter)),
+ INSTR_TIME_GET_DOUBLE(context->base.jit_instr.inlining_counter),
+ INSTR_TIME_GET_DOUBLE(context->base.jit_instr.optimization_counter),
+ INSTR_TIME_GET_DOUBLE(context->base.jit_instr.emission_counter)),
errhidestmt(true),
errhidecontext(true)));
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0f31093..96f4a9d 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -129,7 +129,8 @@ llvm_compile_expr(ExprState *state)
}
else
{
- context = llvm_create_context(parent->state->es_jit_flags);
+ context = (LLVMJitContext *)
+ llvm_create_context(parent->state->es_jit_flags);
if (parent)
{
@@ -2557,7 +2558,7 @@ llvm_compile_expr(ExprState *state)
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.generation_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.jit_instr.generation_counter,
endtime, starttime);
return true;
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 9b75baa..164cf59 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -81,7 +81,8 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
-extern void ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc);
+extern void ExplainPrintJIT(ExplainState *es, int jit_flags,
+ JitInstrumentation *jit_instr, int worker_i);
extern void ExplainQueryText(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 626a66c..ea1991f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -27,6 +27,7 @@ typedef struct ParallelExecutorInfo
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
+ SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
bool finished; /* set true by ExecParallelFinish */
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index b451f40..ba74e67 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -23,14 +23,8 @@
#define PGJIT_EXPR (1 << 3)
#define PGJIT_DEFORM (1 << 4)
-
-typedef struct JitContext
+typedef struct JitInstrumentation
{
- /* see PGJIT_* above */
- int flags;
-
- ResourceOwner resowner;
-
/* number of emitted functions */
size_t created_functions;
@@ -45,6 +39,26 @@ typedef struct JitContext
/* accumulated time for code emission */
instr_time emission_counter;
+} JitInstrumentation;
+
+/*
+ * DSM structure for accumulating jit instrumentation of all workers.
+ *
+ * num_workers: Number of workers.
+ */
+typedef struct SharedJitInstrumentation
+{
+ int num_workers;
+ JitInstrumentation jit_instr[FLEXIBLE_ARRAY_MEMBER];
+ /* array of JitInstrumentation objects * num_workers follows */
+} SharedJitInstrumentation;
+
+typedef struct JitContext
+{
+ /* see PGJIT_* above */
+ int flags;
+ ResourceOwner resowner;
+ JitInstrumentation jit_instr;
} JitContext;
typedef struct JitProviderCallbacks JitProviderCallbacks;
@@ -52,6 +66,7 @@ typedef struct JitProviderCallbacks JitProviderCallbacks;
extern void _PG_jit_provider_init(JitProviderCallbacks *cb);
typedef void (*JitProviderInit) (JitProviderCallbacks *cb);
typedef void (*JitProviderResetAfterErrorCB) (void);
+typedef JitContext * (*JitProviderCreateContextCB) (int flags);
typedef void (*JitProviderReleaseContextCB) (JitContext *context);
struct ExprState;
typedef bool (*JitProviderCompileExprCB) (struct ExprState *state);
@@ -59,6 +74,7 @@ typedef bool (*JitProviderCompileExprCB) (struct ExprState *state);
struct JitProviderCallbacks
{
JitProviderResetAfterErrorCB reset_after_error;
+ JitProviderCreateContextCB create_context;
JitProviderReleaseContextCB release_context;
JitProviderCompileExprCB compile_expr;
};
@@ -85,6 +101,8 @@ extern void jit_release_context(JitContext *context);
* not be able to perform JIT (i.e. return false).
*/
extern bool jit_compile_expr(struct ExprState *state);
+extern JitContext *jit_create_context(int flags);
+extern void InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add);
#endif /* JIT_H */
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index c81cff8..be1c7ce 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -91,7 +91,7 @@ extern void llvm_leave_fatal_on_oom(void);
extern void llvm_reset_after_error(void);
extern void llvm_assert_in_fatal_section(void);
-extern LLVMJitContext *llvm_create_context(int jitFlags);
+extern JitContext * llvm_create_context(int jitFlags);
extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
extern char *llvm_expand_funcname(LLVMJitContext *context, const char *basename);
extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c830f14..8418625 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -18,6 +18,7 @@
#include "access/heapam.h"
#include "access/tupconvert.h"
#include "executor/instrument.h"
+#include "jit/jit.h"
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
@@ -923,6 +924,9 @@ typedef struct PlanState
Instrumentation *instrument; /* Optional runtime stats for this node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ /* Per-worker JIT instrumentation */
+ SharedJitInstrumentation *worker_jit_instrument;
+
/*
* Common structural data for all Plan types. These links to subsidiary
* state trees parallel links in the associated plan tree (except for the
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
On 11 September 2018 at 14:50, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 10 September 2018 at 21:39, Andres Freund <andres@anarazel.de> wrote:
On 2018-09-10 15:42:55 +0530, Amit Khandekar wrote:
I think we better show per-worker jit info also.
Just to throw a contrarian opinion into this: I find the current EXPLAIN
output for JIT to be insanely verbose already.
regards, tom lane
On 14 September 2018 at 16:48, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 11 September 2018 at 14:50, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 10 September 2018 at 21:39, Andres Freund <andres@anarazel.de> wrote:
 /*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+                                       SharedJitInstrumentation *shared_jit)
+{
+    int         n;
+    JitContext *jit = planstate->state->es_jit;
+
+    /* If the leader hasn't yet created a jit context, allocate one now. */
+    if (!jit)
+    {
+        planstate->state->es_jit = jit =
+            jit_create_context(planstate->state->es_jit_flags);
+    }
Wouldn't it be better to move the jit instrumentation to outside of the
context, to avoid having to do this? Or just cope with not having
instrumentation for the leader in this case? We'd kinda need to deal
with failure to create one anyway?
Yeah, I think taking the instrumentation out of the context looks
better. Will work on that.
Not yet done this. Will do now.
Attached v3 patch that does the above change.
The jit instrumentation counters are used by the provider code
directly. So I think even if we keep the instrumentation outside of
the context, we should let the resource manager handle deallocation of
the instrumentation, similar to how it deallocates the whole jit
context. So we should allocate the instrumentation in
TopMemoryContext. Otherwise, if the instrumentation is allocated in the
per-query context and the provider functions access it after the
Portal resource is cleaned up, the instrumentation would already have
been deallocated.
So for this, I thought we'd better have two separate contexts: a base
context (i.e. the usual JitContext) and a provider-specific context.
JitContext gets a new field, provider_context, pointing to the
LLVMJitContext, and LLVMJitContext has a 'base' field pointing back to
the base context (i.e. the JitContext).
So in ExecParallelRetrieveJitInstrumentation(), where the jit context
is created if it's NULL, I now create just the base context. Later, the
JitContext->provider_context will be allocated on demand in
llvm_compile().
This way, we can release both the base context and provider context
during ResourceOwnerRelease(), and at the same time continue to be
able to access the jit instrumentation from the provider.
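To summarize the resulting struct shape (condensed from the v3 patch
attached below; the comments are mine):

typedef struct JitContext
{
    int         flags;              /* PGJIT_* flags */
    ResourceOwner resowner;         /* ensures cleanup at resowner release */
    JitInstrumentation jit_instr;   /* counters the provider updates */
    void       *provider_context;   /* e.g. LLVMJitContext, created on demand */
} JitContext;

typedef struct LLVMJitContext
{
    JitContext *base;               /* back-pointer to the base context */
    /* ... module, module_generation, etc. as before ... */
} LLVMJitContext;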
-----------
BTW, there is code in llvm_compile_expr() on HEAD that does not
handle the !parent case correctly:
/* get or create JIT context */
if (parent && parent->state->es_jit)
{
    context = (LLVMJitContext *) parent->state->es_jit;
}
else
{
    context = llvm_create_context(parent->state->es_jit_flags);

    if (parent)
    {
        parent->state->es_jit = &context->base;
    }
}
Here, llvm_create_context() will crash if parent is NULL. Is it that
parent can never be NULL? I think so, because in jit_compile_expr()
we skip everything if there is no plan state. So probably there should
be an Assert(parent != NULL) rather than a check for parent. Otherwise,
we need to decide what JIT flags to supply to llvm_create_context() if
we let the provider create the context without a plan state. For now,
in the changes I did to the above snippet, I have kept a TODO.
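If parent really can never be NULL here, the snippet could collapse to
something like this (my sketch against the HEAD code quoted above, not
part of the attached patch):

/* get or create JIT context; callers always pass a plan state */
Assert(parent != NULL);

if (parent->state->es_jit)
    context = (LLVMJitContext *) parent->state->es_jit;
else
{
    context = llvm_create_context(parent->state->es_jit_flags);
    parent->state->es_jit = &context->base;
}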
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Attachments:
jit_instr_account_workers_v3.patch
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 16a80a0..f998062 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -564,8 +564,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
* at a later stage.
*/
if (queryDesc->estate->es_jit && es->costs &&
- queryDesc->estate->es_jit->created_functions > 0)
- ExplainPrintJIT(es, queryDesc);
+ queryDesc->estate->es_jit->jit_instr.created_functions > 0)
+ ExplainPrintJIT(es, queryDesc->estate->es_jit_flags,
+ &queryDesc->estate->es_jit->jit_instr, -1);
/*
* Close down the query and free resources. Include time for this in the
@@ -690,43 +691,54 @@ ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc)
/*
* ExplainPrintJIT -
- * Append information about JITing to es->str.
+ * Append information about JITing to es->str. Can be used to print
+ * the JIT instrumentation of the backend or that of a specific worker.
*/
void
-ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc)
+ExplainPrintJIT(ExplainState *es, int jit_flags,
+ JitInstrumentation *jit_instr, int worker_i)
{
- JitContext *jc = queryDesc->estate->es_jit;
+ bool for_workers = (worker_i >= 0);
+ char *jit_group_label = (for_workers ? "Workers JIT" : "JIT");
- ExplainOpenGroup("JIT", "JIT", true, es);
+ ExplainOpenGroup(jit_group_label, jit_group_label, true, es);
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (for_workers)
{
- es->indent += 1;
- appendStringInfo(es->str, "JIT:\n");
+ ExplainPropertyInteger("Jit for Worker ", NULL, worker_i, es);
+
+ /* Jump to the specific worker */
+ jit_instr = jit_instr + worker_i;
}
+ else
+ ExplainPropertyText("Jit", "", es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent += 1;
+
+ ExplainPropertyInteger("Functions", NULL, jit_instr->created_functions, es);
- ExplainPropertyInteger("Functions", NULL, jc->created_functions, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Generation Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->generation_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->generation_counter),
3, es);
- ExplainPropertyBool("Inlining", jc->flags & PGJIT_INLINE, es);
+ ExplainPropertyBool("Inlining", jit_flags & PGJIT_INLINE, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Inlining Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->inlining_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->inlining_counter),
3, es);
- ExplainPropertyBool("Optimization", jc->flags & PGJIT_OPT3, es);
+ ExplainPropertyBool("Optimization", jit_flags & PGJIT_OPT3, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Optimization Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->optimization_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->optimization_counter),
3, es);
if (es->analyze && es->timing)
ExplainPropertyFloat("Emission Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->emission_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(jit_instr->emission_counter),
3, es);
ExplainCloseGroup("JIT", "JIT", true, es);
@@ -1515,6 +1527,26 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyInteger("Workers Launched", NULL,
nworkers, es);
}
+
+ /* Print per-worker Jit instrumentation. */
+
+ if (es->analyze && es->verbose &&
+ outerPlanState(planstate)->worker_jit_instrument)
+ {
+ PlanState *child = outerPlanState(planstate);
+ int n;
+ SharedJitInstrumentation *w = child->worker_jit_instrument;
+
+ /* TODO: Consider es->costs before printing ? */
+ for (n = 0; n < w->num_workers; ++n)
+ {
+ /* TODO: Just skip for created_functions <= 0 ?? */
+ if (w->jit_instr[n].created_functions > 0)
+ ExplainPrintJIT(es, child->state->es_jit_flags,
+ w->jit_instr, n);
+ }
+ }
+
if (gather->single_copy || es->format != EXPLAIN_FORMAT_TEXT)
ExplainPropertyBool("Single Copy", gather->single_copy, es);
}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c93084e..8152099 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -37,6 +37,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/planmain.h"
#include "optimizer/planner.h"
@@ -62,6 +63,7 @@
#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
+#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -573,9 +575,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *paramlistinfo_space;
BufferUsage *bufusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
+ SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
int paramlistinfo_len;
int instrumentation_len = 0;
+ int jit_instrumentation_len;
int instrument_offset = 0;
Size dsa_minsize = dsa_minimum_size();
char *query_string;
@@ -671,6 +675,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
}
+ /* Estimate space for JIT instrumentation, if required. */
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation_len =
+ offsetof(SharedJitInstrumentation, jit_instr) +
+ sizeof(JitInstrumentation) * nworkers;
+ jit_instrumentation_len = MAXALIGN(jit_instrumentation_len);
+ shm_toc_estimate_chunk(&pcxt->estimator, jit_instrumentation_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
/* Estimate space for DSA area. */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -744,6 +759,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
pei->instrumentation = instrumentation;
}
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation = shm_toc_allocate(pcxt->toc, jit_instrumentation_len);
+ jit_instrumentation->num_workers = nworkers;
+ memset(jit_instrumentation->jit_instr, 0,
+ sizeof(JitInstrumentation) * nworkers);
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ jit_instrumentation);
+ pei->jit_instrumentation = jit_instrumentation;
+ }
+
/*
* Create a DSA area that can be used by the leader and all workers.
* (However, if we failed to create a DSM and are using private memory
@@ -1004,6 +1030,51 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
}
/*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+ SharedJitInstrumentation *shared_jit)
+{
+ int n;
+ JitContext *jit = planstate->state->es_jit;
+ MemoryContext oldcontext;
+ int ibytes;
+
+ /*
+ * If the leader hasn't yet created a jit context, allocate a base context.
+ * The provider context will be created on demand.
+ */
+ if (!jit)
+ {
+ planstate->state->es_jit = jit =
+ jit_create_base_context(planstate->state->es_jit_flags);
+ }
+
+ /* Accummulate all the workers' instrumentations. */
+ for (n = 0; n < shared_jit->num_workers; ++n)
+ InstrJitAgg(&jit->jit_instr, &shared_jit->jit_instr[n]);
+
+ /*
+ * Also store the per-worker detail.
+ *
+ * Similar to ExecParallelRetrieveInstrumentation(), allocate the
+ * instrumentation in per-query context.
+ */
+
+ oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
+ ibytes = offsetof(SharedJitInstrumentation, jit_instr)
+ + mul_size(shared_jit->num_workers, sizeof(JitInstrumentation));
+ ibytes = MAXALIGN(ibytes);
+ elog(LOG, "ibytes : %d; num_workers: %d; sizeof(JitInstrumentation): %ld",
+ ibytes, shared_jit->num_workers, sizeof(JitInstrumentation));
+ planstate->worker_jit_instrument = palloc(ibytes);
+ MemoryContextSwitchTo(oldcontext);
+
+ memcpy(planstate->worker_jit_instrument, shared_jit, ibytes);
+}
+
+/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer usage.
*/
@@ -1068,6 +1139,11 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ /* Accumulate JIT instrumentation, if any. */
+ if (pei->jit_instrumentation)
+ ExecParallelRetrieveJitInstrumentation(pei->planstate,
+ pei->jit_instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
@@ -1274,6 +1350,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
+ SharedJitInstrumentation *jit_instrumentation;
int instrument_options = 0;
void *area_space;
dsa_area *area;
@@ -1287,6 +1364,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
+ jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ true);
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
/* Setting debug_query_string for individual workers */
@@ -1350,6 +1429,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecParallelReportInstrumentation(queryDesc->planstate,
instrumentation);
+ /* Report JIT instrumentation data if any */
+ if (queryDesc->estate->es_jit && jit_instrumentation != NULL)
+ {
+ Assert(ParallelWorkerNumber < jit_instrumentation->num_workers);
+ jit_instrumentation->jit_instr[ParallelWorkerNumber] =
+ queryDesc->estate->es_jit->jit_instr;
+ }
+
/* Must do this after capturing instrumentation. */
ExecutorEnd(queryDesc);
diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index 7d3883d..8c4385b 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -27,6 +27,7 @@
#include "executor/execExpr.h"
#include "jit/jit.h"
#include "miscadmin.h"
+#include "utils/memutils.h"
#include "utils/resowner_private.h"
#include "utils/fmgrprotos.h"
@@ -140,9 +141,11 @@ jit_reset_after_error(void)
void
jit_release_context(JitContext *context)
{
- if (provider_successfully_loaded)
- provider.release_context(context);
-
+ if (provider_successfully_loaded && context->provider_context)
+ {
+ provider.release_context(context->provider_context);
+ pfree(context->provider_context);
+ }
ResourceOwnerForgetJIT(context->resowner, PointerGetDatum(context));
pfree(context);
}
@@ -182,6 +185,36 @@ jit_compile_expr(struct ExprState *state)
return false;
}
+JitContext *
+jit_create_base_context(int flags)
+{
+ JitContext *context;
+
+ ResourceOwnerEnlargeJIT(CurrentResourceOwner);
+
+ context = MemoryContextAllocZero(TopMemoryContext, sizeof(JitContext));
+ context->flags = flags;
+
+ /* ensure cleanup */
+ context->resowner = CurrentResourceOwner;
+ ResourceOwnerRememberJIT(CurrentResourceOwner, PointerGetDatum(context));
+
+ return context;
+}
+
+/* Aggregate JIT instrumentation information */
+void
+InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add)
+{
+ dst->created_functions = add_size(dst->created_functions,
+ add->created_functions);
+
+ INSTR_TIME_ADD(dst->generation_counter, add->generation_counter);
+ INSTR_TIME_ADD(dst->inlining_counter, add->inlining_counter);
+ INSTR_TIME_ADD(dst->optimization_counter, add->optimization_counter);
+ INSTR_TIME_ADD(dst->emission_counter, add->emission_counter);
+}
+
static bool
file_exists(const char *name)
{
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 640c27f..6c6f078 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -102,7 +102,8 @@ static LLVMOrcJITStackRef llvm_opt0_orc;
static LLVMOrcJITStackRef llvm_opt3_orc;
-static void llvm_release_context(JitContext *context);
+static void *llvm_create_provider_context(JitContext *jit_context);
+static void llvm_release_context(void *context);
static void llvm_session_initialize(void);
static void llvm_shutdown(int code, Datum arg);
static void llvm_compile_module(LLVMJitContext *context);
@@ -129,28 +130,39 @@ _PG_jit_provider_init(JitProviderCallbacks *cb)
/*
* Create a context for JITing work.
*
+ * The provider first creates the base context if not already created; and then
+ * creates the provider-specific context.
* The context, including subsidiary resources, will be cleaned up either when
* the context is explicitly released, or when the lifetime of
* CurrentResourceOwner ends (usually the end of the current [sub]xact).
*/
-LLVMJitContext *
-llvm_create_context(int jitFlags)
+JitContext *
+llvm_create_context(JitContext *jit, int flags)
+{
+ if (!jit)
+ jit = jit_create_base_context(flags);
+
+ if (!jit->provider_context)
+ jit->provider_context = llvm_create_provider_context(jit);
+
+ return jit;
+}
+
+/*
+ * Create a provider-specific JIT context.
+ */
+static void *
+llvm_create_provider_context(JitContext *jit_context)
{
LLVMJitContext *context;
llvm_assert_in_fatal_section();
-
llvm_session_initialize();
- ResourceOwnerEnlargeJIT(CurrentResourceOwner);
-
- context = MemoryContextAllocZero(TopMemoryContext,
- sizeof(LLVMJitContext));
- context->base.flags = jitFlags;
+ context = MemoryContextAllocZero(TopMemoryContext, sizeof(LLVMJitContext));
- /* ensure cleanup */
- context->base.resowner = CurrentResourceOwner;
- ResourceOwnerRememberJIT(CurrentResourceOwner, PointerGetDatum(context));
+ /* Have a handle to the base context */
+ context->base = jit_context;
return context;
}
@@ -159,7 +171,7 @@ llvm_create_context(int jitFlags)
* Release resources required by one llvm context.
*/
static void
-llvm_release_context(JitContext *context)
+llvm_release_context(void *context)
{
LLVMJitContext *llvm_context = (LLVMJitContext *) context;
@@ -224,7 +236,7 @@ llvm_expand_funcname(struct LLVMJitContext *context, const char *basename)
{
Assert(context->module != NULL);
- context->base.created_functions++;
+ context->base->jit_instr.created_functions++;
/*
* Previously we used dots to separate, but turns out some tools, e.g.
@@ -422,7 +434,7 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMValueRef func;
int compile_optlevel;
- if (context->base.flags & PGJIT_OPT3)
+ if (context->base->flags & PGJIT_OPT3)
compile_optlevel = 3;
else
compile_optlevel = 0;
@@ -436,7 +448,7 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderSetOptLevel(llvm_pmb, compile_optlevel);
llvm_fpm = LLVMCreateFunctionPassManagerForModule(module);
- if (context->base.flags & PGJIT_OPT3)
+ if (context->base->flags & PGJIT_OPT3)
{
/* TODO: Unscientifically determined threshhold */
LLVMPassManagerBuilderUseInlinerWithThreshold(llvm_pmb, 512);
@@ -469,11 +481,11 @@ llvm_optimize_module(LLVMJitContext *context, LLVMModuleRef module)
LLVMPassManagerBuilderPopulateModulePassManager(llvm_pmb,
llvm_mpm);
/* always use always-inliner pass */
- if (!(context->base.flags & PGJIT_OPT3))
+ if (!(context->base->flags & PGJIT_OPT3))
LLVMAddAlwaysInlinerPass(llvm_mpm);
/* if doing inlining, but no expensive optimization, add inlining pass */
- if (context->base.flags & PGJIT_INLINE
- && !(context->base.flags & PGJIT_OPT3))
+ if (context->base->flags & PGJIT_INLINE
+ && !(context->base->flags & PGJIT_OPT3))
LLVMAddFunctionInliningPass(llvm_mpm);
LLVMRunPassManager(llvm_mpm, context->module);
LLVMDisposePassManager(llvm_mpm);
@@ -493,18 +505,18 @@ llvm_compile_module(LLVMJitContext *context)
instr_time starttime;
instr_time endtime;
- if (context->base.flags & PGJIT_OPT3)
+ if (context->base->flags & PGJIT_OPT3)
compile_orc = llvm_opt3_orc;
else
compile_orc = llvm_opt0_orc;
/* perform inlining */
- if (context->base.flags & PGJIT_INLINE)
+ if (context->base->flags & PGJIT_INLINE)
{
INSTR_TIME_SET_CURRENT(starttime);
llvm_inline(context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.inlining_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base->jit_instr.inlining_counter,
endtime, starttime);
}
@@ -524,7 +536,7 @@ llvm_compile_module(LLVMJitContext *context)
INSTR_TIME_SET_CURRENT(starttime);
llvm_optimize_module(context, context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.optimization_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base->jit_instr.optimization_counter,
endtime, starttime);
if (jit_dump_bitcode)
@@ -575,7 +587,7 @@ llvm_compile_module(LLVMJitContext *context)
}
#endif
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.emission_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base->jit_instr.emission_counter,
endtime, starttime);
context->module = NULL;
@@ -596,9 +608,9 @@ llvm_compile_module(LLVMJitContext *context)
ereport(DEBUG1,
(errmsg("time to inline: %.3fs, opt: %.3fs, emit: %.3fs",
- INSTR_TIME_GET_DOUBLE(context->base.inlining_counter),
- INSTR_TIME_GET_DOUBLE(context->base.optimization_counter),
- INSTR_TIME_GET_DOUBLE(context->base.emission_counter)),
+ INSTR_TIME_GET_DOUBLE(context->base->jit_instr.inlining_counter),
+ INSTR_TIME_GET_DOUBLE(context->base->jit_instr.optimization_counter),
+ INSTR_TIME_GET_DOUBLE(context->base->jit_instr.emission_counter)),
errhidestmt(true),
errhidecontext(true)));
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0f31093..1214214 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -123,19 +123,19 @@ llvm_compile_expr(ExprState *state)
llvm_enter_fatal_on_oom();
/* get or create JIT context */
- if (parent && parent->state->es_jit)
+ if (parent)
{
- context = (LLVMJitContext *) parent->state->es_jit;
+ parent->state->es_jit = llvm_create_context(parent->state->es_jit,
+ parent->state->es_jit_flags);
+ context = parent->state->es_jit->provider_context;
}
else
{
- context = llvm_create_context(parent->state->es_jit_flags);
-
- if (parent)
- {
- parent->state->es_jit = &context->base;
- }
-
+ /*
+ * TODO: The !parent case is not handled correctly even on HEAD. Where
+ * would we get jit_flags from ?
+ */
+ context = (LLVMJitContext *) llvm_create_context(NULL, 0);
}
INSTR_TIME_SET_CURRENT(starttime);
@@ -336,7 +336,7 @@ llvm_compile_expr(ExprState *state)
* function specific to tupledesc and the exact number of
* to-be-extracted attributes.
*/
- if (desc && (context->base.flags & PGJIT_DEFORM))
+ if (desc && (context->base->flags & PGJIT_DEFORM))
{
LLVMValueRef params[1];
LLVMValueRef l_jit_deform;
@@ -2557,7 +2557,7 @@ llvm_compile_expr(ExprState *state)
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.generation_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base->jit_instr.generation_counter,
endtime, starttime);
return true;
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 9b75baa..164cf59 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -81,7 +81,8 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
-extern void ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc);
+extern void ExplainPrintJIT(ExplainState *es, int jit_flags,
+ JitInstrumentation *jit_instr, int worker_i);
extern void ExplainQueryText(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 626a66c..ea1991f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -27,6 +27,7 @@ typedef struct ParallelExecutorInfo
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
+ SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
bool finished; /* set true by ExecParallelFinish */
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index b451f40..c5ea103 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -23,14 +23,8 @@
#define PGJIT_EXPR (1 << 3)
#define PGJIT_DEFORM (1 << 4)
-
-typedef struct JitContext
+typedef struct JitInstrumentation
{
- /* see PGJIT_* above */
- int flags;
-
- ResourceOwner resowner;
-
/* number of emitted functions */
size_t created_functions;
@@ -45,6 +39,27 @@ typedef struct JitContext
/* accumulated time for code emission */
instr_time emission_counter;
+} JitInstrumentation;
+
+/*
+ * DSM structure for accumulating jit instrumentation of all workers.
+ *
+ * num_workers: Number of workers.
+ */
+typedef struct SharedJitInstrumentation
+{
+ int num_workers;
+ JitInstrumentation jit_instr[FLEXIBLE_ARRAY_MEMBER];
+ /* array of JitInstrumentation objects * num_workers follows */
+} SharedJitInstrumentation;
+
+typedef struct JitContext
+{
+ /* see PGJIT_* above */
+ int flags;
+ ResourceOwner resowner;
+ JitInstrumentation jit_instr;
+ void *provider_context;
} JitContext;
typedef struct JitProviderCallbacks JitProviderCallbacks;
@@ -52,13 +67,15 @@ typedef struct JitProviderCallbacks JitProviderCallbacks;
extern void _PG_jit_provider_init(JitProviderCallbacks *cb);
typedef void (*JitProviderInit) (JitProviderCallbacks *cb);
typedef void (*JitProviderResetAfterErrorCB) (void);
-typedef void (*JitProviderReleaseContextCB) (JitContext *context);
+typedef JitContext * (*JitProviderCreateContextCB) (int flags);
+typedef void (*JitProviderReleaseContextCB) (void *context);
struct ExprState;
typedef bool (*JitProviderCompileExprCB) (struct ExprState *state);
struct JitProviderCallbacks
{
JitProviderResetAfterErrorCB reset_after_error;
+ JitProviderCreateContextCB create_context;
JitProviderReleaseContextCB release_context;
JitProviderCompileExprCB compile_expr;
};
@@ -85,6 +102,8 @@ extern void jit_release_context(JitContext *context);
* not be able to perform JIT (i.e. return false).
*/
extern bool jit_compile_expr(struct ExprState *state);
+extern JitContext *jit_create_base_context(int flags);
+extern void InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add);
#endif /* JIT_H */
diff --git a/src/include/jit/llvmjit.h b/src/include/jit/llvmjit.h
index c81cff8..e9ef624 100644
--- a/src/include/jit/llvmjit.h
+++ b/src/include/jit/llvmjit.h
@@ -37,7 +37,7 @@ extern "C"
typedef struct LLVMJitContext
{
- JitContext base;
+ JitContext *base;
/* number of modules created */
size_t module_generation;
@@ -91,7 +91,7 @@ extern void llvm_leave_fatal_on_oom(void);
extern void llvm_reset_after_error(void);
extern void llvm_assert_in_fatal_section(void);
-extern LLVMJitContext *llvm_create_context(int jitFlags);
+extern JitContext *llvm_create_context(JitContext *jit, int flags);
extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
extern char *llvm_expand_funcname(LLVMJitContext *context, const char *basename);
extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 687d7cd..51a3611 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -18,6 +18,7 @@
#include "access/heapam.h"
#include "access/tupconvert.h"
#include "executor/instrument.h"
+#include "jit/jit.h"
#include "lib/pairingheap.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
@@ -923,6 +924,9 @@ typedef struct PlanState
Instrumentation *instrument; /* Optional runtime stats for this node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ /* Per-worker JIT instrumentation */
+ SharedJitInstrumentation *worker_jit_instrument;
+
/*
* Common structural data for all Plan types. These links to subsidiary
* state trees parallel links in the associated plan tree (except for the
On 18 September 2018 at 03:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
I think we better show per-worker jit info also.
Just to throw a contrarian opinion into this: I find the current EXPLAIN
output for JIT to be insanely verbose already.
One option is to display the jit info on a single line, with the
optimization and inlining counters shown only when the corresponding
optimization was attempted. Currently they are shown even if they are
disabled.
In the patch, I have kept the per-worker output verbose-only.
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2018-09-17 17:50:15 -0400, Tom Lane wrote:
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
On 11 September 2018 at 14:50, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On 10 September 2018 at 21:39, Andres Freund <andres@anarazel.de> wrote:
On 2018-09-10 15:42:55 +0530, Amit Khandekar wrote:
I think we better show per-worker jit info also.
Just to throw a contrarian opinion into this: I find the current EXPLAIN
output for JIT to be insanely verbose already.
Hm, it'd have been nice to get that feedback a little bit earlier, I did
inquire...
Currently:
JIT:
Functions: 2
Generation Time: 0.680 ms
Inlining: true
Inlining Time: 7.591 ms
Optimization: true
Optimization Time: 20.522 ms
Emission Time: 14.607 ms
How about making that:
JIT:
Functions: 2
Options: Inlining, Optimization
Times (Total, Generation, Inlining, Optimization, Emission): 43.4 ms, 0.680 ms, 7.591 ms, 20.522 ms, 14.607 ms
or something similar?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2018-09-17 17:50:15 -0400, Tom Lane wrote:
Just to throw a contrarian opinion into this: I find the current EXPLAIN
output for JIT to be insanely verbose already.
Hm, it'd have been nice to get that feedback a little bit earlier, I did
inquire...
Currently:
JIT:
Functions: 2
Generation Time: 0.680 ms
Inlining: true
Inlining Time: 7.591 ms
Optimization: true
Optimization Time: 20.522 ms
Emission Time: 14.607 ms
Just to clarify, that seems perfectly fine for the "machine readable"
output formats. I'd just like fewer lines in the "human readable"
output.
How about making that:
JIT:
Functions: 2
Options: Inlining, Optimization
Times (Total, Generation, Inlining, Optimization, Emission): 43.4 ms, 0.680 ms, 7.591 ms, 20.522 ms, 14.607 ms
or something similar?
That's going in the right direction. Personally I'd make the last line
more like
Times: generation 0.680 ms, inlining 7.591 ms, optimization 20.522 ms, emission 14.607 ms, total 43.4 ms
(total at the end seems more natural to me, YMMV). Also, the "options"
format you suggest here seems a bit too biased towards binary on/off
options --- what happens when there's a three-way option? So maybe that
line should be like
Options: inlining on, optimization on
though I'm less sure about that part.
regards, tom lane
On 2018-09-19 23:26:52 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2018-09-17 17:50:15 -0400, Tom Lane wrote:
Just to throw a contrarian opinion into this: I find the current EXPLAIN
output for JIT to be insanely verbose already.
Hm, it'd have been nice to get that feedback a little bit earlier, I did
inquire...
Currently:
JIT:
Functions: 2
Generation Time: 0.680 ms
Inlining: true
Inlining Time: 7.591 ms
Optimization: true
Optimization Time: 20.522 ms
Emission Time: 14.607 ms
Just to clarify, that seems perfectly fine for the "machine readable"
output formats. I'd just like fewer lines in the "human readable"
output.
Yea, I do think that's a fair complaint.
How about making that:
JIT:
Functions: 2
FWIW, not that I want to do that now, but at some point it might make
sense to sub-divide this into things like number of "expressions",
"tuple deforming", "plans", ... Just mentioning that if somebody wants
to comment on reformatting this as well, if we're tinkering anyway.
Options: Inlining, Optimization
Times (Total, Generation, Inlining, Optimization, Emission): 43.4 ms, 0.680 ms, 7.591 ms, 20.522 ms, 14.607 ms
or something similar?
That's going in the right direction. Personally I'd make the last line
more like
Times: generation 0.680 ms, inlining 7.591 ms, optimization 20.522 ms, emission 14.607 ms, total 43.4 ms
Yea, that's probably easier to read.
(total at the end seems more natural to me, YMMV).
I kind of think doing it first is best, because that's usually the first
thing one wants to know.
Also, the "options" format you suggest here seems a bit too biased
towards binary on/off options --- what happens when there's a
three-way option? So maybe that line should be like
Options: inlining on, optimization on
though I'm less sure about that part.
I'm pretty certain you're right :). There's already arguments around
making optimization more gradual (akin to O1,2,3).
Greetings,
Andres Freund
On Thu, 20 Sep 2018 at 05:39, Andres Freund <andres@anarazel.de> wrote:
On 2018-09-19 23:26:52 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2018-09-17 17:50:15 -0400, Tom Lane wrote:
Just to throw a contrarian opinion into this: I find the current
EXPLAIN
output for JIT to be insanely verbose already.
Hm, it'd have been nice to get that feedback a little bit earlier, I
did
inquire...
Currently:
JIT:
Functions: 2
Generation Time: 0.680 ms
Inlining: true
Inlining Time: 7.591 ms
Optimization: true
Optimization Time: 20.522 ms
Emission Time: 14.607 ms
Just to clarify, that seems perfectly fine for the "machine readable"
output formats. I'd just like fewer lines in the "human readable"
output.
Yea, I do think that's a fair complaint.
How about making that:
JIT:
Functions: 2
FWIW, not that I want to do that now, but at some point it might make
sense to sub-divide this into things like number of "expressions",
"tuple deforming", "plans", ... Just mentioning that if somebody wants
to comment on reformatting this as well, if we're tinkering anyway.
Options: Inlining, Optimization
Times (Total, Generation, Inlining, Optimization, Emission): 43.4 ms, 0.680 ms, 7.591 ms, 20.522 ms, 14.607 ms
+1
Pavel
or something similar?
That's going in the right direction. Personally I'd make the last line
more like
Times: generation 0.680 ms, inlining 7.591 ms, optimization 20.522 ms, emission 14.607 ms, total 43.4 ms
Yea, that's probably easier to read.
(total at the end seems more natural to me, YMMV).
I kind of think doing it first is best, because that's usually the first
thing one wants to know.
Also, the "options" format you suggest here seems a bit too biased
towards binary on/off options --- what happens when there's a
three-way option? So maybe that line should be like
Options: inlining on, optimization on
though I'm less sure about that part.
I'm pretty certain you're right :). There's already arguments around
making optimization more gradual (akin to O1,2,3).
Greetings,
Andres Freund
Greetings,
* Andres Freund (andres@anarazel.de) wrote:
On 2018-09-19 23:26:52 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
JIT:
Functions: 2
Generation Time: 0.680 ms
Inlining: true
Inlining Time: 7.591 ms
Optimization: true
Optimization Time: 20.522 ms
Emission Time: 14.607 ms
[...]
How about making that:
JIT:
Functions: 2
FWIW, not that I want to do that now, but at some point it might make
sense to sub-divide this into things like number of "expressions",
"tuple deforming", "plans", ... Just mentioning that if somebody wants
to comment on reformatting this as well, if we're tinkering anyway.
I'd actually think we'd maybe want some kind of 'verbose' mode which
shows exactly what got JIT'd and what didn't- one of the questions that
I think people will be asking is "why didn't X get JIT'd?" and I don't
think that's very easy to figure out currently.
Options: Inlining, Optimization
Times (Total, Generation, Inlining, Optimization, Emission): 43.4 ms, 0.680 ms, 7.591 ms, 20.522 ms, 14.607 ms
or something similar?
That's going in the right direction. Personally I'd make the last line
more like
Times: generation 0.680 ms, inlining 7.591 ms, optimization 20.522 ms, emission 14.607 ms, total 43.4 ms
Yea, that's probably easier to read.
I tend to agree that it's easier to read but I'm not sure we need to
quite go that far in reducing the number of rows.
(total at the end seems more natural to me, YMMV).
I agree with this..
I kind of think doing it first is best, because that's usually the first
thing one wants to know.
and this, so what about:
Times:
generation 0.680 ms, inlining 7.591 ms, optimization 20.522 ms, emission 14.607 ms
Total: 43.4 ms
Gets the total out there quick on the left where you're scanning down
while keeping the detailed info above for reviewing after.
Also, the "options" format you suggest here seems a bit too biased
towards binary on/off options --- what happens when there's a
three-way option? So maybe that line should be like
Options: inlining on, optimization on
though I'm less sure about that part.
I'm pretty certain you're right :). There's already arguments around
making optimization more gradual (akin to O1,2,3).
That certainly sounds like it'd be very neat to have though I wonder how
well we'll be able to automatically plan out which optimization level to
use when..
Thanks!
Stephen
On 2018-09-20 09:07:21 -0400, Stephen Frost wrote:
Greetings,
* Andres Freund (andres@anarazel.de) wrote:
On 2018-09-19 23:26:52 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
JIT:
Functions: 2
Generation Time: 0.680 ms
Inlining: true
Inlining Time: 7.591 ms
Optimization: true
Optimization Time: 20.522 ms
Emission Time: 14.607 ms
[...]
How about making that:
JIT:
Functions: 2
FWIW, not that I want to do that now, but at some point it might make
sense to sub-divide this into things like number of "expressions",
"tuple deforming", "plans", ... Just mentioning that if somebody wants
to comment on reformatting this as well, if we're tinkering anyway.
I'd actually think we'd maybe want some kind of 'verbose' mode which
shows exactly what got JIT'd and what didn't- one of the questions that
I think people will be asking is "why didn't X get JIT'd?" and I don't
think that's very easy to figure out currently.
That seems largely a separate discussion / feature though, right? I'm
not entirely clear what precisely you mean by "why didn't X get
JIT'd?" - currently that's a whole query decision.
I'm pretty certain you're right :). There's already arguments around
making optimization more gradual (akin to O1,2,3).
That certainly sounds like it'd be very neat to have though I wonder how
well we'll be able to automatically plan out which optimization level to
use when..
Well, that's not really different from having to decide whether to use
JIT or not. I suspect that once / if we get caching and/or background
JIT compilation, we can get a lot more creative around this.
Greetings,
Andres Freund
Greetings,
* Andres Freund (andres@anarazel.de) wrote:
On 2018-09-20 09:07:21 -0400, Stephen Frost wrote:
* Andres Freund (andres@anarazel.de) wrote:
FWIW, not that I want to do that now, but at some point it might make
sense to sub-divide this into things like number of "expressions",
"tuple deforming", "plans", ... Just mentioning that if somebody wants
to comment on reformatting this as well, if we're tinkering anyway.
I'd actually think we'd maybe want some kind of 'verbose' mode which
shows exactly what got JIT'd and what didn't- one of the questions that
I think people will be asking is "why didn't X get JIT'd?" and I don't
think that's very easy to figure out currently.
That seems largely a separate discussion / feature though, right? I'm
not entirely clear what precisely you mean with "why didn't X get
JIT'd?" - currently that's a whole query decision.
There are things that can't currently be JIT'd and, I'm thinking anyway,
we'd JIT what we are able to and report what was JIT'd in some
fashion, which would tell a user what could be JIT'd and what
couldn't. I'll admit, perhaps this is more of a debugging tool than
something that an end-user would find useful, unless there's possibly
some way that a user could affect what happens.
I'm pretty certain you're right :). There's already arguments around
making optimization more gradual (akin to O1,2,3).
That certainly sounds like it'd be very neat to have though I wonder how
well we'll be able to automatically plan out which optimization level to
use when..
Well, that's not really different from having to decide whether to use
JIT or not. I suspect that once / if we get caching and/or background
JIT compilation, we can get a lot more creative around this.
Yeah, I definitely understand that; caching, in particular, I would
expect to play a very large role.
Thanks!
Stephen
Hi,
On 2018-09-19 20:39:22 -0700, Andres Freund wrote:
On 2018-09-19 23:26:52 -0400, Tom Lane wrote:
That's going in the right direction. Personally I'd make the last line
more like
Times: generation 0.680 ms, inlining 7.591 ms, optimization 20.522 ms, emission 14.607 ms, total 43.4 ms
Yea, that's probably easier to read.
I'm wondering about upper-casing the individual times (and options) -
we're largely upper-casing properties, and for json/xml output each
would still be a property. Seems a tad bit more consistent. I now have:
FORMAT text:
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.298 ms, Inlining 2.250 ms, Optimization 5.797 ms, Emission 5.246 ms, Total 13.591 ms
FORMAT xml:
<JIT>
<Functions>2</Functions>
<Options>
<Inlining>true</Inlining>
<Optimization>true</Optimization>
<Expressions>true</Expressions>
<Deforming>true</Deforming>
</Options>
<Timing>
<Generation>0.651</Generation>
<Inlining>2.260</Inlining>
<Optimization>14.752</Optimization>
<Emission>7.764</Emission>
<Total>25.427</Total>
</Timing>
</JIT>
FORMAT json:
"JIT": {
"Functions": 2,
"Options": {
"Inlining": true,
"Optimization": true,
"Expressions": true,
"Deforming": true
},
"Timing": {
"Generation": 0.238,
"Inlining": 0.807,
"Optimization": 4.661,
"Emission": 4.236,
"Total": 9.942
}
},
(total at the end seems more natural to me, YMMV).
I kind of think doing it first is best, because that's usually the first
thing one wants to know.
Also, the "options" format you suggest here seems a bit too biased
towards binary on/off options --- what happens when there's a
three-way option? So maybe that line should be like
Options: inlining on, optimization on
though I'm less sure about that part.
Now that space is less of a concern, I added expressions and deforming
as additional options - seems reasonable to have all PGJIT_* options
imo.
Btw, I chose true/false rather than on/off, to be consistent with
ExplainPropertyBool - but I've no strong feelings about it.
Greetings,
Andres Freund
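Since "JIT" is a regular property group in the structured formats above, the
numbers can also be consumed programmatically. A minimal sketch, assuming the
JSON-format plan has been stored in a jsonb value (the table and column names
below, plans and plan_json, are hypothetical):
-- EXPLAIN (FORMAT JSON) yields an array with a single object, so element 0
-- carries the top-level properties, including "JIT".
SELECT plan_json -> 0 -> 'JIT' -> 'Timing' ->> 'Total'     AS jit_total_ms,
       plan_json -> 0 -> 'JIT' -> 'Options' ->> 'Inlining' AS inlining
FROM plans;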
On 2018-09-18 10:03:02 +0530, Amit Khandekar wrote:
Attached v3 patch that does the above change.
Attached is a revised version of that patch. I've changed quite a few
things:
- I've reverted the split of "base" and "provider specific" contexts - I
don't think it really buys us anything here.
- I've reverted the context creation changes - instead of creating a
context in the leader just to store instrumentation in the worker,
there's now a new EState->es_jit_combined_instr.
- That also means worker instrumentation doesn't get folded into the
leader's instrumentation. This seems good for the future and for
extensions - it's not actually "linear" time that's spent doing
JIT in workers (& leader), as all of that work happens in
parallel. Being able to disentangle that seems important.
- Only allocate / transport JIT instrumentation if instrumentation is
enabled.
- Smaller cleanups.
Forcing parallelism and JIT to be on, I get:
postgres[20216][1]=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT count(*) FROM pg_class;
Finalize Aggregate (cost=15.43..15.44 rows=1 width=8) (actual time=183.282..183.282 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=13
-> Gather (cost=15.41..15.42 rows=2 width=8) (actual time=152.835..152.904 rows=2 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
JIT for worker 0:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.480 ms, Inlining 78.789 ms, Optimization 5.797 ms, Emission 5.554 ms, Total 90.620 ms
JIT for worker 1:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.475 ms, Inlining 78.632 ms, Optimization 5.954 ms, Emission 5.900 ms, Total 90.962 ms
Buffers: shared hit=13
-> Partial Aggregate (cost=15.41..15.42 rows=1 width=8) (actual time=90.608..90.609 rows=1 loops=2)
Output: PARTIAL count(*)
Buffers: shared hit=13
Worker 0: actual time=90.426..90.426 rows=1 loops=1
Buffers: shared hit=7
Worker 1: actual time=90.791..90.792 rows=1 loops=1
Buffers: shared hit=6
-> Parallel Seq Scan on pg_catalog.pg_class (cost=0.00..14.93 rows=193 width=0) (actual time=0.006..0.048 rows=193 loops=2)
Output: relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, reltoastrelid, relhasindex, relisshared, relpersistence, relkind, relnatts, relchecks, relhasoids, relhasrules, relhastriggers, relhassubclass, relrowsecurity, relfo
Buffers: shared hit=13
Worker 0: actual time=0.006..0.047 rows=188 loops=1
Buffers: shared hit=7
Worker 1: actual time=0.006..0.048 rows=198 loops=1
Buffers: shared hit=6
Planning Time: 0.152 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.955 ms, Inlining 157.422 ms, Optimization 11.751 ms, Emission 11.454 ms, Total 181.582 ms
Execution Time: 184.197 ms
Instead of "JIT for worker 0" we could instead do "Worker 0: JIT", but
I'm not sure that's better, given that the JIT group is multi-line
output.
Currently structured formats show this as:
"JIT": {
"Worker Number": 1,
"Functions": 2,
"Options": {
"Inlining": true,
"Optimization": true,
"Expressions": true,
"Deforming": true
},
"Timing": {
"Generation": 0.374,
"Inlining": 70.341,
"Optimization": 8.006,
"Emission": 7.670,
"Total": 86.390
}
},
This needs a bit more polish tomorrow, but I'm starting to like where
this is going. Comments?
Greetings,
Andres Freund
Attachments:
0001-Collect-JIT-instrumentation-from-workers.patch (text/x-diff)
From dd7d504df9fca0ee0cc0a48930980cda2ad8be1d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 25 Sep 2018 01:46:34 -0700
Subject: [PATCH] Collect JIT instrumentation from workers.
Author: Amit Khandekar and Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/CAJ3gD9eLrz51RK_gTkod+71iDcjpB_N8eC6vU2AW-VicsAERpQ@mail.gmail.com
Backpatch: 11-
---
contrib/auto_explain/auto_explain.c | 6 +-
src/backend/commands/explain.c | 87 +++++++++++++++++++----------
src/backend/executor/execMain.c | 15 +++++
src/backend/executor/execParallel.c | 82 +++++++++++++++++++++++++++
src/backend/jit/jit.c | 11 ++++
src/backend/jit/llvm/llvmjit.c | 14 ++---
src/backend/jit/llvm/llvmjit_expr.c | 2 +-
src/include/commands/explain.h | 3 +-
src/include/executor/execParallel.h | 1 +
src/include/jit/jit.h | 27 +++++++--
src/include/nodes/execnodes.h | 8 +++
11 files changed, 209 insertions(+), 47 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 0c0eb3fb9e0..a19c1d2a3c3 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -362,9 +362,9 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
ExplainPrintPlan(es, queryDesc);
if (es->analyze && auto_explain_log_triggers)
ExplainPrintTriggers(es, queryDesc);
- if (queryDesc->estate->es_jit && es->costs &&
- queryDesc->estate->es_jit->created_functions > 0)
- ExplainPrintJIT(es, queryDesc);
+ if (es->costs)
+ ExplainPrintJIT(es, queryDesc->estate->es_jit_flags,
+ queryDesc->estate->es_jit_combined_instr, -1);
ExplainEndOutput(es);
/* Remove last line break */
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 01520154e48..2ecd278f7e3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -563,9 +563,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
* depending on build options. Might want to separate that out from COSTS
* at a later stage.
*/
- if (queryDesc->estate->es_jit && es->costs &&
- queryDesc->estate->es_jit->created_functions > 0)
- ExplainPrintJIT(es, queryDesc);
+ if (es->costs)
+ ExplainPrintJIT(es, queryDesc->estate->es_jit_flags,
+ queryDesc->estate->es_jit_combined_instr, -1);
/*
* Close down the query and free resources. Include time for this in the
@@ -690,20 +690,26 @@ ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc)
/*
* ExplainPrintJIT -
- * Append information about JITing to es->str.
+ * Append information about JITing to es->str. Can be used to print
+ * the JIT instrumentation of the backend or that of a specific worker.
*/
void
-ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc)
+ExplainPrintJIT(ExplainState *es, int jit_flags,
+ JitInstrumentation *ji, int worker_num)
{
- JitContext *jc = queryDesc->estate->es_jit;
instr_time total_time;
+ bool for_workers = (worker_num >= 0);
+
+ /* don't print information if no JITing happened */
+ if (!ji || ji->created_functions == 0)
+ return;
/* calculate total time */
INSTR_TIME_SET_ZERO(total_time);
- INSTR_TIME_ADD(total_time, jc->generation_counter);
- INSTR_TIME_ADD(total_time, jc->inlining_counter);
- INSTR_TIME_ADD(total_time, jc->optimization_counter);
- INSTR_TIME_ADD(total_time, jc->emission_counter);
+ INSTR_TIME_ADD(total_time, ji->generation_counter);
+ INSTR_TIME_ADD(total_time, ji->inlining_counter);
+ INSTR_TIME_ADD(total_time, ji->optimization_counter);
+ INSTR_TIME_ADD(total_time, ji->emission_counter);
ExplainOpenGroup("JIT", "JIT", true, es);
@@ -711,27 +717,30 @@ ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc)
if (es->format == EXPLAIN_FORMAT_TEXT)
{
appendStringInfoSpaces(es->str, es->indent * 2);
- appendStringInfo(es->str, "JIT:\n");
+ if (for_workers)
+ appendStringInfo(es->str, "JIT for worker %u:\n", worker_num);
+ else
+ appendStringInfo(es->str, "JIT:\n");
es->indent += 1;
- ExplainPropertyInteger("Functions", NULL, jc->created_functions, es);
+ ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);
appendStringInfoSpaces(es->str, es->indent * 2);
appendStringInfo(es->str, "Options: %s %s, %s %s, %s %s, %s %s\n",
- "Inlining", jc->flags & PGJIT_INLINE ? "true" : "false",
- "Optimization", jc->flags & PGJIT_OPT3 ? "true" : "false",
- "Expressions", jc->flags & PGJIT_EXPR ? "true" : "false",
- "Deforming", jc->flags & PGJIT_DEFORM ? "true" : "false");
+ "Inlining", jit_flags & PGJIT_INLINE ? "true" : "false",
+ "Optimization", jit_flags & PGJIT_OPT3 ? "true" : "false",
+ "Expressions", jit_flags & PGJIT_EXPR ? "true" : "false",
+ "Deforming", jit_flags & PGJIT_DEFORM ? "true" : "false");
if (es->analyze && es->timing)
{
appendStringInfoSpaces(es->str, es->indent * 2);
appendStringInfo(es->str,
"Timing: %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms, %s %.3f ms\n",
- "Generation", 1000.0 * INSTR_TIME_GET_DOUBLE(jc->generation_counter),
- "Inlining", 1000.0 * INSTR_TIME_GET_DOUBLE(jc->inlining_counter),
- "Optimization", 1000.0 * INSTR_TIME_GET_DOUBLE(jc->optimization_counter),
- "Emission", 1000.0 * INSTR_TIME_GET_DOUBLE(jc->emission_counter),
+ "Generation", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->generation_counter),
+ "Inlining", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->inlining_counter),
+ "Optimization", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->optimization_counter),
+ "Emission", 1000.0 * INSTR_TIME_GET_DOUBLE(ji->emission_counter),
"Total", 1000.0 * INSTR_TIME_GET_DOUBLE(total_time));
}
@@ -739,13 +748,14 @@ ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc)
}
else
{
- ExplainPropertyInteger("Functions", NULL, jc->created_functions, es);
+ ExplainPropertyInteger("Worker Number", NULL, worker_num, es);
+ ExplainPropertyInteger("Functions", NULL, ji->created_functions, es);
ExplainOpenGroup("Options", "Options", true, es);
- ExplainPropertyBool("Inlining", jc->flags & PGJIT_INLINE, es);
- ExplainPropertyBool("Optimization", jc->flags & PGJIT_OPT3, es);
- ExplainPropertyBool("Expressions", jc->flags & PGJIT_EXPR, es);
- ExplainPropertyBool("Deforming", jc->flags & PGJIT_DEFORM, es);
+ ExplainPropertyBool("Inlining", jit_flags & PGJIT_INLINE, es);
+ ExplainPropertyBool("Optimization", jit_flags & PGJIT_OPT3, es);
+ ExplainPropertyBool("Expressions", jit_flags & PGJIT_EXPR, es);
+ ExplainPropertyBool("Deforming", jit_flags & PGJIT_DEFORM, es);
ExplainCloseGroup("Options", "Options", true, es);
if (es->analyze && es->timing)
@@ -753,16 +763,16 @@ ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc)
ExplainOpenGroup("Timing", "Timing", true, es);
ExplainPropertyFloat("Generation", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->generation_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(ji->generation_counter),
3, es);
ExplainPropertyFloat("Inlining", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->inlining_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(ji->inlining_counter),
3, es);
ExplainPropertyFloat("Optimization", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->optimization_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(ji->optimization_counter),
3, es);
ExplainPropertyFloat("Emission", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(jc->emission_counter),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(ji->emission_counter),
3, es);
ExplainPropertyFloat("Total", "ms",
1000.0 * INSTR_TIME_GET_DOUBLE(total_time),
@@ -1554,6 +1564,25 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyInteger("Workers Launched", NULL,
nworkers, es);
}
+
+ /*
+ * Print per-worker Jit instrumentation. Use same conditions
+ * as for the leader's JIT instrumentation, see comment there.
+ */
+ if (es->costs && es->verbose &&
+ outerPlanState(planstate)->worker_jit_instrument)
+ {
+ PlanState *child = outerPlanState(planstate);
+ int n;
+ SharedJitInstrumentation *w = child->worker_jit_instrument;
+
+ for (n = 0; n < w->num_workers; ++n)
+ {
+ ExplainPrintJIT(es, child->state->es_jit_flags,
+ &w->jit_instr[n], n);
+ }
+ }
+
if (gather->single_copy || es->format != EXPLAIN_FORMAT_TEXT)
ExplainPropertyBool("Single Copy", gather->single_copy, es);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 85d980356b7..e36f58a0151 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -48,6 +48,7 @@
#include "executor/execdebug.h"
#include "executor/nodeSubplan.h"
#include "foreign/fdwapi.h"
+#include "jit/jit.h"
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "optimizer/clauses.h"
@@ -494,6 +495,20 @@ standard_ExecutorEnd(QueryDesc *queryDesc)
ExecEndPlan(queryDesc->planstate, estate);
+ /*
+ * If this process has done JIT, either merge stats into worker stats, or
+ * this process' stats as the global stats if no workers did JIT.
+ */
+ if (estate->es_instrument && queryDesc->estate->es_jit)
+ {
+ if (queryDesc->estate->es_jit_combined_instr)
+ InstrJitAgg(queryDesc->estate->es_jit_combined_instr,
+ &queryDesc->estate->es_jit->instr);
+ else
+ queryDesc->estate->es_jit_combined_instr =
+ &queryDesc->estate->es_jit->instr;
+ }
+
/* do away with our snapshots */
UnregisterSnapshot(estate->es_snapshot);
UnregisterSnapshot(estate->es_crosscheck_snapshot);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c93084e4d2a..84ee0098c30 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -37,6 +37,7 @@
#include "executor/nodeSort.h"
#include "executor/nodeSubplan.h"
#include "executor/tqueue.h"
+#include "jit/jit.h"
#include "nodes/nodeFuncs.h"
#include "optimizer/planmain.h"
#include "optimizer/planner.h"
@@ -62,6 +63,7 @@
#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
+#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -573,9 +575,11 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *paramlistinfo_space;
BufferUsage *bufusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
+ SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
int paramlistinfo_len;
int instrumentation_len = 0;
+ int jit_instrumentation_len;
int instrument_offset = 0;
Size dsa_minsize = dsa_minimum_size();
char *query_string;
@@ -669,6 +673,17 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ /* Estimate space for JIT instrumentation, if required. */
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation_len =
+ offsetof(SharedJitInstrumentation, jit_instr) +
+ sizeof(JitInstrumentation) * nworkers;
+ jit_instrumentation_len = MAXALIGN(jit_instrumentation_len);
+ shm_toc_estimate_chunk(&pcxt->estimator, jit_instrumentation_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
}
/* Estimate space for DSA area. */
@@ -742,6 +757,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
+
+ if (estate->es_jit_flags != PGJIT_NONE)
+ {
+ jit_instrumentation = shm_toc_allocate(pcxt->toc,
+ jit_instrumentation_len);
+ jit_instrumentation->num_workers = nworkers;
+ memset(jit_instrumentation->jit_instr, 0,
+ sizeof(JitInstrumentation) * nworkers);
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ jit_instrumentation);
+ pei->jit_instrumentation = jit_instrumentation;
+ }
}
/*
@@ -1003,6 +1030,45 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrumentation);
}
+/*
+ * Add up the workers' JIT instrumentation from dynamic shared memory.
+ */
+static void
+ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
+ SharedJitInstrumentation *shared_jit)
+{
+ JitInstrumentation *combined;
+ int ibytes;
+
+ int n;
+
+ /*
+ * Accumulate worker JIT instrumentation into the combined JIT, allocating
+ * it if required. Note this is separate from the leader instrumentation.
+ */
+ if (!planstate->state->es_jit_combined_instr)
+ planstate->state->es_jit_combined_instr =
+ MemoryContextAllocZero(planstate->state->es_query_cxt, sizeof(JitInstrumentation));
+ combined = planstate->state->es_jit_combined_instr;
+
+ /* Accumulate all the workers' instrumentations. */
+ for (n = 0; n < shared_jit->num_workers; ++n)
+ InstrJitAgg(combined, &shared_jit->jit_instr[n]);
+
+ /*
+ * Store the per-worker detail.
+ *
+ * Similar to ExecParallelRetrieveInstrumentation(), allocate the
+ * instrumentation in per-query context.
+ */
+ ibytes = offsetof(SharedJitInstrumentation, jit_instr)
+ + mul_size(shared_jit->num_workers, sizeof(JitInstrumentation));
+ planstate->worker_jit_instrument =
+ MemoryContextAlloc(planstate->state->es_query_cxt, ibytes);
+
+ memcpy(planstate->worker_jit_instrument, shared_jit, ibytes);
+}
+
/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer usage.
@@ -1068,6 +1134,11 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ /* Accumulate JIT instrumentation, if any. */
+ if (pei->jit_instrumentation)
+ ExecParallelRetrieveJitInstrumentation(pei->planstate,
+ pei->jit_instrumentation);
+
/* Free any serialized parameters. */
if (DsaPointerIsValid(pei->param_exec))
{
@@ -1274,6 +1345,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
+ SharedJitInstrumentation *jit_instrumentation;
int instrument_options = 0;
void *area_space;
dsa_area *area;
@@ -1287,6 +1359,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
+ jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
+ true);
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
/* Setting debug_query_string for individual workers */
@@ -1350,6 +1424,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecParallelReportInstrumentation(queryDesc->planstate,
instrumentation);
+ /* Report JIT instrumentation data if any */
+ if (queryDesc->estate->es_jit && jit_instrumentation != NULL)
+ {
+ Assert(ParallelWorkerNumber < jit_instrumentation->num_workers);
+ jit_instrumentation->jit_instr[ParallelWorkerNumber] =
+ queryDesc->estate->es_jit->instr;
+ }
+
/* Must do this after capturing instrumentation. */
ExecutorEnd(queryDesc);
diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index 7d3883d3bf4..5d1f2e57be1 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -182,6 +182,17 @@ jit_compile_expr(struct ExprState *state)
return false;
}
+/* Aggregate JIT instrumentation information */
+void
+InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add)
+{
+ dst->created_functions += add->created_functions;
+ INSTR_TIME_ADD(dst->generation_counter, add->generation_counter);
+ INSTR_TIME_ADD(dst->inlining_counter, add->inlining_counter);
+ INSTR_TIME_ADD(dst->optimization_counter, add->optimization_counter);
+ INSTR_TIME_ADD(dst->emission_counter, add->emission_counter);
+}
+
static bool
file_exists(const char *name)
{
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 640c27fc408..7510698f863 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -224,7 +224,7 @@ llvm_expand_funcname(struct LLVMJitContext *context, const char *basename)
{
Assert(context->module != NULL);
- context->base.created_functions++;
+ context->base.instr.created_functions++;
/*
* Previously we used dots to separate, but turns out some tools, e.g.
@@ -504,7 +504,7 @@ llvm_compile_module(LLVMJitContext *context)
INSTR_TIME_SET_CURRENT(starttime);
llvm_inline(context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.inlining_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.instr.inlining_counter,
endtime, starttime);
}
@@ -524,7 +524,7 @@ llvm_compile_module(LLVMJitContext *context)
INSTR_TIME_SET_CURRENT(starttime);
llvm_optimize_module(context, context->module);
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.optimization_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.instr.optimization_counter,
endtime, starttime);
if (jit_dump_bitcode)
@@ -575,7 +575,7 @@ llvm_compile_module(LLVMJitContext *context)
}
#endif
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.emission_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.instr.emission_counter,
endtime, starttime);
context->module = NULL;
@@ -596,9 +596,9 @@ llvm_compile_module(LLVMJitContext *context)
ereport(DEBUG1,
(errmsg("time to inline: %.3fs, opt: %.3fs, emit: %.3fs",
- INSTR_TIME_GET_DOUBLE(context->base.inlining_counter),
- INSTR_TIME_GET_DOUBLE(context->base.optimization_counter),
- INSTR_TIME_GET_DOUBLE(context->base.emission_counter)),
+ INSTR_TIME_GET_DOUBLE(context->base.instr.inlining_counter),
+ INSTR_TIME_GET_DOUBLE(context->base.instr.optimization_counter),
+ INSTR_TIME_GET_DOUBLE(context->base.instr.emission_counter)),
errhidestmt(true),
errhidecontext(true)));
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 0f3109334e8..7454d05acaf 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2557,7 +2557,7 @@ llvm_compile_expr(ExprState *state)
llvm_leave_fatal_on_oom();
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(context->base.generation_counter,
+ INSTR_TIME_ACCUM_DIFF(context->base.instr.generation_counter,
endtime, starttime);
return true;
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 9b75baae6e6..b0b6f1e15a4 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -81,7 +81,8 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
-extern void ExplainPrintJIT(ExplainState *es, QueryDesc *queryDesc);
+extern void ExplainPrintJIT(ExplainState *es, int jit_flags,
+ struct JitInstrumentation *jit_instr, int worker_i);
extern void ExplainQueryText(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 626a66c27a0..1b4c35e5f14 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -27,6 +27,7 @@ typedef struct ParallelExecutorInfo
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
+ struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
bool finished; /* set true by ExecParallelFinish */
diff --git a/src/include/jit/jit.h b/src/include/jit/jit.h
index b451f4027e7..6cb3bdc89f0 100644
--- a/src/include/jit/jit.h
+++ b/src/include/jit/jit.h
@@ -24,13 +24,8 @@
#define PGJIT_DEFORM (1 << 4)
-typedef struct JitContext
+typedef struct JitInstrumentation
{
- /* see PGJIT_* above */
- int flags;
-
- ResourceOwner resowner;
-
/* number of emitted functions */
size_t created_functions;
@@ -45,6 +40,25 @@ typedef struct JitContext
/* accumulated time for code emission */
instr_time emission_counter;
+} JitInstrumentation;
+
+/*
+ * DSM structure for accumulating jit instrumentation of all workers.
+ */
+typedef struct SharedJitInstrumentation
+{
+ int num_workers;
+ JitInstrumentation jit_instr[FLEXIBLE_ARRAY_MEMBER];
+} SharedJitInstrumentation;
+
+typedef struct JitContext
+{
+ /* see PGJIT_* above */
+ int flags;
+
+ ResourceOwner resowner;
+
+ JitInstrumentation instr;
} JitContext;
typedef struct JitProviderCallbacks JitProviderCallbacks;
@@ -85,6 +99,7 @@ extern void jit_release_context(JitContext *context);
* not be able to perform JIT (i.e. return false).
*/
extern bool jit_compile_expr(struct ExprState *state);
+extern void InstrJitAgg(JitInstrumentation *dst, JitInstrumentation *add);
#endif /* JIT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 687d7cd2f41..03ad5169762 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -569,9 +569,14 @@ typedef struct EState
* JIT information. es_jit_flags indicates whether JIT should be performed
* and with which options. es_jit is created on-demand when JITing is
* performed.
+ *
+ * es_jit_combined_instr, at the end of query execution with
+ * instrumentation enabled, is the combined instrumentation
+ * information of leader and followers.
*/
int es_jit_flags;
struct JitContext *es_jit;
+ struct JitInstrumentation *es_jit_combined_instr;
} EState;
@@ -923,6 +928,9 @@ typedef struct PlanState
Instrumentation *instrument; /* Optional runtime stats for this node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ /* Per-worker JIT instrumentation */
+ struct SharedJitInstrumentation *worker_jit_instrument;
+
/*
* Common structural data for all Plan types. These links to subsidiary
* state trees parallel links in the associated plan tree (except for the
--
2.18.0.rc2.dirty
Hi,
On 2018-09-25 01:47:49 -0700, Andres Freund wrote:
Forcing parallelism and JIT to be on, I get:
postgres[20216][1]=# EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT count(*) FROM pg_class;
Finalize Aggregate (cost=15.43..15.44 rows=1 width=8) (actual time=183.282..183.282 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=13
-> Gather (cost=15.41..15.42 rows=2 width=8) (actual time=152.835..152.904 rows=2 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
JIT for worker 0:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.480 ms, Inlining 78.789 ms, Optimization 5.797 ms, Emission 5.554 ms, Total 90.620 ms
JIT for worker 1:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.475 ms, Inlining 78.632 ms, Optimization 5.954 ms, Emission 5.900 ms, Total 90.962 ms
Buffers: shared hit=13
-> Partial Aggregate (cost=15.41..15.42 rows=1 width=8) (actual time=90.608..90.609 rows=1 loops=2)
Output: PARTIAL count(*)
Buffers: shared hit=13
Worker 0: actual time=90.426..90.426 rows=1 loops=1
Buffers: shared hit=7
Worker 1: actual time=90.791..90.792 rows=1 loops=1
Buffers: shared hit=6
-> Parallel Seq Scan on pg_catalog.pg_class (cost=0.00..14.93 rows=193 width=0) (actual time=0.006..0.048 rows=193 loops=2)
Output: relname, relnamespace, reltype, reloftype, relowner, relam, relfilenode, reltablespace, relpages, reltuples, relallvisible, reltoastrelid, relhasindex, relisshared, relpersistence, relkind, relnatts, relchecks, relhasoids, relhasrules, relhastriggers, relhassubclass, relrowsecurity, relfo
Buffers: shared hit=13
Worker 0: actual time=0.006..0.047 rows=188 loops=1
Buffers: shared hit=7
Worker 1: actual time=0.006..0.048 rows=198 loops=1
Buffers: shared hit=6
Planning Time: 0.152 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.955 ms, Inlining 157.422 ms, Optimization 11.751 ms, Emission 11.454 ms, Total 181.582 ms
Execution Time: 184.197 ms
With parallelism on, this is the aggregated cost of doing JIT
compilation. I wonder if, in VERBOSE mode, we should separately display
the cost of JIT for the leader. Comments?
- Andres
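For anyone wanting to reproduce output of this shape: the exact settings used
in the run above are not shown, but one plausible way to force both JIT and a
parallel plan on a tiny catalog table is to zero out the relevant cost
thresholds. The values below are illustrative test settings, not recommended
configuration:
SET jit = on;
SET jit_above_cost = 0;                   -- JIT every plan
SET jit_inline_above_cost = 0;            -- always inline
SET jit_optimize_above_cost = 0;          -- always run the optimization passes
SET parallel_setup_cost = 0;              -- make parallel plans look free
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;     -- allow parallel scans of small tables
SET max_parallel_workers_per_gather = 2;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS) SELECT count(*) FROM pg_class;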
Hi,
On 2018-09-25 12:50:34 -0700, Andres Freund wrote:
On 2018-09-25 01:47:49 -0700, Andres Freund wrote:
Planning Time: 0.152 ms
JIT:
Functions: 4
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.955 ms, Inlining 157.422 ms, Optimization 11.751 ms, Emission 11.454 ms, Total 181.582 ms
Execution Time: 184.197 ms
With parallelism on, this is the aggregated cost of doing JIT
compilation. I wonder if, in VERBOSE mode, we should separately display
the cost of JIT for the leader. Comments?
I've pushed the change without that bit - it's just a few additional
lines if we want to change that.
Greetings,
Andres Freund
On Tue, 25 Sep 2018 at 14:17, Andres Freund <andres@anarazel.de> wrote:
On 2018-09-18 10:03:02 +0530, Amit Khandekar wrote:
Attached v3 patch that does the above change.
Attached is a revised version of that patch. I've changed quite a few
things:
- I've reverted the split of "base" and "provider specific" contexts - I
don't think it really buys us anything here.
The idea was to have a single estate field that accumulates all the
JIT counters of leader as well as workers. I see that you want to
delay the merging of workers and backend counters until end of query
execution. More points on this in the bottom section.
- I've reverted the context creation changes - instead of creating a
context in the leader just to store instrumentation in the worker,
there's now a new EState->es_jit_combined_instr.
The context created in the leader was a light context, involving only
the resource owner stuff, and not the provider initialization.
- That also means worker instrumentation doesn't get folded into the
leader's instrumentation.
You mean the worker instrumentation doesn't get folded into the leader
until the query execution end, right ? In the committed code, I see
that now we merge the leader instrumentation into the combined worker
instrumentation in standard_ExecutorEnd().
This seems good for the future and for
extensions - it's not actually "linear" time that's spent doing
JIT in workers (& leader), as all of that work happens in
parallel. Being able to disentangle that seems important.
Ok. Your point is: we should have the backend and workers info stored
in two separate fields, and combine them only when we need it; so that
we will be in a position to show combined workers-only info separately
in the future. From the code, it looks like the es_jit_combined_instr
stores combined workers info not just from a single Gather node, but
all the Gather nodes in the plan. If we want to have separate workers
info, I am not sure it makes sense to combine workers from two
separate Gather nodes, because these two sets of workers are
unrelated, aren't they?
This needs a bit more polish tomorrow, but I'm starting to like where
this is going. Comments?
Yeah, I think the plan output looks reasonable compact now. Thanks.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2018-09-26 23:45:35 +0530, Amit Khandekar wrote:
On Tue, 25 Sep 2018 at 14:17, Andres Freund <andres@anarazel.de> wrote:
On 2018-09-18 10:03:02 +0530, Amit Khandekar wrote:
Attached v3 patch that does the above change.
Attached is a revised version of that patch. I've changed quite a few
things:
- I've reverted the split of "base" and "provider specific" contexts - I
don't think it really buys us anything here.
The idea was to have a single estate field that accumulates all the
JIT counters of leader as well as workers. I see that you want to
delay the merging of workers and backend counters until end of query
execution. More points on this in the bottom section.
No, I never want to merge things into the leader's stats.
- I've reverted the context creation changes - instead of creating a
context in the leader just to store instrumentation in the worker,
there's now a new EState->es_jit_combined_instr.
The context created in the leader was a light context, involving only
the resource owner stuff, and not the provider initialization.
I know, but it added a fair bit of infrastructure and complications just
for that - and I see very very little reason that that'd ever be a
necessary separation.
- That also means worker instrumentation doesn't get folded into the
leader's instrumentation.
You mean the worker instrumentation doesn't get folded into the leader
until the query execution end, right ? In the committed code, I see
that now we merge the leader instrumentation into the combined worker
instrumentation in standard_ExecutorEnd().
No, I mean it *never* gets folded into the leader's
instrumentation. There's a *separate* instrumentation field, where they
do get combined. But that still allows code to print out the leader's
stats alone.
This seems good for the future and for
extensions - it's not actually "linear" time that's spent doing
JIT in workers (& leader), as all of that work happens in
parallel. Being able to disentangle that seems important.
Ok. Your point is: we should have the backend and workers info stored
in two separate fields, and combine them only when we need it; so that
we will be in a position to show combined workers-only info separately
in the future. From the code, it looks like the es_jit_combined_instr
stores combined workers info not just from a single Gather node, but
all the Gather nodes in the plan. If we want to have separate workers
info, I am not sure it makes sense to combine workers from two
separate Gather nodes, because these two sets of workers are
unrelated, aren't they?
Well, we now have all the individual stats around in an unmodified
manner. We could print both the aggregate and individualized stats.
Greetings,
Andres Freund
On Tue, Sep 25, 2018 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
I've pushed the change without that bit - it's just a few additional
lines if we want to change that.
It seems that since this commit, JIT statistics are now only being printed
for parallel query plans. This is due to ExplainPrintJIT being called
before ExecutorEnd, so in a non-parallel query,
queryDesc->estate->es_jit_combined_instr will never get set.
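A minimal sketch of how the symptom can be observed; the session settings
here are illustrative:
SET jit = on;
SET jit_above_cost = 0;                   -- force JIT for every plan
SET max_parallel_workers_per_gather = 0;  -- force a non-parallel plan
EXPLAIN (ANALYZE) SELECT count(*) FROM pg_class;
With the commit in question and before this fix, the output of the above ends
without any "JIT:" block, even though JIT compilation did happen.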
Attached a patch that instead introduces a new ExplainPrintJITSummary
method that summarizes the statistics before they get printed.
I've also removed an (I believe) unnecessary "if (estate->es_instrument)"
check that prevented EXPLAIN without ANALYZE from showing whether JIT would
be used or not.
In addition this also updates a missed section in the documentation with
the new stats output format.
Best,
Lukas
--
Lukas Fittl
Attachments:
fix-jit-statistics-in-EXPLAIN-for-non-parallel-queries-v1.patch (application/octet-stream)
From a60c742e4b3ede8676ad70b0e1657f12b8a0aa5e Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 30 Sep 2018 22:24:20 -0700
Subject: [PATCH] Fix display of JIT statistics in EXPLAIN for non-parallel
queries
This bug was introduced by e63441c3f5ca4a3cdb47f9c7fed9e61dde17b60d
since ExecutorEnd is only called after the query plan is being output
by EXPLAIN/auto_explain. Instead, introduce a new ExplainPrintJITSummary
method that takes care of the summarization and then outputs the stats.
In passing also fix outdated JIT documentation that was still using the
old JIT statistics format.
---
contrib/auto_explain/auto_explain.c | 3 +--
doc/src/sgml/jit.sgml | 8 ++------
src/backend/commands/explain.c | 30 +++++++++++++++++++++++++++--
src/backend/executor/execMain.c | 15 ---------------
src/include/commands/explain.h | 1 +
5 files changed, 32 insertions(+), 25 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index b9c0896433..03be05129f 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -336,8 +336,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
if (es->analyze && auto_explain_log_triggers)
ExplainPrintTriggers(es, queryDesc);
if (es->costs)
- ExplainPrintJIT(es, queryDesc->estate->es_jit_flags,
- queryDesc->estate->es_jit_combined_instr, -1);
+ ExplainPrintJITSummary(es, queryDesc);
ExplainEndOutput(es);
/* Remove last line break */
diff --git a/doc/src/sgml/jit.sgml b/doc/src/sgml/jit.sgml
index 8387a4f6be..a21a07ef71 100644
--- a/doc/src/sgml/jit.sgml
+++ b/doc/src/sgml/jit.sgml
@@ -169,12 +169,8 @@ SET
Planning Time: 0.133 ms
JIT:
Functions: 3
- Generation Time: 1.259 ms
- Inlining: false
- Inlining Time: 0.000 ms
- Optimization: false
- Optimization Time: 0.797 ms
- Emission Time: 5.048 ms
+ Options: Inlining false, Optimization false, Expressions true, Deforming true
+ Timing: Generation 1.259 ms, Inlining 0.000 ms, Optimization 0.797 ms, Emission 5.048 ms, Total 7.104 ms
Execution Time: 7.416 ms
</screen>
As visible here, <acronym>JIT</acronym> was used, but inlining and
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ed6afe79a9..32df216f2c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -564,8 +564,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
* at a later stage.
*/
if (es->costs)
- ExplainPrintJIT(es, queryDesc->estate->es_jit_flags,
- queryDesc->estate->es_jit_combined_instr, -1);
+ ExplainPrintJITSummary(es, queryDesc);
/*
* Close down the query and free resources. Include time for this in the
@@ -688,6 +687,33 @@ ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc)
ExplainCloseGroup("Triggers", "Triggers", false, es);
}
+/*
+ * ExplainPrintJITSummary -
+ * Print summarized JIT instrumentation from leader and workers
+ */
+void
+ExplainPrintJITSummary(ExplainState *es, QueryDesc *queryDesc)
+{
+ JitInstrumentation ji;
+
+ if (!queryDesc->estate->es_jit)
+ return;
+
+ memset(&ji, 0, sizeof(JitInstrumentation));
+
+ /*
+ * Work with a copy instead of modifying the leader state, since this
+ * function may be called twice
+ */
+ InstrJitAgg(&ji, &queryDesc->estate->es_jit->instr);
+
+ /* If this process has done JIT in parallel workers, merge stats */
+ if (queryDesc->estate->es_jit_combined_instr)
+ InstrJitAgg(&ji, queryDesc->estate->es_jit_combined_instr);
+
+ ExplainPrintJIT(es, queryDesc->estate->es_jit_flags, &ji, -1);
+}
+
/*
* ExplainPrintJIT -
* Append information about JITing to es->str.
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 04f14c9178..572506e7c2 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -495,21 +495,6 @@ standard_ExecutorEnd(QueryDesc *queryDesc)
ExecEndPlan(queryDesc->planstate, estate);
- /*
- * If this process has done JIT, either merge stats into worker stats, or
- * use this process' stats as the global stats if no parallelism was used
- * / no workers did JIT.
- */
- if (estate->es_instrument && queryDesc->estate->es_jit)
- {
- if (queryDesc->estate->es_jit_combined_instr)
- InstrJitAgg(queryDesc->estate->es_jit_combined_instr,
- &queryDesc->estate->es_jit->instr);
- else
- queryDesc->estate->es_jit_combined_instr =
- &queryDesc->estate->es_jit->instr;
- }
-
/* do away with our snapshots */
UnregisterSnapshot(estate->es_snapshot);
UnregisterSnapshot(estate->es_crosscheck_snapshot);
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index b0b6f1e15a..d3f70fda08 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -81,6 +81,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
extern void ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc);
+extern void ExplainPrintJITSummary(ExplainState *es, QueryDesc *queryDesc);
extern void ExplainPrintJIT(ExplainState *es, int jit_flags,
struct JitInstrumentation *jit_instr, int worker_i);
--
2.19.0
Hi,
On 2018-10-01 00:32:18 -0700, Lukas Fittl wrote:
On Tue, Sep 25, 2018 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
I've pushed the change without that bit - it's just a few additional
lines if we want to change that.
It seems that since this commit, JIT statistics are now only being printed
for parallel query plans. This is due to ExplainPrintJIT being called
before ExecutorEnd, so in a non-parallel query,
queryDesc->estate->es_jit_combined_instr will never get set.
Ugh.
Attached a patch that instead introduces a new ExplainPrintJITSummary
method that summarizes the statistics before they get printed.
Yea, I had something like that, and somehow concluded that it wasn't
needed, and moved it to the wrong place :/.
I've also removed an (I believe) unnecessary "if (estate->es_instrument)"
check that prevented EXPLAIN without ANALYZE from showing whether JIT would
be used or not.
In addition this also updates a missed section in the documentation with
the new stats output format.
Thanks a lot for the patch! I've pushed it with some mild changes
(renamed es_jit_combined_stats to worker_stats; changed
ExplainPrintJITSummary to not look at es_jit, but es_jit_flags as
theoretically es_jit might not be allocated; very minor comment
changes).
- Andres