New expression evaluator and indirect jumps

Started by Jeff Davis almost 9 years ago, 2 messages
#1 Jeff Davis
pgsql@j-davis.com

Andres,

Thank you for your great work on the expression evaluator:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755

I was looking at the dispatch code, and it goes to significant effort
(using computed goto) to generate many indirect jumps instead of just
one. The reasoning is that CPUs can do better branch prediction when
they can predict separately for each of the indirect jumps.

But the paper here: https://hal.inria.fr/hal-01100647/document claims
that it's not really needed on newer CPUs because they are better at
branch prediction. I skimmed it, and if I understand correctly, modern
branch predictors use some history, so they can predict based on the
instructions executed before it got to the indirect jump.

I tried looking through the discussion on this list, but most seemed
to revolve around which compilers generated the assembly we wanted
rather than how much it actually improved performance. Can someone
please point me to the numbers? Do they refute the conclusions in the
paper, or are we concerned about a wider range of processors?

Regards,
Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#1)
Re: New expression evaluator and indirect jumps

Hi Jeff,

On 2017-04-01 17:36:42 -0700, Jeff Davis wrote:

> Thank you for your great work on the expression evaluator:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755
>
> I was looking at the dispatch code, and it goes to significant effort
> (using computed goto) to generate many indirect jumps instead of just
> one. The reasoning is that CPUs can do better branch prediction when
> they can predict separately for each of the indirect jumps.

Right.

> But the paper here: https://hal.inria.fr/hal-01100647/document claims
> that it's not really needed on newer CPUs because they are better at
> branch prediction. I skimmed it, and if I understand correctly, modern
> branch predictors use some history, so they can predict based on the
> instructions executed before it got to the indirect jump.

Yes, it's true that the benefits on modern CPUs are smaller than they
used to be. But for one, the branch history buffers are of very limited
size, which in many cases will make prediction an issue again. For
another, switch-based dispatch still performs a bounds check on the
opcode, which has some performance cost.
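
To make the two dispatch styles concrete, here is a minimal sketch of a
toy accumulator machine, not PostgreSQL's actual evaluator. The opcodes,
the Step struct, and the function names are all hypothetical. run_switch
funnels every step through the single indirect jump of the switch, while
run_goto ends each handler with its own indirect jump via the GCC/Clang
"labels as values" extension, so each jump site can be predicted
separately and no range check on the opcode is emitted.

```c
#include <stdint.h>

/* Hypothetical opcodes for a toy accumulator machine (illustration only). */
typedef enum { OP_ADD, OP_MUL, OP_NEG, OP_DONE } OpCode;

typedef struct
{
    OpCode  opcode;
    int64_t arg;        /* operand; ignored by OP_NEG and OP_DONE */
} Step;

/* Switch-based dispatch: every step goes through the one shared jump. */
static int64_t
run_switch(const Step *steps, int64_t acc)
{
    for (const Step *op = steps;; op++)
    {
        switch (op->opcode)
        {
            case OP_ADD:  acc += op->arg; break;
            case OP_MUL:  acc *= op->arg; break;
            case OP_NEG:  acc = -acc;     break;
            case OP_DONE: return acc;
        }
    }
}

#if defined(__GNUC__)
/* Computed-goto dispatch: each handler ends with its own indirect jump. */
static int64_t
run_goto(const Step *steps, int64_t acc)
{
    /* Jump table indexed by opcode; &&label is a GCC/Clang extension. */
    static const void *const targets[] = {
        [OP_ADD]  = &&do_add,
        [OP_MUL]  = &&do_mul,
        [OP_NEG]  = &&do_neg,
        [OP_DONE] = &&do_done,
    };
    const Step *op = steps;

#define DISPATCH() goto *targets[op->opcode]
#define NEXT()     do { op++; DISPATCH(); } while (0)

    DISPATCH();

do_add:
    acc += op->arg;
    NEXT();
do_mul:
    acc *= op->arg;
    NEXT();
do_neg:
    acc = -acc;
    NEXT();
do_done:
    return acc;

#undef NEXT
#undef DISPATCH
}
#else
/* Without the extension, fall back to the portable switch version. */
#define run_goto run_switch
#endif
```

Both functions compute the same result for a program terminated by
OP_DONE; only the shape of the generated jumps differs. The jump table
is indexed without validation, so this sketch assumes the opcodes in the
program are always in range.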

> I tried looking through the discussion on this list, but most seemed
> to revolve around which compilers generated the assembly we wanted
> rather than how much it actually improved performance. Can someone
> please point me to the numbers? Do they refute the conclusions in the
> paper, or are we concerned about a wider range of processors?

I ran a lot of benchmarks during development, and either there was no
performance difference between computed gotos and switch based
threading, or computed gotos came out ahead. In expression heavy cases,
e.g. TPC-H Q01, there's a considerable advantage (~3.5% total, making it
something like ~15% expression evaluation speedup). I primarily
evaluated performance on a Skylake (i.e. newer than Haswell), rather
than on my older Nehalem workstation, to avoid optimizing for the wrong
thing.

I am not particularly concerned about !x86 processors, but a lot of them
indeed seem to have much less elaborate branch predictors (especially
ARM). Also, Nehalem and Sandy Bridge are still quite common out there,
especially in servers.

Since the cost of maintaining the computed goto stuff isn't that high,
I'm not really concerned here.

Greetings,

Andres Freund
