PSA: New Intel MDS vulnerability mitigations cause measurable slowdown
Hi,
There's a new set of CPU vulnerabilities, so far only affecting Intel
CPUs. Cribbing from the linux-kernel announcement, I'm referring to
https://xenbits.xen.org/xsa/advisory-297.html
for details.
The "fix" is for the OS to perform some extra mitigations:
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html
https://www.kernel.org/doc/html/latest/x86/mds.html#mds
*And* SMT/hyperthreading needs to be disabled, to be fully safe.
Fun.
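For anyone wanting to check what their kernel is doing, the knobs from
the linked docs boil down to roughly this (a sketch; the SMT write needs
root):
# report the MDS mitigation status the kernel picked
cat /sys/devices/system/cpu/vulnerabilities/mds
# disable SMT at runtime; the boot-time equivalent is mds=full,nosmt
echo off > /sys/devices/system/cpu/smt/control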
I've run a quick pgbench benchmark:
*Without* disabling SMT, for readonly pgbench, I'm seeing regressions
between 7-11%, depending on the size of shared_buffers (and some runtime
variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
I'd be surprised if there weren't adversarial loads with bigger
slowdowns - what gets more expensive with the mitigations is syscalls.
Most OSs / distributions either have rolled these changes out already,
or will do so soon. So it's likely that most of us and our users will be
affected by this soon. At least on Linux the part of the mitigation
that makes syscalls slower (blowing away buffers at the end of a syscall)
is enabled by default, but SMT is not disabled by default.
Greetings,
Andres Freund
On Wed, May 15, 2019 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
*Without* disabling SMT, for readonly pgbench, I'm seeing regressions
between 7-11%, depending on the size of shared_buffers (and some runtime
variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
I'd be surprised if there weren't adversarial loads with bigger
slowdowns - what gets more expensive with the mitigations is syscalls.
Yikes. This all in warm shared buffers, right? So effectively this
is the cost of recvfrom() and sendto() going up? Did you use -M
prepared? If not, there would also be a couple of lseek(SEEK_END)
calls in between for planning... I wonder how many more
syscall-taxing mitigations we need before relation size caching pays
off.
--
Thomas Munro
https://enterprisedb.com
Hi,
On 2019-05-15 12:52:47 +1200, Thomas Munro wrote:
On Wed, May 15, 2019 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
*Without* disabling SMT, for readonly pgbench, I'm seeing regressions
between 7-11%, depending on the size of shared_buffers (and some runtime
variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
I'd be surprised if there weren't adversarial loads with bigger
slowdowns - what gets more expensive with the mitigations is syscalls.
Yikes. This all in warm shared buffers, right?
Not initially, but it ought to warm up quite quickly. I ran something
boiling down to pgbench -q -i -s 200; psql -c 'vacuum (freeze, analyze,
verbose)'; pgbench -n -S -c 32 -j 32 -S -M prepared -T 100 -P1. As both
pgbench -i's COPY and VACUUM use ringbuffers, initially s_b will
effectively be empty.
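Spelled out as commands, that's roughly (scale and duration as above,
everything else default):
pgbench -q -i -s 200
psql -c 'vacuum (freeze, analyze, verbose)'
pgbench -n -S -c 32 -j 32 -M prepared -T 100 -P 1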
So effectively this is the cost of recvfrom() and sendto() going up?
Plus epoll_wait(). And read(), for the cases where s_b was smaller than
the data.
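If anyone wants to double-check the mix on their own workload, attaching
strace to a single backend for a few seconds gives per-syscall counts
(<backend_pid> here standing in for whichever backend pgbench connected to):
strace -c -p <backend_pid>
Hit Ctrl-C after a while to get the summary table.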
Did you use -M prepared?
Yes.
If not, there would also be a couple of lseek(SEEK_END) calls in
between for planning... I wonder how many more syscall-taxing
mitigations we need before relation size caching pays off.
Yea, I suspect we're going to have to go there soon for a number of
reasons.
- Andres
Hi,
On 2019-05-14 15:30:52 -0700, Andres Freund wrote:
There's a new set of CPU vulnerabilities, so far only affecting Intel
CPUs. Cribbing from the linux-kernel announcement, I'm referring to
https://xenbits.xen.org/xsa/advisory-297.html
for details.
The "fix" is for the OS to perform some extra mitigations:
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html
https://www.kernel.org/doc/html/latest/x86/mds.html#mds
*And* SMT/hyperthreading needs to be disabled, to be fully safe.
Fun.
I've run a quick pgbench benchmark:
*Without* disabling SMT, for readonly pgbench, I'm seeing regressions
between 7-11%, depending on the size of shared_buffers (and some runtime
variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
I'd be surprised if there weren't adversarial loads with bigger
slowdowns - what gets more expensive with the mitigations is syscalls.
The profile after the mitigations looks like:
+ 3.62% postgres [kernel.vmlinux] [k] do_syscall_64
+ 2.99% postgres postgres [.] _bt_compare
+ 2.76% postgres postgres [.] hash_search_with_hash_value
+ 2.33% postgres [kernel.vmlinux] [k] entry_SYSCALL_64
+ 1.69% pgbench [kernel.vmlinux] [k] do_syscall_64
+ 1.61% postgres postgres [.] AllocSetAlloc
1.41% postgres postgres [.] PostgresMain
+ 1.22% pgbench [kernel.vmlinux] [k] entry_SYSCALL_64
+ 1.11% postgres postgres [.] LWLockAcquire
+ 0.86% postgres postgres [.] PinBuffer
+ 0.80% postgres postgres [.] LockAcquireExtended
+ 0.78% postgres [kernel.vmlinux] [k] psi_task_change
0.76% pgbench pgbench [.] threadRun
0.69% postgres postgres [.] LWLockRelease
+ 0.69% postgres postgres [.] SearchCatCache1
0.66% postgres postgres [.] LockReleaseAll
+ 0.65% postgres postgres [.] GetSnapshotData
+ 0.58% postgres postgres [.] hash_seq_search
0.54% postgres postgres [.] hash_search
+ 0.53% postgres [kernel.vmlinux] [k] __switch_to
+ 0.53% postgres postgres [.] hash_any
0.52% pgbench libpq.so.5.12 [.] pqParseInput3
0.50% pgbench [kernel.vmlinux] [k] do_raw_spin_lock
where do_syscall_64 shows this instruction profile:
│ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
│ {
│ asm_volatile_goto("1:"
1.58 │ ↓ jmpq bd
│ mds_clear_cpu_buffers():
│ * Works with any segment selector, but a valid writable
│ * data segment is the fastest variant.
│ *
│ * "cc" clobber is required because VERW modifies ZF.
│ */
│ asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
77.38 │ verw 0x13fea53(%rip) # ffffffff82400ee0 <ds.4768>
│ do_syscall_64():
│ }
│
│ syscall_return_slowpath(regs);
│ }
13.18 │ bd: pop %rbx
0.08 │ pop %rbp
│ ← retq
│ nr = syscall_trace_enter(regs);
│ c0: mov %rbp,%rdi
│ → callq syscall_trace_enter
Where verw is the instruction that was recycled to now have the
side-effect of flushing CPU buffers.
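For reference, a profile and annotation like the above can be reproduced
along these lines (sampling system-wide while the benchmark is running;
duration arbitrary):
perf record -a -g -- sleep 30
perf report
perf annotate do_syscall_64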
Greetings,
Andres Freund
On Wed, May 15, 2019 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
I've run a quick pgbench benchmark:
*Without* disabling SMT, for readonly pgbench, I'm seeing regressions
between 7-11%, depending on the size of shared_buffers (and some runtime
variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
I'd be surprised if there weren't adversarial loads with bigger
slowdowns - what gets more expensive with the mitigations is syscalls.
This stuff landed in my FreeBSD 13.0-CURRENT kernel, so I was curious
to measure it with and without the earlier mitigations. On my humble
i7-8550U laptop with the new 1.22 microcode installed, with my usual
settings of PTI=on and IBRS=off, so far MDS=VERW gives me ~1.5% loss
of TPS with a single client, up to 4.3% loss of TPS for 16 clients,
but it didn't go higher when I tried 32 clients. This was a tiny
scale 10 database, though in a quick test it didn't look like it was
worse with scale 100.
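For anyone wanting to reproduce this, those settings are controlled by
sysctls; the names below are from memory, so treat them as a sketch
rather than gospel:
sysctl vm.pmap.pti            # PTI (read-only at runtime, set as a loader tunable)
sysctl hw.ibrs_disable        # IBRS; 1 disables it
sysctl hw.mds_disable         # MDS; 0 = off, nonzero picks a mitigation variant
sysctl hw.mds_disable_state   # what the kernel actually ended up using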
With all three mitigations activated, my little dev machine has gone
from being able to do ~11.8 million baseline syscalls per second to
~1.6 million, or ~1.4 million with the AVX variant of the mitigation.
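The harness isn't shown here; a minimal stand-in that measures the same
thing (a tight getuid() loop, timed) would be something like:
# hypothetical stand-in, not necessarily the exact tool used
cat > getuid_bench.c <<'EOF'
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct timespec t0, t1;
    long i, n = 10 * 1000 * 1000;   /* 10M calls */
    double secs;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < n; i++)
        getuid();                   /* cheap syscall, not cached by libc */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f getuid() calls/sec\n", n / secs);
    return 0;
}
EOF
cc -O2 -o getuid_bench getuid_bench.c && ./getuid_bench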
Raw getuid() syscalls per second:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 11798658 4764159 3274043
off on 2652564 1941606 1655356
on off 4973053 2932906 2339779
on on 1988527 1556922 1378798
pgbench read-only transactions per second, 1 client thread:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 19393 18949 18615
off on 17946 17586 17323
on off 19381 19015 18696
on on 18045 17709 17418
pgbench -M prepared read-only transactions per second, 1 client thread:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 35020 34049 33200
off on 31658 30902 30229
on off 35445 34353 33415
on on 32415 31599 30712
pgbench -M prepared read-only transactions per second, 4 client threads:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 79515 76898 76465
off on 63608 62220 61952
on off 77863 75431 74847
on on 62709 60790 60575
pgbench -M prepared read-only transactions per second, 16 client threads:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 125984 121164 120468
off on 112884 108346 107984
on off 121032 116156 115462
on on 108889 104636 104027
time gmake -s check:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 16.78 16.85 17.03
off on 18.19 18.81 19.08
on off 16.67 16.86 17.33
on on 18.58 18.83 18.99
--
Thomas Munro
https://enterprisedb.com
Message from Thomas Munro <thomas.munro@gmail.com> on Thu, 16 May
2019 at 13:09:
On Wed, May 15, 2019 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
I've run a quick pgbench benchmark:
*Without* disabling SMT, for readonly pgbench, I'm seeing regressions
between 7-11%, depending on the size of shared_buffers (and some runtime
variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
I'd be surprised if there weren't adversarial loads with bigger
slowdowns - what gets more expensive with the mitigations is syscalls.
This stuff landed in my FreeBSD 13.0-CURRENT kernel, so I was curious
to measure it with and without the earlier mitigations. On my humble
i7-8550U laptop with the new 1.22 microcode installed, with my usual
settings of PTI=on and IBRS=off, so far MDS=VERW gives me ~1.5% loss
of TPS with a single client, up to 4.3% loss of TPS for 16 clients,
but it didn't go higher when I tried 32 clients. This was a tiny
scale 10 database, though in a quick test it didn't look like it was
worse with scale 100.
With all three mitigations activated, my little dev machine has gone
from being able to do ~11.8 million baseline syscalls per second to
Did you mean "1.8"?
~1.6 million, or ~1.4 million with the AVX variant of the mitigation.
--
Albert Cervera i Areny
http://www.NaN-tic.com
Tel. 93 553 18 03
On 5/16/19 12:24 PM, Albert Cervera i Areny wrote:
Message from Thomas Munro <thomas.munro@gmail.com> on Thu, 16 May
2019 at 13:09:
With all three mitigations activated, my little dev machine has gone
from being able to do ~11.8 million baseline syscalls per second to
Did you mean "1.8"?
Not in what I thought I saw:
~1.6 million, or ~1.4 million ...
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 11798658 4764159 3274043
^^^^^^^^
off on 2652564 1941606 1655356
on off 4973053 2932906 2339779
on on 1988527 1556922 1378798
^^^^^^^ ^^^^^^^
-Chap
On Fri, May 17, 2019 at 5:26 AM Chapman Flack <chap@anastigmatix.net> wrote:
On 5/16/19 12:24 PM, Albert Cervera i Areny wrote:
Message from Thomas Munro <thomas.munro@gmail.com> on Thu, 16 May
2019 at 13:09:
With all three mitigations activated, my little dev machine has gone
from being able to do ~11.8 million baseline syscalls per second to
Did you mean "1.8"?
Not in what I thought I saw:
~1.6 million, or ~1.4 million ...
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 11798658 4764159 3274043
^^^^^^^^
off on 2652564 1941606 1655356
on off 4973053 2932906 2339779
on on 1988527 1556922 1378798
^^^^^^^ ^^^^^^^
Right. Actually it's worse than that -- after I posted I realised
that I had some debug stuff enabled in my kernel that was slowing
things down a bit, so I reran the tests overnight with a production
kernel and here is what I see this morning. It's actually ~17.8
million syscalls/sec -> ~1.7 million syscalls/sec, if you go from all
mitigations off to all mitigations on, or -> ~3.2 million for just PTI
+ MDS. And the loss of TPS is ~5% for the case I was most interested
in, just turning on MDS=VERW if you already had PTI on and IBRS off.
Raw getuid() syscalls per second:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 17771744 5372032 3575035
off on 3060923 2166527 1817052
on off 5622591 3150883 2463934
on on 2213190 1687748 1475605
pgbench read-only transactions per second, 1 client thread:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 22414 22103 21571
off on 21298 20817 20418
on off 22473 22080 21550
on on 21286 20850 20386
pgbench -M prepared read-only transactions per second, 1 client thread:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 43508 42476 41123
off on 40729 39483 38555
on off 44110 42989 42012
on on 41143 39990 38798
pgbench -M prepared read-only transactions per second, 4 client threads:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 100735 97689 96662
off on 80142 77804 77064
on off 100540 97010 95827
on on 79492 76976 76226
pgbench -M prepared read-only transactions per second, 16 client threads:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 161015 152978 152556
off on 145605 139438 139179
on off 155359 147691 146987
on on 140976 134978 134177
pgbench -M prepared read-only transactions per second, 16 client threads:
PTI IBRS MDS=off MDS=VERW MDS=AVX
===== ===== ======== ======== ========
off off 157986 150132 149436
off on 142618 136220 135901
on off 153482 146214 145839
on on 138650 133074 132142
--
Thomas Munro
https://enterprisedb.com
On Fri, May 17, 2019 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:
pgbench -M prepared read-only transactions per second, 16 client threads:
(That second "16 client threads" line should read "32 client threads".)
--
Thomas Munro
https://enterprisedb.com