1. 17 March 2015, 3 commits
  2. 10 March 2015, 3 commits
  3. 07 March 2015, 3 commits
    • x86/asm: Optimize unnecessarily wide TEST instructions · 3e1aa7cb
      Committed by Denys Vlasenko
      By the nature of the TEST operation, it is often possible to test
      a narrower part of the operand:
      
          "testl $3,  mem"  ->  "testb $3, mem",
          "testq $3, %rcx"  ->  "testb $3, %cl"
      
      This results in shorter instructions because, unlike other ALU ops, the
      TEST instruction has no sign-extending byte-immediate forms.
      
      Note that this change does not create any LCP (Length-Changing Prefix)
      stalls. Those occur when a 0x66 prefix is added (i.e. when 16-bit
      immediates are used), which changes such TEST instructions:
      
        [test_opcode] [modrm] [imm32]
      
      to:
      
        [0x66] [test_opcode] [modrm] [imm16]
      
      where [imm16] has a *different length* now: 2 bytes instead of 4.
      This confuses the decoder and slows down execution.
      
      REX prefixes were carefully designed to almost never hit this case:
      adding a REX prefix does not change instruction length, except for the
      MOVABS and MOV [addr],RAX instructions.
      
      This patch does not add any instructions which would use a 0x66 prefix;
      the code changes in assembly are:
      
          -48 f7 07 01 00 00 00 	testq  $0x1,(%rdi)
          +f6 07 01             	testb  $0x1,(%rdi)
          -48 f7 c1 01 00 00 00 	test   $0x1,%rcx
          +f6 c1 01             	test   $0x1,%cl
          -48 f7 c1 02 00 00 00 	test   $0x2,%rcx
          +f6 c1 02             	test   $0x2,%cl
          -41 f7 c2 01 00 00 00 	test   $0x1,%r10d
          +41 f6 c2 01          	test   $0x1,%r10b
          -48 f7 c1 04 00 00 00 	test   $0x4,%rcx
          +f6 c1 04             	test   $0x4,%cl
          -48 f7 c1 08 00 00 00 	test   $0x8,%rcx
          +f6 c1 08             	test   $0x8,%cl
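
      For illustration only (not part of the patch), the same trade-off can be
      reproduced from C with GCC inline assembly (flag-output constraints need
      a reasonably recent GCC). Both helpers below read ZF into a flag output;
      with -O2 the memory operand is typically (%rdi), so the byte form
      assembles to the short encoding shown in the diff above:

        #include <stdint.h>

        /* Wide form: typically assembles to the 7-byte
         * "48 f7 07 01 00 00 00   testq $0x1,(%rdi)". */
        int flag_set_wide(const uint64_t *p)
        {
                uint8_t nz;
                asm("testq $0x1, %[mem]" : "=@ccnz" (nz) : [mem] "m" (*p));
                return nz;
        }

        /* Narrow form: only bit 0 is tested, so reading the low byte is
         * enough; typically assembles to the 3-byte "f6 07 01   testb $0x1,(%rdi)". */
        int flag_set_narrow(const uint64_t *p)
        {
                uint8_t nz;
                asm("testb $0x1, %[mem]"
                    : "=@ccnz" (nz) : [mem] "m" (*(const uint8_t *)p));
                return nz;
        }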
      
      Linus further notes:
      
         "There are no stalls from using 8-bit instruction forms.
      
          Now, changing from 64-bit or 32-bit 'test' instructions to 8-bit ones
          *could* cause problems if it ends up having forwarding issues, so that
          instead of just forwarding the result, you end up having to wait for
          it to be stable in the L1 cache (or possibly the register file). The
          forwarding from the store buffer is simplest and most reliable if the
          read is done at the exact same address and the exact same size as the
          write that gets forwarded.
      
          But that's true only if:
      
           (a) the write was very recent and is still in the write queue. I'm
               not sure that's the case here anyway.
      
           (b) on at least most Intel microarchitectures, you have to test a
               different byte than the lowest one (so forwarding a 64-bit write
               to a 8-bit read ends up working fine, as long as the 8-bit read
               is of the low 8 bits of the written data).
      
          A very similar issue *might* show up for registers too, not just
          memory writes, if you use 'testb' with a high-byte register (where
          instead of forwarding the value from the original producer it needs to
          go through the register file and then shifted). But it's mainly a
          problem for store buffers.
      
          But afaik, the way Denys changed the test instructions, neither of the
          above issues should be true.
      
          The real problem for store buffer forwarding tends to be "write 8
          bits, read 32 bits". That can be really surprisingly expensive,
          because the read ends up having to wait until the write has hit the
          cacheline, and we might talk tens of cycles of latency here. But
          "write 32 bits, read the low 8 bits" *should* be fast on pretty much
          all x86 chips, afaik."
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1425675332-31576-1-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3e1aa7cb
    • x86/asm/entry: Replace this_cpu_sp0() with current_top_of_stack() and fix it on x86_32 · a7fcf28d
      Committed by Andy Lutomirski
      I broke 32-bit kernels.  The implementation of sp0 was correct
      as far as I can tell, but sp0 was much weirder on x86_32 than I
      realized.  It has the following issues:
      
       - Init's sp0 is inconsistent with everything else's: non-init tasks
         are offset by 8 bytes.  (I have no idea why, and the comment is unhelpful.)
      
       - vm86 does crazy things to sp0.
      
      Fix it up by replacing this_cpu_sp0() with
      current_top_of_stack() and using a new percpu variable to track
      the top of the stack on x86_32.
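
      A rough sketch of the shape such a helper could take, based only on the
      description above (the per-CPU variable name and the 64-bit read are
      assumptions, not a verbatim copy of the patch):

        /* Assumed names: cpu_current_top_of_stack is the new x86_32 per-CPU
         * variable; the x86_64 side keeps reading sp0 from the per-CPU TSS. */
        DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);

        static inline unsigned long current_top_of_stack(void)
        {
        #ifdef CONFIG_X86_64
                /* On 64-bit, sp0 in the per-CPU TSS is the top of the task stack. */
                return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
        #else
                /* On 32-bit, sp0 is offset (and abused by vm86), so track the
                 * top of the stack in a dedicated per-CPU variable instead. */
                return this_cpu_read_stable(cpu_current_top_of_stack);
        #endif
        }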
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 75182b16 ("x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0()")
      Link: http://lkml.kernel.org/r/d09dbe270883433776e0cbee3c7079433349e96d.1425692936.git.luto@amacapital.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a7fcf28d
    • x86/asm/entry: Delay loading sp0 slightly on task switch · b27559a4
      Committed by Andy Lutomirski
      The change:
      
        75182b16 ("x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0()")
      
      had the unintended side effect of changing the return value of
      current_thread_info() during part of the context switch process.
      Change it back.
      
      This has no effect as far as I can tell -- it's just for
      consistency.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/9fcaa47dd8487db59eed7a3911b6ae409476763e.1425692936.git.luto@amacapital.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b27559a4
  4. 06 March 2015, 6 commits
  5. 05 March 2015, 14 commits
  6. 04 March 2015, 1 commit
  7. 28 February 2015, 1 commit
  8. 26 February 2015, 1 commit
  9. 25 February 2015, 2 commits
  10. 24 February 2015, 1 commit
    • x86/xen: allow privcmd hypercalls to be preempted · fdfd811d
      Committed by David Vrabel
      Hypercalls submitted by user space tools via the privcmd driver can
      take a long time (potentially many 10s of seconds) if the hypercall
      has many sub-operations.
      
      A fully preemptible kernel may deschedule such a task in any upcall
      called from a hypercall continuation.
      
      However, in a kernel with voluntary or no preemption, hypercall
      continuations in Xen allow event handlers to be run but the task
      issuing the hypercall will not be descheduled until the hypercall is
      complete and the ioctl returns to user space.  These long running
      tasks may also trigger the kernel's soft lockup detection.
      
      Add xen_preemptible_hcall_begin() and xen_preemptible_hcall_end() to
      bracket hypercalls that may be preempted.  Use these in the privcmd
      driver.
      
      When returning from an upcall, call xen_maybe_preempt_hcall() which
      adds a schedule point if the current task was within a preemptible
      hypercall.
      
      Since _cond_resched() can move the task to a different CPU, clear and
      set xen_in_preemptible_hcall around the call.
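
      A sketch of the upcall-side check following the description above (the
      per-CPU flag name comes from the text; the surrounding details are
      assumptions):

        DEFINE_PER_CPU(bool, xen_in_preemptible_hcall);

        asmlinkage __visible void xen_maybe_preempt_hcall(void)
        {
                if (unlikely(__this_cpu_read(xen_in_preemptible_hcall) &&
                             should_resched())) {
                        /*
                         * Clear the flag across the reschedule: _cond_resched()
                         * may move this task to a different CPU, and the flag
                         * is per-CPU.
                         */
                        __this_cpu_write(xen_in_preemptible_hcall, false);
                        _cond_resched();
                        __this_cpu_write(xen_in_preemptible_hcall, true);
                }
        }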
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      fdfd811d
  11. 23 February 2015, 5 commits
    • x86/asm: Cleanup prefetch primitives · a930dc45
      Committed by Borislav Petkov
      This is based on a patch originally by hpa.
      
      With the current improvements to the alternatives, we can simply use %P1
      as a mem8 operand constraint and rely on the toolchain to generate the
      proper instruction sizes. For example, on 32-bit, where we use an empty
      old instruction, we get:
      
        apply_alternatives: feat: 6*32+8, old: (c104648b, len: 4), repl: (c195566c, len: 4)
        c104648b: alt_insn: 90 90 90 90
        c195566c: rpl_insn: 0f 0d 4b 5c
      
        ...
      
        apply_alternatives: feat: 6*32+8, old: (c18e09b4, len: 3), repl: (c1955948, len: 3)
        c18e09b4: alt_insn: 90 90 90
        c1955948: rpl_insn: 0f 0d 08
      
        ...
      
        apply_alternatives: feat: 6*32+8, old: (c1190cf9, len: 7), repl: (c1955a79, len: 7)
        c1190cf9: alt_insn: 90 90 90 90 90 90 90
        c1955a79: rpl_insn: 0f 0d 0d a0 d4 85 c1
      
      all with the proper padding done depending on the size of the
      replacement instruction the compiler generates.
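
      The resulting pattern looks roughly like the sketch below; the
      alternative_input() macro and the feature bit are real, but the exact
      arguments here (e.g. BASE_PREFETCH as the fallback instruction) are an
      assumption, not a verbatim quote of the patch:

        static inline void prefetch(const void *x)
        {
                /* Give the compiler a real memory operand and print it with
                 * %P1 so the toolchain sizes the instruction; the alternatives
                 * machinery pads the (possibly empty) old side as shown above. */
                alternative_input(BASE_PREFETCH, "prefetchnta %P1",
                                  X86_FEATURE_XMM,
                                  "m" (*(const char *)x));
        }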
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      a930dc45
    • x86/entry_32: Convert X86_INVD_BUG to ALTERNATIVE macro · 8e65f6e0
      Committed by Borislav Petkov
      Booting a 486 kernel on an AMD guest with this patch applied says:
      
        apply_alternatives: feat: 0*32+25, old: (c160a475, len: 5), repl: (c19557d4, len: 5)
        c160a475: alt_insn: 68 10 35 00 c1
        c19557d4: rpl_insn: 68 80 39 00 c1
      
      which is:
      
        old insn VA: 0xc160a475, CPU feat: X86_FEATURE_XMM, size: 5
        simd_coprocessor_error:
                 c160a475:      68 10 35 00 c1          push $0xc1003510 <do_general_protection>
        repl insn: 0xc19557d4, size: 5
                 c160a475:      68 80 39 00 c1          push $0xc1003980 <do_simd_coprocessor_error>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      8e65f6e0
    • x86/alternatives: Use optimized NOPs for padding · 4fd4b6e5
      Committed by Borislav Petkov
      Alternatives now allow for an empty old instruction. In this case we go
      and pad the space with NOPs at assembly time. However, there are optimal,
      longer NOPs which should be used instead. Do that at patching time by
      adding alt_instr.padlen-sized NOPs at the old instruction address.
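
      In the patching code this boils down to something like the following
      sketch (helper names such as add_nops() are assumptions here, not quoted
      from the patch):

        /* If the feature is absent, rewrite the build-time single-byte NOP
         * padding at the end of the old instruction as one optimal,
         * a->padlen-sized NOP sequence. */
        static void optimize_nops(struct alt_instr *a, u8 *instr)
        {
                unsigned long flags;

                local_irq_save(flags);
                add_nops(instr + (a->instrlen - a->padlen), a->padlen);
                sync_core();
                local_irq_restore(flags);
        }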
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      4fd4b6e5
    • x86/alternatives: Make JMPs more robust · 48c7a250
      Committed by Borislav Petkov
      Up until now we had to pay attention to how the relative offset of
      relative JMPs in alternatives gets computed, so that the jump target is
      still correct. Or, as is the case for near CALLs (opcode e8), we still
      have to go and readjust the offset at patching time.
      
      What is more, the static_cpu_has_safe() facility had to forcefully
      generate 5-byte JMPs, since we couldn't rely on the compiler to generate
      properly sized ones. Worse than that, sometimes it would generate a
      replacement JMP which is longer than the original one, thus overwriting
      the beginning of the next instruction at patching time.
      
      So, in order to alleviate all that and make using JMPs more
      straightforward, we go and pad the original instruction in an
      alternative block with NOPs at build time, should the replacement(s) be
      longer. This way, alternatives users don't have to pay special attention
      to keeping original and replacement instruction sizes in sync: the
      assembler simply adds padding where needed and does nothing otherwise.
      
      As a second aspect, we go and recompute JMPs at patching time so that we
      can try to make 5-byte JMPs into two-byte ones if possible. If not, we
      still have to recompute the offsets, as the replacement JMP gets put far
      away in the .altinstr_replacement section, leading to a wrong offset if
      copied verbatim.
      
      For example, on a locally generated kernel image
      
        old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
        __switch_to:
         ffffffff810014bd:      eb 21                   jmp ffffffff810014e0
        repl insn: size: 5
        ffffffff81d0b23c:       e9 b1 62 2f ff          jmpq ffffffff810014f2
      
      gets corrected to a 2-byte JMP:
      
        apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
        alt_insn: e9 b1 62 2f ff
        recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
        converted to: eb 33 90 90 90
      
      and a 5-byte JMP:
      
        old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
        __switch_to:
         ffffffff81001516:      eb 30                   jmp ffffffff81001548
        repl insn: size: 5
         ffffffff81d0b241:      e9 10 63 2f ff          jmpq ffffffff81001556
      
      gets shortened into a two-byte one:
      
        apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
        alt_insn: e9 10 63 2f ff
        recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
        converted to: eb 3e 90 90 90
      
      ... and so on.
      
      This leads to a net win of around
      
      40ish replacements * 3 bytes savings =~ 120 bytes of I$
      
      on an AMD guest, which means some savings of precious instruction cache
      bandwidth. The padding of the shortened 2-byte JMPs consists of
      single-byte NOPs which, on smart microarchitectures, are discarded at
      decode time, thus freeing up execution bandwidth.
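
      The displacement rework in the recompute_jumps step above can be
      illustrated with a standalone userspace sketch; this is not the kernel
      code, just the same arithmetic applied to a copied 5-byte JMP:

        #include <stdint.h>
        #include <string.h>

        /* Re-aim a copied "e9 rel32" JMP at the patch site and shrink it to
         * "eb rel8" plus NOP padding when the new displacement fits in a
         * signed byte. Returns the resulting instruction length. */
        static int recompute_jmp(uint8_t insn_buf[5],
                                 uint64_t repl_addr, uint64_t patch_addr)
        {
                int32_t o_dspl, n_dspl;
                uint64_t target;

                if (insn_buf[0] != 0xe9)        /* only near JMPs handled here */
                        return 5;

                memcpy(&o_dspl, insn_buf + 1, 4);       /* little-endian rel32 */
                /* The target is relative to the end of the 5-byte replacement JMP. */
                target = repl_addr + 5 + (int64_t)o_dspl;

                /* Try the 2-byte form first, relative to the patch site. */
                n_dspl = (int32_t)(target - (patch_addr + 2));
                if (n_dspl >= -128 && n_dspl <= 127) {
                        insn_buf[0] = 0xeb;             /* JMP rel8 */
                        insn_buf[1] = (uint8_t)n_dspl;
                        memset(insn_buf + 2, 0x90, 3);  /* pad with NOPs */
                        return 2;
                }

                /* Otherwise keep the 5-byte form with a corrected displacement. */
                n_dspl = (int32_t)(target - (patch_addr + 5));
                memcpy(insn_buf + 1, &n_dspl, 4);
                return 5;
        }

      Feeding it the first example above (replacement at 0xffffffff81d0b23c,
      patch site 0xffffffff810014bd) yields the same "eb 33 90 90 90" result
      with length 2 as the dump.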
      Signed-off-by: Borislav Petkov <bp@suse.de>
      48c7a250
    • x86/alternatives: Add instruction padding · 4332195c
      Committed by Borislav Petkov
      Up until now we have always paid attention to make sure the length of
      the new instruction replacing the old one is less than or equal to
      the length of the old instruction. If the new instruction is longer, at
      the time it replaces the old instruction it will overwrite the beginning
      of the next instruction in the kernel image and cause your pants to
      catch fire.
      
      So instead of having to pay attention, teach the alternatives framework
      to pad shorter old instructions with NOPs at build time - but only in the
      case when
      
        len(old instruction(s)) < len(new instruction(s))
      
      and add nothing in the >= case. (In that case we do add_nops() when
      patching).
      
      This way the alternatives user shouldn't have to care about instruction
      sizes and simply use the macros.
      
      Add asm ALTERNATIVE* flavor macros too, while at it.
      
      Also, we need to save the pad length in a separate struct alt_instr
      member for NOP optimization, and the way to do that reliably is to carry
      the pad length instead of trying to detect whether we're looking at
      single-byte NOPs or at pathological instruction offsets like e9 90 90 90
      90, for example, which is a valid instruction.
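
      For reference, the record carrying this ends up shaped roughly like the
      sketch below; apart from padlen, the field names follow the existing
      struct alt_instr layout and are written here from memory, so treat them
      as an approximation rather than the patch's exact definition:

        struct alt_instr {
                s32 instr_offset;       /* original instruction, relative offset */
                s32 repl_offset;        /* offset to the replacement instruction */
                u16 cpuid;              /* CPU feature bit gating the replacement */
                u8  instrlen;           /* length of the original incl. padding */
                u8  replacementlen;     /* length of the replacement instruction */
                u8  padlen;             /* new: NOP padding added at build time */
        } __packed;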
      
      Thanks to Michael Matz for the great help with toolchain questions.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      4332195c