1. 03 Apr, 2015 1 commit
    • R
      x86/asm: Add support for the CLWB instruction · d9dc64f3
      Ross Zwisler committed
      Add support for the new CLWB (cache line write back)
      instruction.  This instruction was announced in the document
      "Intel Architecture Instruction Set Extensions Programming
      Reference" with reference number 319433-022.
      
        https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
      
      The CLWB instruction is used to write back the contents of
      dirtied cache lines to memory without evicting the cache lines
      from the processor's cache hierarchy.  This should be used in
      favor of clflushopt or clflush in cases where you require the
      cache line to be written to memory but plan to access the data
      again in the near future.
      
      One of the main use cases for this is with persistent memory
      where CLWB can be used with PCOMMIT to ensure that data has been
      accepted to memory and is durable on the DIMM.
      
      This function shows how to properly use CLWB/CLFLUSHOPT/CLFLUSH
      and PCOMMIT with appropriate fencing:
      
      void flush_and_commit_buffer(void *vaddr, unsigned int size)
      {
      	void *vend = vaddr + size - 1;
      
      	for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
      		clwb(vaddr);
      
      	/* Flush any possible final partial cacheline */
      	clwb(vend);
      
      	/*
      	 * Use SFENCE to order CLWB/CLFLUSHOPT/CLFLUSH cache flushes.
      	 * (MFENCE via mb() also works)
      	 */
      	wmb();
      
      	/* PCOMMIT and the required SFENCE for ordering */
      	pcommit_sfence();
      }
      
      After this function completes, the data pointed to by vaddr
      has been accepted to memory and will be durable if vaddr
      points to persistent memory.
      
      Regarding the details of how the alternatives assembly is set
      up, we need one additional byte at the beginning of the CLFLUSH
      so that we can flip it into a CLFLUSHOPT by changing that byte
      into a 0x66 prefix.  Two options are to either insert a 1 byte
      ASM_NOP1, or to add a 1 byte NOP_DS_PREFIX.  Both have no
      functional effect with the plain CLFLUSH, but I've been told
      that executing a CLFLUSH + prefix should be faster than
      executing a CLFLUSH + NOP.
      
      We had to hard code the assembly for CLWB because, lacking the
      ability to assemble the CLWB instruction itself, the next
      closest thing is to have an xsaveopt instruction with a 0x66
      prefix.  Unfortunately XSAVEOPT itself is also relatively new,
      and isn't included by all the GCC versions that the kernel needs
      to support.
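
As an illustrative sketch, not taken from the commit itself, the byte-level relationship described above can be shown in plain C. The encodings used are the documented ones with (%rax) as the memory operand: CLFLUSH is 0F AE /7, XSAVEOPT is 0F AE /6, and a 0x66 prefix turns them into CLFLUSHOPT and CLWB respectively. The helper name add_66_prefix() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encodings with (%rax) as the memory operand (ModRM mod=00, rm=000). */
static const unsigned char CLFLUSH_RAX[]  = { 0x0f, 0xae, 0x38 };  /* 0F AE /7 */
static const unsigned char XSAVEOPT_RAX[] = { 0x0f, 0xae, 0x30 };  /* 0F AE /6 */

/*
 * Hypothetical helper: prepend a 0x66 prefix to an instruction.
 * out must have room for len + 1 bytes; returns the new length.
 *   0x66 + CLFLUSH_RAX  -> 66 0f ae 38 (CLFLUSHOPT)
 *   0x66 + XSAVEOPT_RAX -> 66 0f ae 30 (CLWB)
 */
static size_t add_66_prefix(const unsigned char *insn, size_t len,
			    unsigned char *out)
{
	assert(insn != NULL && out != NULL);
	out[0] = 0x66;
	memcpy(out + 1, insn, len);
	return len + 1;
}
```

Because the prefixed XSAVEOPT and CLWB share the same bytes, emitting the four bytes directly avoids depending on an assembler that knows either mnemonic.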
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1422377631-8986-3-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 23 Feb, 2015 2 commits
    • B
      x86/alternatives: Make JMPs more robust · 48c7a250
      Borislav Petkov committed
      Up until now we had to pay attention to how the relative offset of
      relative JMPs in alternatives gets computed so that the jump target
      is still correct. Or, as is the case for near CALLs (opcode e8), we
      still have to go and readjust the offset at patching time.
      
      What is more, the static_cpu_has_safe() facility had to forcefully
      generate 5-byte JMPs since we couldn't rely on the compiler to generate
      properly sized ones so we had to force the longest ones. Worse than
      that, sometimes it would generate a replacement JMP which is longer than
      the original one, thus overwriting the beginning of the next instruction
      at patching time.
      
      So, in order to alleviate all that and make using JMPs more
      straightforward, we go and pad the original instruction in an
      alternative block with NOPs at build time, should the replacement(s) be
      longer. This way, alternatives users don't have to make sure that the
      original and replacement instruction sizes fit; the assembler simply
      adds padding where needed and does nothing otherwise.
      
      As a second aspect, we go and recompute JMPs at patching time so that we
      can try to make 5-byte JMPs into two-byte ones if possible. If not, we
      still have to recompute the offsets as the replacement JMP gets put far
      away in the .altinstr_replacement section leading to a wrong offset if
      copied verbatim.
      
      For example, on a locally generated kernel image
      
        old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
        __switch_to:
         ffffffff810014bd:      eb 21                   jmp ffffffff810014e0
        repl insn: size: 5
        ffffffff81d0b23c:       e9 b1 62 2f ff          jmpq ffffffff810014f2
      
      gets corrected to a 2-byte JMP:
      
        apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
        alt_insn: e9 b1 62 2f ff
        recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
        converted to: eb 33 90 90 90
      
      and a 5-byte JMP:
      
        old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
        __switch_to:
         ffffffff81001516:      eb 30                   jmp ffffffff81001548
        repl insn: size: 5
         ffffffff81d0b241:      e9 10 63 2f ff          jmpq ffffffff81001556
      
      gets shortened into a two-byte one:
      
        apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
        alt_insn: e9 10 63 2f ff
        recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
        converted to: eb 3e 90 90 90
      
      ... and so on.
      
      This leads to a net win of around
      
      40ish replacements * 3 bytes savings =~ 120 bytes of I$
      
      on an AMD guest which means some savings of precious instruction cache
      bandwidth. The padding after the shorter 2-byte JMPs consists of
      single-byte NOPs, which on smart microarchitectures are discarded at
      decode time, thus freeing up execution bandwidth.
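
The displacement arithmetic behind the two examples in this commit message can be sketched in standalone C. This is a simplified illustration under stated assumptions, not the kernel's own code: recompute_jmp() is a hypothetical helper, it only handles a 5-byte "e9 rel32" replacement, and it assumes a little-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: given a 5-byte "e9 rel32" JMP that was assembled at
 * repl_addr, rewrite it for execution at orig_addr. Writes 5 bytes to out
 * and returns the length of the resulting JMP (2 or 5). When the short form
 * is used, the three trailing bytes are single-byte NOPs (0x90).
 */
static int recompute_jmp(const uint8_t insn[5], uint64_t repl_addr,
			 uint64_t orig_addr, uint8_t out[5])
{
	int32_t rel32;
	uint64_t target;
	int64_t displ;

	assert(insn[0] == 0xe9);
	memcpy(&rel32, insn + 1, sizeof(rel32));   /* little-endian host assumed */

	/* rel32 is relative to the address of the next instruction. */
	target = repl_addr + 5 + (uint64_t)(int64_t)rel32;

	/* Try the short form first: "eb rel8", measured from orig_addr + 2. */
	displ = (int64_t)(target - (orig_addr + 2));
	if (displ >= -128 && displ <= 127) {
		out[0] = 0xeb;
		out[1] = (uint8_t)(int8_t)displ;
		memset(out + 2, 0x90, 3);
		return 2;
	}

	/* Otherwise keep the near form, re-based to orig_addr + 5. */
	displ = (int64_t)(target - (orig_addr + 5));
	assert(displ >= INT32_MIN && displ <= INT32_MAX);
	rel32 = (int32_t)displ;
	out[0] = 0xe9;
	memcpy(out + 1, &rel32, sizeof(rel32));
	return 5;
}
```

Feeding it the first example (bytes e9 b1 62 2f ff, replacement at 0xffffffff81d0b23c, original at 0xffffffff810014bd) gives target 0xffffffff810014f2 and a new displacement of 0x33, matching the "eb 33 90 90 90" shown in the log.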
      Signed-off-by: Borislav Petkov <bp@suse.de>
    • B
      x86/alternatives: Add instruction padding · 4332195c
      Borislav Petkov committed
      Up until now we have always paid attention to make sure the length of
      the new instruction replacing the old one is less than or equal to
      the length of the old instruction. If the new instruction is longer, at
      the time it replaces the old instruction it will overwrite the beginning
      of the next instruction in the kernel image and cause your pants to
      catch fire.
      
      So instead of having to pay attention, teach the alternatives framework
      to pad shorter old instructions with NOPs at buildtime - but only in the
      case when
      
        len(old instruction(s)) < len(new instruction(s))
      
      and add nothing in the >= case. (In that case we do add_nops() when
      patching).
      
      This way the alternatives user shouldn't have to care about instruction
      sizes and simply use the macros.
      
      Add asm ALTERNATIVE* flavor macros too, while at it.
      
      Also, we need to save the pad length in a separate struct alt_instr
      member for NOP optimization and the way to do that reliably is to carry
      the pad length instead of trying to detect whether we're looking at
      single-byte NOPs or at pathological instruction offsets like e9 90 90 90
      90, for example, which is a valid instruction.
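
The padding rule above can be sketched in standalone C. This is only an illustration: struct alt_site and both helpers are hypothetical names, and the real work is done by assembler macros at build time rather than by C code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record that carries the pad length explicitly. */
struct alt_site {
	size_t orig_len;   /* original instruction(s), without padding */
	size_t repl_len;   /* replacement instruction(s) */
	size_t pad_len;    /* NOP bytes appended to the original */
};

/* Pad only when len(old) < len(new); add nothing otherwise. */
static size_t compute_pad_len(size_t orig_len, size_t repl_len)
{
	return orig_len < repl_len ? repl_len - orig_len : 0;
}

/*
 * Lay out the padded original in buf (capacity cap) and record the
 * lengths in site. Returns the total number of bytes written.
 */
static size_t emit_padded(const unsigned char *orig, size_t orig_len,
			  size_t repl_len, unsigned char *buf, size_t cap,
			  struct alt_site *site)
{
	size_t pad = compute_pad_len(orig_len, repl_len);

	assert(orig_len + pad <= cap);
	memcpy(buf, orig, orig_len);
	memset(buf + orig_len, 0x90, pad);   /* single-byte NOPs */

	site->orig_len = orig_len;
	site->repl_len = repl_len;
	site->pad_len = pad;
	return orig_len + pad;
}
```

Storing pad_len means later NOP optimization never has to guess whether trailing 0x90 bytes are padding or part of an instruction such as e9 90 90 90 90.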
      
      Thanks to Michael Matz for the great help with toolchain questions.
      Signed-off-by: Borislav Petkov <bp@suse.de>
  3. 20 Feb, 2015 1 commit
    • R
      x86/asm: Add support for the pcommit instruction · 719d359d
      Ross Zwisler committed
      Add support for the new pcommit (persistent commit) instruction.
      This instruction was announced in the document "Intel
      Architecture Instruction Set Extensions Programming Reference"
      with reference number 319433-022:
      
        https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
      
      The pcommit instruction ensures that data that has been flushed
      from the processor's cache hierarchy with clwb, clflushopt or
      clflush is accepted to memory and is durable on the DIMM.  The
      primary use case for this is persistent memory.
      
      This function shows how to properly use clwb/clflushopt/clflush
      and pcommit with appropriate fencing:
      
      void flush_and_commit_buffer(void *vaddr, unsigned int size)
      {
      	void *vend = vaddr + size - 1;
      
      	for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
      		clwb(vaddr);
      
      	/* Flush any possible final partial cacheline */
      	clwb(vend);
      
      	/*
      	 * sfence to order clwb/clflushopt/clflush cache flushes
      	 * mfence via mb() also works
      	 */
      	wmb();
      
      	/* pcommit and the required sfence for ordering */
      	pcommit_sfence();
      }
      
      After this function completes, the data pointed to by vaddr
      has been accepted to memory and will be durable if vaddr
      points to persistent memory.
      
      Pcommit must always be ordered by an mfence or sfence, so to
      help simplify things we include both the pcommit and the
      required sfence in the alternatives generated by
      pcommit_sfence().  The other option is to keep them separated,
      but on platforms that don't support pcommit this would then turn
      into:
      
      void flush_and_commit_buffer(void *vaddr, unsigned int size)
      {
              void *vend = vaddr + size - 1;
      
              for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
                      clwb(vaddr);
      
              /* Flush any possible final partial cacheline */
              clwb(vend);
      
              /*
               * sfence to order clwb/clflushopt/clflush cache flushes
               * mfence via mb() also works
               */
              wmb();
      
              nop(); /* from pcommit(), via alternatives */
      
              /*
               * sfence to order pcommit
               * mfence via mb() also works
               */
              wmb();
      }
      
      This is still correct, but now you've got two fences separated
      by only a nop.  With the commit and the fence together in
      pcommit_sfence() you avoid the final unneeded fence.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1424367448-24254-1-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 03 Dec, 2014 1 commit
  5. 12 Nov, 2014 1 commit
  6. 24 Sep, 2014 1 commit
  7. 12 Sep, 2014 3 commits
    • D
      x86: Add more disabled features · 9298b815
      Dave Hansen committed
      The original motivation for these patches was for an Intel CPU
      feature called MPX.  The patch to add a disabled feature for it
      will go in with the other parts of the support.
      
      But, in the meantime, there are a few features other than MPX
      that we can make assumptions about at compile-time based on
      compile options.  Add them to disabled-features.h and check them
      with cpu_feature_enabled().
      
      Note that this gets rid of the last things that needed an #ifdef
      CONFIG_X86_64 in cpufeature.h.  Yay!
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140911211524.C0EC332A@viggo.jf.intel.com
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • D
      x86: Introduce disabled-features · 381aa07a
      Dave Hansen committed
      I believe the REQUIRED_MASK approach was taken so that it was
      easier to consult in assembly (arch/x86/kernel/verify_cpu.S).
      DISABLED_MASK does not have the same restriction, but I
      implemented it the same way for consistency.
      
      We have a REQUIRED_MASK... which does two things:
      1. Keeps a list of cpuid bits to check in very early boot and
         refuse to boot if those are not present.
      2. Consulted during cpu_has() checks, which allows us to
         optimize out things at compile-time.  In other words, if we
         *KNOW* we will not boot with the feature off, then we can
         safely assume that it will be present forever.
      
      But, we don't have a similar mechanism for CPU features which
      may be present but that we know we will not use.  We simply
      use our existing mechanisms to repeatedly check the status of
      the bit at runtime (well, the alternatives patching helps here
      but it does not provide compile-time optimization).
      
      Adding a feature to disabled-features.h allows the bit to be
      checked via a new macro: cpu_feature_enabled().  Note that
      for features in DISABLED_MASK, checks with this macro have
      all of the benefits of an #ifdef.  Before, we would have done
      this in a header:
      
      #ifdef CONFIG_X86_INTEL_MPX
      #define cpu_has_mpx cpu_has(X86_FEATURE_MPX)
      #else
      #define cpu_has_mpx 0
      #endif
      
      and this in the code:
      
      	if (cpu_has_mpx)
      		do_some_mpx_thing();
      
      Now, just add your feature to DISABLED_MASK and you can do this
      everywhere, and get the same benefits you would have from
      #ifdefs:
      
      	if (cpu_feature_enabled(X86_FEATURE_MPX))
      		do_some_mpx_thing();
      
      We need a new function and *not* a modification to cpu_has()
      because there are cases where we actually need to check the CPU
      itself, despite what features the kernel supports.  The best
      example of this is a hypervisor which has no control over what
      features its guests are using and where the guest does not depend
      on the host for support.
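
The difference between the two checks can be sketched in standalone C. This is a reduced model under stated assumptions, not the kernel's macros: it covers a single 32-bit capability word, and FEAT_A, FEAT_B, DISABLED_MASK0 and cpu_caps are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature numbers in the word * 32 + bit scheme. */
#define FEAT_A (0 * 32 + 3)
#define FEAT_B (0 * 32 + 7)

/* Build-time configuration: support for FEAT_B is compiled out. */
#define DISABLED_MASK0 (1u << (FEAT_B & 31))

/* Stand-in for the CPUID-derived capability word read at boot. */
static uint32_t cpu_caps[1];

/* What the CPU itself reports, regardless of kernel configuration. */
static bool cpu_has(int feat)
{
	assert(feat >= 0 && feat < 32);
	return (cpu_caps[feat / 32] >> (feat & 31)) & 1u;
}

/*
 * What the kernel will actually use. For a constant feat present in
 * DISABLED_MASK0 the condition is a compile-time constant, so the
 * compiler can drop the guarded code much as an #ifdef would.
 */
#define cpu_feature_enabled(feat) \
	((DISABLED_MASK0 & (1u << ((feat) & 31))) ? false : cpu_has(feat))
```

With both bits set in cpu_caps[0], cpu_has(FEAT_B) is still true while cpu_feature_enabled(FEAT_B) is false, which is the distinction a hypervisor inspecting the raw CPU relies on.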
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140911211513.9E35E931@viggo.jf.intel.com
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • D
      x86: Axe the lightly-used cpu_has_pae · c8128cce
      Dave Hansen committed
      cpu_has_pae is only referenced in one place: the X86_32 kexec
      code (in a file not even built on 64-bit).  It hardly warrants
      its own macro, or the trouble we go to ensuring that it can't
      be called in X86_64 code.
      
      Axe the macro and replace it with a direct cpu feature check.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20140911211511.AD76E774@viggo.jf.intel.com
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  8. 18 Aug, 2014 1 commit
    • J
      x86: Support compiling out human-friendly processor feature names · 9def39be
      Josh Triplett committed
      The table mapping CPUID bits to human-readable strings takes up a
      non-trivial amount of space, and only exists to support /proc/cpuinfo
      and a couple of kernel messages.  Since programs depend on the format of
      /proc/cpuinfo, force inclusion of the table when building with /proc
      support; otherwise, support omitting that table to save space, in which
      case the kernel messages will print features numerically instead.
      
      In addition to saving 1408 bytes out of vmlinux, this also saves 1373
      bytes out of the uncompressed setup code, which contributes directly to
      the size of bzImage.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
  9. 15 Jul, 2014 2 commits
  10. 19 Jun, 2014 1 commit
  11. 30 May, 2014 2 commits
  12. 28 Feb, 2014 2 commits
  13. 21 Feb, 2014 1 commit
  14. 19 Feb, 2014 1 commit
  15. 07 Dec, 2013 1 commit
  16. 11 Oct, 2013 1 commit
  17. 29 Jun, 2013 1 commit
  18. 21 Jun, 2013 3 commits
  19. 25 Apr, 2013 1 commit
  20. 21 Apr, 2013 1 commit
  21. 10 Apr, 2013 1 commit
  22. 03 Apr, 2013 6 commits
  23. 16 Mar, 2013 1 commit
  24. 16 Feb, 2013 1 commit
  25. 01 Dec, 2012 1 commit
    • W
      KVM: x86: Emulate IA32_TSC_ADJUST MSR · ba904635
      Will Auld committed
      CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
      
      Basic design is to emulate the MSR by allowing reads and writes to a guest
      vcpu specific location to store the value of the emulated MSR while adding
      the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
      be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
      is of course as long as the "use TSC counter offsetting" VM-execution control
      is enabled as well as the IA32_TSC_ADJUST control.
      
      However, because hardware will only return the TSC + IA32_TSC_ADJUST +
      vmcs tsc_offset for a guest process when it does a rdtsc (with the correct
      settings) the value of our virtualized IA32_TSC_ADJUST must be stored in one
      of these three locations. The argument against storing it in the actual MSR
      is performance. This is likely to be seldom used while the save/restore is
      required on every transition. IA32_TSC_ADJUST was created as a way to solve
      some issues with writing TSC itself so that is not an option either.
      
      The remaining option, defined above as our solution, has the problem of
      returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
      done here) as mentioned above. However, more problematic is that storing the
      data in vmcs tsc_offset will have a different semantic effect on the system
      than does using the actual MSR. This is illustrated in the following example:
      
      The hypervisor sets the IA32_TSC_ADJUST, then the guest sets it and a guest
      process performs a rdtsc. In this case the guest process will get
      TSC + IA32_TSC_ADJUST_hypervisor + vmcs tsc_offset including
      IA32_TSC_ADJUST_guest. While the total system semantics changed, the semantics
      as seen by the guest do not, and hence this will not cause a problem.
      Signed-off-by: Will Auld <will.auld@intel.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
  26. 30 Nov, 2012 1 commit
  27. 14 Nov, 2012 1 commit