1. 26 Aug 2021: 1 commit
  2. 24 Jun 2021: 1 commit
    • x86/fpu: Add PKRU storage outside of task XSAVE buffer · 9782a712
      Dave Hansen authored
      PKRU is currently partly XSAVE-managed and partly not. It has space
      in the task XSAVE buffer and is context-switched by XSAVE/XRSTOR.
      However, it is switched more eagerly than FPU because there may be a
      need for PKRU to be up-to-date for things like copy_to/from_user() since
      PKRU affects user-permission memory accesses, not just accesses from
      userspace itself.
      
      This leaves PKRU in a very odd position. XSAVE brings very little value
      to the table for how Linux uses PKRU except for signal related XSTATE
      handling.
      
      Prepare to move PKRU away from being XSAVE-managed. Allocate space in
      the thread_struct for it and save/restore it in the context-switch path
      separately from the XSAVE-managed features. task->thread_struct.pkru
      is only valid when the task is scheduled out. For the current task the
      authoritative source is the hardware, i.e. it has to be retrieved via
      rdpkru().
      
      Leave the XSAVE code in place for now to ensure bisectability.
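      On a context switch the new handling amounts to roughly the sketch
      below (kernel-style C; the helper name and call site are illustrative
      assumptions, not the exact upstream code):

        static inline void pkru_switch(struct task_struct *prev,
                                       struct task_struct *next)
        {
                if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
                        return;

                /* Park the outgoing task's value; hardware is authoritative
                   while a task is running. */
                prev->thread.pkru = rdpkru();

                /* Make PKRU valid before any copy_to/from_user() done on
                   behalf of the incoming task. */
                wrpkru(next->thread.pkru);
        }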
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20210623121456.399107624@linutronix.de
      9782a712
  3. 18 May 2021: 1 commit
  4. 13 May 2021: 1 commit
  5. 29 Mar 2021: 1 commit
    • x86/process/64: Move cpu_current_top_of_stack out of TSS · 1591584e
      Lai Jiangshan authored
      cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed
      through the cpu_entry_area which is visible with user CR3 when PTI is
      enabled and active.
      
      This makes it a coveted target for attackers.  An attacker can fetch
      the kernel stack top from it and use that as the starting point for
      further attacks against the kernel stack.
      
      But it does not actually need to be stored in the TSS.  It is only
      accessed after the entry code has switched to kernel CR3 and kernel
      GS_BASE, which means it can live in any regular percpu variable.
      
      The reason why it is in TSS is historical (pre PTI) because TSS is also
      used as scratch space in SYSCALL_64 and therefore cache hot.
      
      A syscall also needs the per-CPU variable current_task and eventually
      __preempt_count, so placing cpu_current_top_of_stack next to them makes it
      likely that they end up in the same cache line, which should avoid
      performance regressions. This is not enforced, as the compiler is free to
      place these variables, so the entry-relevant variables should eventually
      move into a common data structure to make this enforceable.
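      A minimal sketch of the resulting arrangement (declaration plus
      accessor; the accessor name and any placement attributes are
      assumptions for illustration):

        DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);

        /* Only used after the entry code has switched to kernel CR3 and
           kernel GS_BASE, so a plain percpu read is sufficient. */
        static inline unsigned long current_top_of_stack(void)
        {
                return this_cpu_read_stable(cpu_current_top_of_stack);
        }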
      
      The seccomp_benchmark doesn't show any performance loss in the "getpid
      native" test result.  Actually, the result changes from 93ns before to 92ns
      with this change when KPTI is disabled. The test is very stable and
      although the test doesn't show a higher degree of precision it gives enough
      confidence that moving cpu_current_top_of_stack does not cause a
      regression.
      
      [ tglx: Removed unneeded export. Massaged changelog ]
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com
      1591584e
  6. 18 Mar 2021: 1 commit
    • x86: Fix various typos in comments · d9f6e12f
      Ingo Molnar authored
      Fix ~144 single-word typos in arch/x86/ code comments.
      
      Doing this in a single commit should reduce the churn.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: linux-kernel@vger.kernel.org
      d9f6e12f
  7. 17 Mar 2021: 1 commit
  8. 08 Mar 2021: 1 commit
    • x86/stackprotector/32: Make the canary into a regular percpu variable · 3fb0fdb3
      Andy Lutomirski authored
      On 32-bit kernels, the stackprotector canary is quite nasty -- it is
      stored at %gs:(20), which is nasty because 32-bit kernels use %fs for
      percpu storage.  It's even nastier because it means that whether %gs
      contains userspace state or kernel state while running kernel code
      depends on whether stackprotector is enabled (this is
      CONFIG_X86_32_LAZY_GS), and this setting radically changes the way
      that segment selectors work.  Supporting both variants is a
      maintenance and testing mess.
      
      Merely rearranging so that percpu and the stack canary
      share the same segment would be messy as the 32-bit percpu address
      layout isn't currently compatible with putting a variable at a fixed
      offset.
      
      Fortunately, GCC 8.1 added options that allow the stack canary to be
      accessed as %fs:__stack_chk_guard, effectively turning it into an ordinary
      percpu variable.  This lets us get rid of all of the code to manage the
      stack canary GDT descriptor and the CONFIG_X86_32_LAZY_GS mess.
      
      (That name is special.  We could use any symbol we want for the
       %fs-relative mode, but for CONFIG_SMP=n, gcc refuses to let us use any
       name other than __stack_chk_guard.)
      
      Forcibly disable stackprotector on older compilers that don't support
      the new options and turn the stack canary into a percpu variable. The
      "lazy GS" approach is now used for all 32-bit configurations.
      
      This also makes load_gs_index() work on 32-bit kernels. On 64-bit kernels,
      it loads the GS selector and updates the user GSBASE accordingly. (This
      is unchanged.) On 32-bit kernels, it loads the GS selector and updates
      GSBASE, which is now always the user base. This means that the overall
      effect is the same on 32-bit and 64-bit, which avoids some ifdeffery.
      
       [ bp: Massage commit message. ]
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/c0ff7dba14041c7e5d1cae5d4df052f03759bef3.1613243844.git.luto@kernel.org
      3fb0fdb3
  9. 11 Feb 2021: 2 commits
    • x86/irq/64: Adjust the per CPU irq stack pointer by 8 · 951c2a51
      Thomas Gleixner authored
      The per CPU hardirq_stack_ptr contains the pointer to the irq stack in the
      form that it is ready to be assigned to [ER]SP so that the first push ends
      up on the top entry of the stack.
      
      But the stack switching on 64 bit has the following rules:
      
          1) Store the current stack pointer (RSP) in the top most stack entry
             to allow the unwinder to link back to the previous stack
      
          2) Set RSP to the top most stack entry
      
          3) Invoke functions on the irq stack
      
          4) Pop RSP from the top most stack entry (stored in #1) so it's back
             to the original stack.
      
      That requires all stack switching code to decrement the stored pointer by 8
      in order to be able to store the current RSP and then set RSP to that
      location. That's a pointless exercise.
      
      Do the -8 adjustment right when storing the pointer and make the data type
      a void pointer to avoid confusion vs. the struct irq_stack data type, which
      on 64-bit is only used to declare the backing store. Move the definition
      next to the inuse flag so they likely end up in the same cache
      line. Sticking them into a struct to enforce this is a separate change.
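      A minimal sketch of the adjusted setup (the backing-store argument and
      the init-function name are assumptions for illustration):

        DECLARE_PER_CPU(void *, hardirq_stack_ptr);

        static void init_irq_stack(unsigned int cpu, char *stack_bottom)
        {
                /* Top of the stack minus one slot, so the entry code can save
                   RSP into that slot and load RSP from the pointer directly,
                   without decrementing it first. */
                per_cpu(hardirq_stack_ptr, cpu) = stack_bottom + IRQ_STACK_SIZE - 8;
        }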
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210210002512.354260928@linutronix.de
      951c2a51
    • x86/irq: Sanitize irq stack tracking · e7f89001
      Thomas Gleixner authored
      The recursion protection for hard interrupt stacks is an unsigned int
      per-CPU variable named __irq_count, initialized to -1.

      The irq stack switching is only done when the variable is -1, which creates
      worse code than just checking for 0. When the stack switching happens it
      uses this_cpu_add/sub(1), but there is no reason to do so. It can simply
      use straight writes. This is a historical leftover from the low-level ASM
      code, which used inc and jz to make the decision.
      
      Rename it to hardirq_stack_inuse, make it a bool and use plain stores.
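      The resulting tracking looks roughly like this sketch (the control flow
      is illustrative; the actual stack switch happens in asm):

        DECLARE_PER_CPU(bool, hardirq_stack_inuse);

        static void run_on_irqstack(void (*func)(void))
        {
                if (__this_cpu_read(hardirq_stack_inuse)) {
                        func();                         /* already on the irq stack */
                        return;
                }

                __this_cpu_write(hardirq_stack_inuse, true);   /* plain store */
                /* ... switch RSP to hardirq_stack_ptr and call func() ... */
                __this_cpu_write(hardirq_stack_inuse, false);  /* plain store */
        }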
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210210002512.228830141@linutronix.de
      
      e7f89001
  10. 19 Nov 2020: 1 commit
  11. 09 Sep 2020: 3 commits
  12. 04 Sep 2020: 1 commit
  13. 27 Jul 2020: 1 commit
  14. 25 Jun 2020: 1 commit
  15. 18 Jun 2020: 2 commits
  16. 11 Jun 2020: 1 commit
  17. 07 May 2020: 1 commit
  18. 22 Apr 2020: 1 commit
  19. 27 Mar 2020: 1 commit
  20. 21 Mar 2020: 1 commit
  21. 24 Jan 2020: 1 commit
    • x86/mpx: remove MPX from arch/x86 · 45fc24e8
      Dave Hansen authored
      From: Dave Hansen <dave.hansen@linux.intel.com>
      
      MPX is being removed from the kernel due to a lack of support
      in the toolchain going forward (gcc).
      
      This removes all of the (now dead) MPX handling code remaining in the
      tree.  The only code left is the XSAVE support for MPX state, which is
      currently needed for KVM to handle VMs that might use MPX.
      
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      45fc24e8
  22. 14 Jan 2020: 2 commits
    • x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs · b47ce1fe
      Sean Christopherson authored
      Add an entry in struct cpuinfo_x86 to track VMX capabilities and fill
      the capabilities during IA32_FEAT_CTL MSR initialization.
      
      Make the VMX capabilities dependent on IA32_FEAT_CTL and
      X86_FEATURE_NAMES so as to avoid unnecessary overhead on CPUs that can't
      possibly support VMX, or when /proc/cpuinfo is not available.
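      A minimal sketch of the idea, in kernel-style C; the function name, the
      capability-word index names and the exact cpuinfo_x86 field layout are
      assumptions for illustration, not the upstream definitions:

        static void init_vmx_capabilities(struct cpuinfo_x86 *c)
        {
                u64 msr;

                /* VMX controls are enumerated via MSRs, not CPUID; the
                   allowed-1 bits live in the high 32 bits of the control
                   MSRs, so a capability word can be filled directly. */
                rdmsrl(MSR_IA32_VMX_PINBASED_CTLS, msr);
                c->vmx_capability[PIN_CTLS] = (u32)(msr >> 32);

                rdmsrl(MSR_IA32_VMX_PROCBASED_CTLS, msr);
                c->vmx_capability[PROC_CTLS] = (u32)(msr >> 32);
        }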
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20191221044513.21680-11-sean.j.christopherson@intel.com
      b47ce1fe
    • x86/vmx: Introduce VMX_FEATURES_* · 15934878
      Sean Christopherson authored
      Add a VMX-specific variant of X86_FEATURE_* flags, which will eventually
      supplant the synthetic VMX flags defined in cpufeatures word 8.  Use the
      Intel-defined layouts for the major VMX execution controls so that their
      word entries can be directly populated from their respective MSRs, and
      so that the VMX_FEATURE_* flags can be used to define the existing bit
      definitions in asm/vmx.h, i.e. force developers to define a VMX_FEATURE
      flag when adding support for a new hardware feature.
      
      The majority of Intel's (and compatible CPU's) VMX capabilities are
      enumerated via MSRs and not CPUID, i.e. querying /proc/cpuinfo doesn't
      naturally provide any insight into the virtualization capabilities of
      VMX enabled CPUs.  Commit
      
        e38e05a8 ("x86: extended "flags" to show virtualization HW feature
      		 in /proc/cpuinfo")
      
      attempted to address the issue by synthesizing select VMX features into
      a Linux-defined word in cpufeatures.
      
      Lack of reporting of VMX capabilities via /proc/cpuinfo is problematic
      because there is no sane way for a user to query the capabilities of
      their platform, e.g. when trying to find a platform to test a feature or
      debug an issue that has a hardware dependency.  Lack of reporting is
      especially problematic when the user isn't familiar with VMX, e.g. the
      format of the MSRs is non-standard, existence of some MSRs is reported
      by bits in other MSRs, several "features" from KVM's point of view are
      enumerated as 3+ distinct features by hardware, etc...
      
      The synthetic cpufeatures approach has several flaws:
      
        - The set of synthesized VMX flags has become extremely stale with
          respect to the full set of VMX features, e.g. only one new flag
          (EPT A/D) has been added in the decade since the introduction of
          the synthetic VMX features.  Failure to keep the VMX flags up to
          date is likely due to the lack of a mechanism that forces developers
          to consider whether or not a new feature is worth reporting.
      
        - The synthetic flags may be misinterpreted as affecting kernel
          behavior, i.e. KVM, the kernel's sole consumer of VMX, completely
          ignores the synthetic flags.
      
        - New CPU vendors that support VMX have duplicated the hideous code
          that propagates VMX features from MSRs to cpufeatures.  Bringing the
          synthetic VMX flags up to date would exacerbate the copy+paste
          trainwreck.
      
      Define separate VMX_FEATURE flags to set the stage for enumerating VMX
      capabilities outside of the cpu_has() framework, and for adding
      functional usage of VMX_FEATURE_* to help ensure the features reported
      via /proc/cpuinfo are up to date with respect to kernel recognition of
      VMX capabilities.
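      A minimal sketch of the flag scheme described above (the word/bit
      encoding and the example flag are illustrative, not the exact upstream
      header):

        /* Capability words, populated directly from their respective MSRs. */
        #define NVMXINTS                  5

        /* word 0: pin-based execution controls (MSR_IA32_VMX_PINBASED_CTLS) */
        #define VMX_FEATURE_VIRTUAL_NMIS  ( 0*32 +  5) /* NMI virtualization */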
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20191221044513.21680-10-sean.j.christopherson@intel.com
      15934878
  23. 14 Dec 2019: 1 commit
  24. 27 Nov 2019: 3 commits
    • x86/doublefault/32: Move #DF stack and TSS to cpu_entry_area · dc4e0021
      Andy Lutomirski authored
      There are three problems with the current layout of the doublefault
      stack and TSS.  First, the TSS is only cacheline-aligned, which is
      not enough -- if the hardware portion of the TSS (struct x86_hw_tss)
      crosses a page boundary, horrible things happen [0].  Second, the
      stack and TSS are global, so simultaneous double faults on different
      CPUs will cause massive corruption.  Third, the whole mechanism
      won't work if user CR3 is loaded, resulting in a triple fault [1].
      
      Let the doublefault stack and TSS share a page (which prevents the
      TSS from spanning a page boundary), make it percpu, and move it into
      cpu_entry_area.  Teach the stack dump code about the doublefault
      stack.
      
      [0] Real hardware will read past the end of the page onto the next
          *physical* page if a task switch happens.  Virtual machines may
          have any number of bugs, and I would consider it reasonable for
          a VM to summarily kill the guest if it tries to task-switch to
          a page-spanning TSS.
      
      [1] Real hardware triple faults.  At least some VMs seem to hang.
          I'm not sure what's going on.
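      A minimal sketch of the page-sharing layout described above (struct and
      variable names are assumptions for illustration):

        struct doublefault_stack {
                unsigned long stack[(PAGE_SIZE - sizeof(struct x86_hw_tss)) /
                                    sizeof(unsigned long)];
                struct x86_hw_tss tss;
        } __aligned(PAGE_SIZE);

        /* One instance per CPU, mapped into cpu_entry_area; the hardware TSS
           can never cross a page boundary because the whole struct is exactly
           one page. */
        DECLARE_PER_CPU_PAGE_ALIGNED(struct doublefault_stack, doublefault_stack);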
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dc4e0021
    • x86/traps: Disentangle the 32-bit and 64-bit doublefault code · 93efbde2
      Andy Lutomirski authored
      The 64-bit doublefault handler is much nicer than the 32-bit one.
      As a first step toward unifying them, make the 64-bit handler
      self-contained.  This should have no functional effect
      except in the odd case of x86_64 with CONFIG_DOUBLEFAULT=n in which
      case it will change the logging a bit.
      
      This also gets rid of CONFIG_DOUBLEFAULT configurability on 64-bit
      kernels.  It didn't do anything useful -- CONFIG_DOUBLEFAULT=n
      didn't actually disable doublefault handling on x86_64.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      93efbde2
    • x86/iopl: Make 'struct tss_struct' constant size again · 0bcd7762
      Ingo Molnar authored
      After the following commit:
      
        05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      
      'struct cpu_entry_area' has to be Kconfig invariant, so that we always
      have a matching CPU_ENTRY_AREA_PAGES size.
      
      This commit added a CONFIG_X86_IOPL_IOPERM dependency to tss_struct:
      
        111e7b15: ("x86/ioperm: Extend IOPL config to control ioperm() as well")
      
      Which, if CONFIG_X86_IOPL_IOPERM is turned off, reduces the size of
      cpu_entry_area by two pages, triggering the assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_202’ declared with attribute error: BUILD_BUG_ON failed: (CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE
      
      Simplify the Kconfig dependencies and make cpu_entry_area constant
      size on 32-bit kernels again.
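      For reference, the failing assertion quoted above boils down to a size
      check along these lines (sketch; the surrounding comment is an
      assumption, the expression is taken from the build error):

        /* cpu_entry_area must have a Kconfig-invariant size: the map size is
           computed from CPU_ENTRY_AREA_PAGES plus one extra page. */
        BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES + 1) * PAGE_SIZE !=
                     CPU_ENTRY_AREA_MAP_SIZE);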
      
      Fixes: 05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0bcd7762
  25. 16 Nov 2019: 8 commits
  26. 05 Nov 2019: 1 commit
    • x86/mm: Report which part of kernel image is freed · 5494c3a6
      Kees Cook authored
      The memory freeing report wasn't very useful for figuring out which
      parts of the kernel image were being freed. Add the details for clearer
      reporting in dmesg.
      
      Before:
      
        Freeing unused kernel image memory: 1348K
        Write protecting the kernel read-only data: 20480k
        Freeing unused kernel image memory: 2040K
        Freeing unused kernel image memory: 172K
      
      After:
      
        Freeing unused kernel image (initmem) memory: 1348K
        Write protecting the kernel read-only data: 20480k
        Freeing unused kernel image (text/rodata gap) memory: 2040K
        Freeing unused kernel image (rodata/data gap) memory: 172K
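      A minimal sketch of what the reporting change amounts to (the function
      name and signature are assumptions, not the exact upstream helper):

        /* Callers pass a short description so dmesg identifies the region. */
        static void free_kernel_image_pages(const char *what, void *begin, void *end)
        {
                unsigned long npages = ((unsigned long)end -
                                        (unsigned long)begin) >> PAGE_SHIFT;

                pr_info("Freeing %s memory: %luK\n",
                        what, npages << (PAGE_SHIFT - 10));
                /* ... hand the pages back to the page allocator ... */
        }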
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20191029211351.13243-28-keescook@chromium.org
      5494c3a6