1. 15 Oct, 2021 (1 commit)
  2. 26 Aug, 2021 (1 commit)
  3. 28 Jul, 2021 (1 commit)
  4. 24 Jun, 2021 (1 commit)
    • x86/fpu: Add PKRU storage outside of task XSAVE buffer · 9782a712
      Committed by Dave Hansen
      PKRU is currently partly XSAVE-managed and partly not. It has space
      in the task XSAVE buffer and is context-switched by XSAVE/XRSTOR.
      However, it is switched more eagerly than the rest of the FPU state
      because PKRU needs to be up-to-date for things like copy_to/from_user(),
      since PKRU affects user-permission memory accesses, not just accesses
      from userspace itself.
      
      This leaves PKRU in a very odd position. XSAVE brings very little value
      to the table for how Linux uses PKRU except for signal related XSTATE
      handling.
      
      Prepare to move PKRU away from being XSAVE-managed. Allocate space in
      the thread_struct for it and save/restore it in the context-switch path
      separately from the XSAVE-managed features. task->thread_struct.pkru
      is only valid when the task is scheduled out. For the current task the
      authoritative source is the hardware, i.e. it has to be retrieved via
      rdpkru().
      
      Leave the XSAVE code in place for now to ensure bisectability.
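
      As a rough sketch of the intended split (illustrative only, not the
      actual patch; the helper names are made up): PKRU is read from the
      hardware into thread_struct when a task is scheduled out and written
      back with wrpkru() when it is scheduled in.

      static inline void pkru_sched_out(struct task_struct *prev)
      {
              if (cpu_feature_enabled(X86_FEATURE_OSPKE))
                      prev->thread.pkru = rdpkru();   /* hardware is authoritative for current */
      }

      static inline void pkru_sched_in(struct task_struct *next)
      {
              if (cpu_feature_enabled(X86_FEATURE_OSPKE))
                      wrpkru(next->thread.pkru);      /* thread.pkru is only valid while scheduled out */
      }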
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20210623121456.399107624@linutronix.de
  5. 18 May, 2021 (1 commit)
  6. 13 May, 2021 (1 commit)
  7. 29 Mar, 2021 (1 commit)
    • x86/process/64: Move cpu_current_top_of_stack out of TSS · 1591584e
      Committed by Lai Jiangshan
      cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed
      through the cpu_entry_area which is visible with user CR3 when PTI is
      enabled and active.
      
      This makes it a tempting target for attackers.  An attacker can fetch the
      kernel stack top from it and use it as the basis for further attacks
      against the kernel stack.
      
      But it does not actually need to be stored in the TSS.  It is only
      accessed after the entry code has switched to kernel CR3 and kernel GS_BASE,
      which means it can live in any regular percpu variable.
      
      The reason it is in the TSS is historical (pre-PTI): the TSS is also used
      as scratch space in SYSCALL_64 and is therefore cache hot.
      
      A syscall also needs the per-CPU variables current_task and eventually
      __preempt_count, so placing cpu_current_top_of_stack next to them makes it
      likely that they end up in the same cache line, which should avoid
      performance regressions. This is not enforced, as the compiler is free to
      place these variables; moving these entry-relevant variables into a data
      structure would make it enforceable.
      
      The seccomp_benchmark doesn't show any performance loss in the "getpid
      native" test result.  In fact, the result changes from 93ns before to 92ns
      with this change when KPTI is disabled. The test is very stable and,
      while it doesn't offer a high degree of precision, it gives enough
      confidence that moving cpu_current_top_of_stack does not cause a
      regression.
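
      A minimal sketch of the end state described above (illustrative
      declarations, using the usual percpu accessors):

      /* A plain percpu variable instead of TSS.sp1 (illustrative). */
      DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);

      static __always_inline unsigned long current_top_of_stack(void)
      {
              /* Only reachable after the entry code switched to kernel CR3 and GS_BASE. */
              return this_cpu_read_stable(cpu_current_top_of_stack);
      }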
      
      [ tglx: Removed unneeded export. Massaged changelog ]
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com
  8. 18 Mar, 2021 (1 commit)
    • x86: Fix various typos in comments · d9f6e12f
      Committed by Ingo Molnar
      Fix ~144 single-word typos in arch/x86/ code comments.
      
      Doing this in a single commit should reduce the churn.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: linux-kernel@vger.kernel.org
  9. 17 Mar, 2021 (1 commit)
  10. 08 Mar, 2021 (1 commit)
    • x86/stackprotector/32: Make the canary into a regular percpu variable · 3fb0fdb3
      Committed by Andy Lutomirski
      On 32-bit kernels, the stackprotector canary is quite nasty -- it is
      stored at %gs:(20), which is nasty because 32-bit kernels use %fs for
      percpu storage.  It's even nastier because it means that whether %gs
      contains userspace state or kernel state while running kernel code
      depends on whether stackprotector is enabled (this is
      CONFIG_X86_32_LAZY_GS), and this setting radically changes the way
      that segment selectors work.  Supporting both variants is a
      maintenance and testing mess.
      
      Merely rearranging so that percpu and the stack canary
      share the same segment would be messy as the 32-bit percpu address
      layout isn't currently compatible with putting a variable at a fixed
      offset.
      
      Fortunately, GCC 8.1 added options that allow the stack canary to be
      accessed as %fs:__stack_chk_guard, effectively turning it into an ordinary
      percpu variable.  This lets us get rid of all of the code to manage the
      stack canary GDT descriptor and the CONFIG_X86_32_LAZY_GS mess.
      
      (That name is special.  We could use any symbol we want for the
       %fs-relative mode, but for CONFIG_SMP=n, gcc refuses to let us use any
       name other than __stack_chk_guard.)
      
      Forcibly disable stackprotector on older compilers that don't support
      the new options and turn the stack canary into a percpu variable. The
      "lazy GS" approach is now used for all 32-bit configurations.
      
      This also makes load_gs_index() work on 32-bit kernels. On 64-bit kernels,
      it loads the GS selector and updates the user GSBASE accordingly. (This
      is unchanged.) On 32-bit kernels, it loads the GS selector and updates
      GSBASE, which is now always the user base. This means that the overall
      effect is the same on 32-bit and 64-bit, which avoids some ifdeffery.
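
      A sketch of the resulting setup, assuming GCC 8.1's stack-protector guard
      options (the flag names are GCC x86 options; the build wiring itself is
      not shown):

      /*
       * Illustrative: with
       *   -mstack-protector-guard-reg=fs
       *   -mstack-protector-guard-symbol=__stack_chk_guard
       * the canary becomes an ordinary percpu variable and no GDT descriptor
       * management is needed anymore.
       */
      DECLARE_PER_CPU(unsigned long, __stack_chk_guard);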
      
       [ bp: Massage commit message. ]
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/c0ff7dba14041c7e5d1cae5d4df052f03759bef3.1613243844.git.luto@kernel.org
  11. 11 Feb, 2021 (2 commits)
    • x86/irq/64: Adjust the per CPU irq stack pointer by 8 · 951c2a51
      Committed by Thomas Gleixner
      The per CPU hardirq_stack_ptr contains the pointer to the irq stack in the
      form that it is ready to be assigned to [ER]SP so that the first push ends
      up on the top entry of the stack.
      
      But the stack switching on 64 bit has the following rules:
      
          1) Store the current stack pointer (RSP) in the top most stack entry
             to allow the unwinder to link back to the previous stack
      
          2) Set RSP to the top most stack entry
      
          3) Invoke functions on the irq stack
      
          4) Pop RSP from the top most stack entry (stored in #1) so it's back
             to the original stack.
      
      That requires all stack switching code to decrement the stored pointer by 8
      in order to be able to store the current RSP and then set RSP to that
      location. That's a pointless exercise.
      
      Do the -8 adjustment right when storing the pointer and make the data type
      a void pointer to avoid confusion with the struct irq_stack data type,
      which on 64-bit is only used to declare the backing store. Move the
      definition next to the inuse flag so they likely end up in the same cache
      line. Sticking them into a struct to enforce this is a separate change.
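
      A sketch of the adjusted initialization (the init helper is illustrative;
      hardirq_stack_ptr is the percpu pointer discussed above):

      DEFINE_PER_CPU(void *, hardirq_stack_ptr);

      static void init_hardirq_stack_ptr(int cpu, struct irq_stack *backing)
      {
              char *top = (char *)(backing + 1);      /* one past the end of the backing store */

              /* Pre-adjust by 8 so the entry code can store RSP into the top slot directly. */
              per_cpu(hardirq_stack_ptr, cpu) = top - 8;
      }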
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210210002512.354260928@linutronix.de
    • x86/irq: Sanitize irq stack tracking · e7f89001
      Committed by Thomas Gleixner
      The recursion protection for hard interrupt stacks is a per-CPU unsigned
      int variable named __irq_count, initialized to -1.

      The irq stack switching is only done when the variable is -1, which creates
      worse code than just checking for 0. When the stack switch happens it
      uses this_cpu_add/sub(1), but there is no reason to do so. It can simply
      use plain writes. This is a historical leftover from the low-level ASM
      code, which used inc and jz to make a decision.
      
      Rename it to hardirq_stack_inuse, make it a bool and use plain stores.
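
      A sketch of the resulting check (illustrative wrapper; the actual switch
      of RSP to the irq stack is elided):

      DECLARE_PER_CPU(bool, hardirq_stack_inuse);

      static void run_on_irqstack_sketch(void (*func)(void))
      {
              if (__this_cpu_read(hardirq_stack_inuse)) {
                      func();         /* already on the irq stack: stay on the current stack */
                      return;
              }

              __this_cpu_write(hardirq_stack_inuse, true);
              /* ... switch RSP to hardirq_stack_ptr and call func() ... */
              __this_cpu_write(hardirq_stack_inuse, false);
      }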
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210210002512.228830141@linutronix.de
      
  12. 19 Nov, 2020 (1 commit)
  13. 09 Sep, 2020 (3 commits)
  14. 04 Sep, 2020 (1 commit)
  15. 27 Jul, 2020 (1 commit)
  16. 25 Jun, 2020 (1 commit)
  17. 18 Jun, 2020 (2 commits)
  18. 11 Jun, 2020 (1 commit)
  19. 07 May, 2020 (1 commit)
  20. 22 Apr, 2020 (1 commit)
  21. 27 Mar, 2020 (1 commit)
  22. 21 Mar, 2020 (1 commit)
  23. 24 Jan, 2020 (1 commit)
    • x86/mpx: remove MPX from arch/x86 · 45fc24e8
      Committed by Dave Hansen
      From: Dave Hansen <dave.hansen@linux.intel.com>
      
      MPX is being removed from the kernel due to a lack of support
      in the toolchain going forward (gcc).
      
      This removes all of the (dead at this point) MPX handling code
      remaining in the tree.  The only code left is the XSAVE support for
      MPX state, which is currently needed for KVM to handle VMs which
      might use MPX.
      
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
  24. 14 Jan, 2020 (2 commits)
    • x86/cpu: Detect VMX features on Intel, Centaur and Zhaoxin CPUs · b47ce1fe
      Committed by Sean Christopherson
      Add an entry in struct cpuinfo_x86 to track VMX capabilities and fill
      the capabilities during IA32_FEAT_CTL MSR initialization.
      
      Make the VMX capabilities dependent on IA32_FEAT_CTL and
      X86_FEATURE_NAMES so as to avoid unnecessary overhead on CPUs that can't
      possibly support VMX, or when /proc/cpuinfo is not available.
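
      A rough sketch of the capability fill during IA32_FEAT_CTL
      initialization (the array index and layout are illustrative, not the
      exact patch):

      static void init_vmx_capabilities(struct cpuinfo_x86 *c)
      {
              u32 ignored, allowed1;

              /* The high word of the CTLS MSRs enumerates the allowed-1 control bits. */
              rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, ignored, allowed1);
              c->vmx_capability[0] = allowed1;        /* index/layout illustrative */
      }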
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20191221044513.21680-11-sean.j.christopherson@intel.com
    • x86/vmx: Introduce VMX_FEATURES_* · 15934878
      Committed by Sean Christopherson
      Add a VMX-specific variant of X86_FEATURE_* flags, which will eventually
      supplant the synthetic VMX flags defined in cpufeatures word 8.  Use the
      Intel-defined layouts for the major VMX execution controls so that their
      word entries can be directly populated from their respective MSRs, and
      so that the VMX_FEATURE_* flags can be used to define the existing bit
      definitions in asm/vmx.h, i.e. force developers to define a VMX_FEATURE
      flag when adding support for a new hardware feature.
      
      The majority of Intel's (and compatible CPUs') VMX capabilities are
      enumerated via MSRs and not CPUID, i.e. querying /proc/cpuinfo doesn't
      naturally provide any insight into the virtualization capabilities of
      VMX enabled CPUs.  Commit
      
        e38e05a8 ("x86: extended "flags" to show virtualization HW feature
      		 in /proc/cpuinfo")
      
      attempted to address the issue by synthesizing select VMX features into
      a Linux-defined word in cpufeatures.
      
      Lack of reporting of VMX capabilities via /proc/cpuinfo is problematic
      because there is no sane way for a user to query the capabilities of
      their platform, e.g. when trying to find a platform to test a feature or
      debug an issue that has a hardware dependency.  Lack of reporting is
      especially problematic when the user isn't familiar with VMX, e.g. the
      format of the MSRs is non-standard, existence of some MSRs is reported
      by bits in other MSRs, several "features" from KVM's point of view are
      enumerated as 3+ distinct features by hardware, etc...
      
      The synthetic cpufeatures approach has several flaws:
      
        - The set of synthesized VMX flags has become extremely stale with
          respect to the full set of VMX features, e.g. only one new flag
          (EPT A/D) has been added in the decade since the introduction of
          the synthetic VMX features.  Failure to keep the VMX flags up to
          date is likely due to the lack of a mechanism that forces developers
          to consider whether or not a new feature is worth reporting.
      
        - The synthetic flags may be misinterpreted as affecting kernel
          behavior, i.e. KVM, the kernel's sole consumer of VMX, completely
          ignores the synthetic flags.
      
        - New CPU vendors that support VMX have duplicated the hideous code
          that propagates VMX features from MSRs to cpufeatures.  Bringing the
          synthetic VMX flags up to date would exacerbate the copy+paste
          trainwreck.
      
      Define separate VMX_FEATURE flags to set the stage for enumerating VMX
      capabilities outside of the cpu_has() framework, and for adding
      functional usage of VMX_FEATURE_* to help ensure the features reported
      via /proc/cpuinfo are up to date with respect to kernel recognition of
      VMX capabilities.
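
      A sketch of the flag scheme (the word/bit positions shown here are
      illustrative; each flag encodes word * 32 + bit, like X86_FEATURE_*):

      #define VMX_FEATURE_VIRTUAL_NMIS        ( 0*32 +  5) /* NMI virtualization */
      #define VMX_FEATURE_EPT                 ( 2*32 +  1) /* Extended Page Tables */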
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20191221044513.21680-10-sean.j.christopherson@intel.com
  25. 14 Dec, 2019 (1 commit)
  26. 27 Nov, 2019 (3 commits)
    • x86/doublefault/32: Move #DF stack and TSS to cpu_entry_area · dc4e0021
      Committed by Andy Lutomirski
      There are three problems with the current layout of the doublefault
      stack and TSS.  First, the TSS is only cacheline-aligned, which is
      not enough -- if the hardware portion of the TSS (struct x86_hw_tss)
      crosses a page boundary, horrible things happen [0].  Second, the
      stack and TSS are global, so simultaneous double faults on different
      CPUs will cause massive corruption.  Third, the whole mechanism
      won't work if user CR3 is loaded, resulting in a triple fault [1].
      
      Let the doublefault stack and TSS share a page (which prevents the
      TSS from spanning a page boundary), make it percpu, and move it into
      cpu_entry_area.  Teach the stack dump code about the doublefault
      stack.
      
      [0] Real hardware will read past the end of the page onto the next
          *physical* page if a task switch happens.  Virtual machines may
          have any number of bugs, and I would consider it reasonable for
          a VM to summarily kill the guest if it tries to task-switch to
          a page-spanning TSS.
      
      [1] Real hardware triple faults.  At least some VMs seem to hang.
          I'm not sure what's going on.
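
      A sketch of the shared page layout (close to, but not necessarily
      identical to, the final struct):

      struct doublefault_stack {
              unsigned long stack[(PAGE_SIZE - sizeof(struct x86_hw_tss)) / sizeof(unsigned long)];
              struct x86_hw_tss tss;
      } __aligned(PAGE_SIZE);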
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
    • x86/traps: Disentangle the 32-bit and 64-bit doublefault code · 93efbde2
      Committed by Andy Lutomirski
      The 64-bit doublefault handler is much nicer than the 32-bit one.
      As a first step toward unifying them, make the 64-bit handler
      self-contained.  This should have no functional effect except in the odd
      case of x86_64 with CONFIG_DOUBLEFAULT=n, in which case it will change
      the logging a bit.
      
      This also gets rid of CONFIG_DOUBLEFAULT configurability on 64-bit
      kernels.  It didn't do anything useful -- CONFIG_DOUBLEFAULT=n
      didn't actually disable doublefault handling on x86_64.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
    • x86/iopl: Make 'struct tss_struct' constant size again · 0bcd7762
      Committed by Ingo Molnar
      After the following commit:
      
        05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      
      'struct cpu_entry_area' has to be Kconfig invariant, so that we always
      have a matching CPU_ENTRY_AREA_PAGES size.
      
      This commit added a CONFIG_X86_IOPL_IOPERM dependency to tss_struct:
      
        111e7b15: ("x86/ioperm: Extend IOPL config to control ioperm() as well")
      
      Which, if CONFIG_X86_IOPL_IOPERM is turned off, reduces the size of
      cpu_entry_area by two pages, triggering the assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_202’ declared with attribute error: BUILD_BUG_ON failed: (CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE
      
      Simplify the Kconfig dependencies and make cpu_entry_area constant
      size on 32-bit kernels again.
      
      Fixes: 05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  27. 16 Nov, 2019 (7 commits)
    • x86/ioperm: Extend IOPL config to control ioperm() as well · 111e7b15
      Committed by Thomas Gleixner
      If iopl() is disabled, then providing ioperm() does not make much sense.
      
      Rename the config option and disable/enable both syscalls with it. Guard
      the code with #ifdefs where appropriate.
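
      A sketch of the guarding (the stub shape is illustrative, not the exact
      patch):

      #ifdef CONFIG_X86_IOPL_IOPERM
      long ksys_ioperm(unsigned long from, unsigned long num, int turn_on);
      #else
      static inline long ksys_ioperm(unsigned long from, unsigned long num, int turn_on)
      {
              return -ENOSYS;
      }
      #endif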
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/iopl: Remove legacy IOPL option · a24ca997
      Committed by Thomas Gleixner
      The IOPL emulation via the I/O bitmap is sufficient. Remove the legacy
      cruft dealing with the (e)flags based IOPL mechanism.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Juergen Gross <jgross@suse.com> (Paravirt and Xen parts)
      Acked-by: Andy Lutomirski <luto@kernel.org>
    • x86/iopl: Restrict iopl() permission scope · c8137ace
      Committed by Thomas Gleixner
      Access to the full I/O port range can also be provided by the TSS I/O
      bitmap, but that would require copying 8k of data when scheduling the
      task in. As shown with the sched-out optimization, TSS.io_bitmap_base can
      be used to switch the incoming task to a preallocated I/O bitmap which has
      all bits zero, i.e. allows access to all I/O ports.
      
      Implementing this makes it possible to provide an iopl() emulation mode
      which restricts the IOPL level 3 permissions to I/O port access but
      removes the STI/CLI permission which comes with the hardware IOPL
      mechanism.
      
      Provide a config option to switch IOPL to emulation mode, make it the
      default and while at it also provide an option to disable IOPL completely.
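
      A sketch of the emulation path on context switch (names follow the
      description above but are illustrative):

      static void tss_set_io_bitmap_for(struct task_struct *t)
      {
              struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);

              if (t->thread.iopl_emul == 3)
                      /* All-zero bitmap: every port allowed, but no STI/CLI. */
                      tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID_ALL;
              else
                      tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
      }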
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
    • x86/ioperm: Add bitmap sequence number · 060aa16f
      Committed by Thomas Gleixner
      Add a globally unique sequence number which is incremented whenever
      ioperm() changes the I/O bitmap of a task. Store the new sequence number
      in the io_bitmap structure and compare it with the sequence number of the
      I/O bitmap which was last loaded on a CPU. Only update the bitmap if the
      sequence is different.
      
      That should further reduce the overhead of I/O bitmap scheduling when there
      are only a few I/O bitmap users on the system.
      
      The 64-bit sequence counter is sufficient. A wraparound of the sequence
      counter, assuming an ioperm() call every nanosecond, would require about
      584 years of uptime.
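
      A sketch of the comparison on schedule-in (field names are illustrative
      and may differ from the final code):

      struct io_bitmap {
              u64             sequence;       /* sequence at the time of the last ioperm() change */
              unsigned int    max;            /* bytes of the bitmap that are actually in use */
              unsigned long   bitmap[IO_BITMAP_LONGS];
      };

      static void tss_copy_io_bitmap(struct tss_struct *tss, struct io_bitmap *iobm)
      {
              /* Only copy when this CPU's TSS holds a stale copy of the task's bitmap. */
              if (tss->io_bitmap.prev_sequence == iobm->sequence)
                      return;

              memcpy(tss->io_bitmap.bitmap, iobm->bitmap, iobm->max);
              tss->io_bitmap.prev_sequence = iobm->sequence;
      }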
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/ioperm: Move iobitmap data into a struct · 577d5cd7
      Committed by Thomas Gleixner
      No point in having all the data in thread_struct, especially as upcoming
      changes add more.
      
      Make the bitmap in the new struct accessible as an array of longs and as
      an array of characters via a union, so both the bitmap functions and the
      update logic can avoid type casts.
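
      A sketch of such a struct with the union described above (illustrative;
      field names may differ from the final code):

      struct io_bitmap {
              unsigned int    max;
              union {
                      unsigned long   bits[IO_BITMAP_LONGS];
                      unsigned char   bytes[IO_BITMAP_BYTES];
              };
      };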
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/tss: Move I/O bitmap data into a separate struct · f5848e5f
      Committed by Thomas Gleixner
      Move the non-hardware portion of the I/O bitmap data into a separate
      struct for readability's sake.
      Originally-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/io: Speedup schedule out of I/O bitmap user · ecc7e37d
      Committed by Thomas Gleixner
      There is no requirement to update the TSS I/O bitmap when a thread using it is
      scheduled out and the incoming thread does not use it.
      
      For the permission check based on the TSS I/O bitmap, the CPU calculates
      the memory location of the I/O bitmap from the address of the TSS and the
      io_bitmap_base member of the tss_struct. The easiest way to invalidate the
      I/O bitmap is to switch the offset to an address outside of the TSS limit.
      
      If an I/O instruction is issued from user space, the TSS limit causes #GP
      to be raised in the same way as a valid I/O bitmap with all bits set to 1
      would.
      
      This removes the extra work when an I/O bitmap using task is scheduled out
      and puts the burden on the rare I/O bitmap users when they are scheduled
      in.
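
      A sketch of the invalidation on schedule-out (the constant name reflects
      the description above and is illustrative):

      static inline void tss_invalidate_io_bitmap(struct tss_struct *tss)
      {
              /*
               * An offset past the TSS limit makes every I/O permission lookup
               * fail, raising #GP just like an all-ones bitmap would.
               */
              tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
      }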
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>