1. 24 January 2017 (2 commits)
  2. 12 January 2017 (2 commits)
    • x86/unwind: Include __schedule() in stack traces · 2c96b2fe
      Committed by Josh Poimboeuf
      In the following commit:
      
        0100301b ("sched/x86: Rewrite the switch_to() code")
      
      ... the layout of the 'inactive_task_frame' struct was designed to have
      a frame pointer header embedded in it, so that the unwinder could use
      the 'bp' and 'ret_addr' fields to report __schedule() on the stack (or
      ret_from_fork() for newly forked tasks which haven't actually run yet).
      
      Finish the job by changing get_frame_pointer() to return a pointer to
      inactive_task_frame's 'bp' field rather than 'bp' itself.  This allows
      the unwinder to start one frame higher on the stack, so that it properly
      reports __schedule().
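      A minimal sketch of the resulting shape, with a deliberately simplified
      struct (illustrative only; the real definitions in the arch/x86 headers
      carry additional saved registers):

          struct inactive_task_frame {
                  unsigned long bp;       /* frame pointer header */
                  unsigned long ret_addr; /* __schedule() or ret_from_fork() */
          };

          /*
           * Before: the saved 'bp' value itself was returned.
           * After:  return the address of the 'bp' field, so the unwinder
           *         starts one frame higher and reports __schedule().
           */
          static unsigned long *sketch_get_frame_pointer(struct inactive_task_frame *frame)
          {
                  return &frame->bp;
          }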
      Reported-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/598e9f7505ed0aba86e8b9590aa528c6c7ae8dcd.1483978430.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/unwind: Disable KASAN checks for non-current tasks · 84936118
      Committed by Josh Poimboeuf
      There are a handful of callers to save_stack_trace_tsk() and
      show_stack() which try to unwind the stack of a task other than current.
      In such cases, it's remotely possible that the task is running on one
      CPU while the unwinder is reading its stack from another CPU, causing
      the unwinder to see stack corruption.
      
      These cases seem to be mostly harmless.  The unwinder has checks which
      prevent it from following bad pointers beyond the bounds of the stack.
      So it's not really a bug as long as the caller understands that
      unwinding another task will not always succeed.
      
      In such cases, it's possible that the unwinder may read a KASAN-poisoned
      region of the stack.  Account for that by using READ_ONCE_NOCHECK() when
      reading the stack of another task.
      
      Use READ_ONCE() when reading the stack of the current task, since KASAN
      warnings can still be useful for finding bugs in that case.
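      A rough sketch of that accessor choice (a hypothetical helper, not the
      unwinder's actual code; READ_ONCE()/READ_ONCE_NOCHECK() and 'current'
      come from the usual kernel headers):

          #include <linux/compiler.h>
          #include <linux/sched.h>

          /* Keep KASAN checking only for the current task; another task's
           * stack may legitimately contain poisoned (stale) slots. */
          static unsigned long read_stack_word(struct task_struct *task,
                                               unsigned long *addr)
          {
                  if (task == current)
                          return READ_ONCE(*addr);
                  return READ_ONCE_NOCHECK(*addr);
          }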
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 10 January 2017 (2 commits)
  4. 05 January 2017 (1 commit)
  5. 30 December 2016 (1 commit)
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Committed by Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And on architectures
      with LL/SC (which covers all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
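      A sketch of that generic, non-optimized fallback (simplified here; the
      real fallback and the optimized x86 override differ in detail and in
      where they live):

          #include <linux/types.h>
          #include <linux/bitops.h>

          #ifndef clear_bit_unlock_is_negative_byte
          /* Clear bit 'nr' with release semantics, then separately test bit 7,
           * the sign bit of the low byte.  Architectures that can fuse the two
           * operations (e.g. x86's "lock andb" plus the sign flag) override this. */
          static inline bool clear_bit_unlock_is_negative_byte(long nr,
                                                               volatile unsigned long *mem)
          {
                  clear_bit_unlock(nr, mem);
                  return test_bit(7, mem);
          }
          #endif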
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, the unlock_page() costs alone were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 25 December 2016 (2 commits)
  7. 19 December 2016 (6 commits)
  8. 18 December 2016 (1 commit)
  9. 17 December 2016 (2 commits)
  10. 15 December 2016 (3 commits)
    • x86/mm: Drop unused argument 'removed' from sync_global_pgds() · 5372e155
      Committed by Kirill A. Shutemov
      Since commit af2cf278 ("x86/mm/hotplug: Don't remove PGD entries in
      remove_pagetable()") there are no callers of sync_global_pgds() which set
      the 'removed' argument to 1.
      
      Remove the argument and the related conditionals in the function.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Link: http://lkml.kernel.org/r/20161214234403.137556-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/tsc: Force TSC_ADJUST register to value >= zero · 5bae1562
      Committed by Thomas Gleixner
      Roland reported that his DELL T5810 sports a value-add BIOS which
      completely wrecks the TSC. The squirmware [(TM) Ingo Molnar] boots with
      random negative TSC_ADJUST values, different on all CPUs. That renders the
      TSC useless because the synchronization check fails.
      
      Roland tested the new TSC_ADJUST mechanism. While it manages to readjust
      the TSCs, he needs to disable the TSC deadline timer; otherwise the machine
      just stops booting.
      
      Deeper investigation unearthed that the TSC deadline timer is sensitive to
      the TSC_ADJUST value. Writing TSC_ADJUST to a negative value results in an
      interrupt storm caused by the TSC deadline timer.
      
      This does not make any sense and it's hard to imagine what kind of hardware
      wreckage is behind that misfeature, but it's reliably reproducible on other
      systems which have TSC_ADJUST and TSC deadline timer.
      
      While it would be understandable that a big enough negative value which
      moves the resulting TSC readout into the negative space could have the
      described effect, this happens even with an adjust value of -1, which keeps
      the TSC readout definitely in the positive space. The compare register for
      the TSC deadline timer is set to a positive value larger than the TSC, but
      despite not having reached the deadline the interrupt is raised
      immediately. If this happens on the boot CPU, then the machine dies
      silently because this setup happens before the NMI watchdog is armed.
      
      Further experiments showed that any other adjustment of TSC_ADJUST works as
      expected as long as it stays in the positive range. The direction of the
      adjustment has no influence either. See the lkml link for further analysis.
      
      Yet another proof for the theory that timers are designed by janitors and
      the underlying (obviously undocumented) mechanisms which allow BIOSes to
      wreckage them are considered a feature. Well done Intel - NOT!
      
      To address this wreckage, add the following sanity measures (a rough
      sketch of the clamping policy follows the list):

      - If the TSC_ADJUST value on the boot CPU is not 0, set it to 0

      - If the TSC_ADJUST value on any CPU is negative, set it to 0

      - Prevent the cross-package synchronization mechanism from setting
        negative TSC_ADJUST values.
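      A rough sketch of that clamping policy (illustrative only; the real code
      sits in the x86 TSC sync path and is more involved):

          #include <linux/types.h>
          #include <asm/msr.h>

          static void sanitize_tsc_adjust(bool bootcpu)
          {
                  s64 adj;

                  rdmsrl(MSR_IA32_TSC_ADJUST, adj);

                  /* Boot CPU: force 0.  Any CPU: never keep a negative value,
                   * as that upsets the TSC deadline timer. */
                  if ((bootcpu && adj != 0) || adj < 0)
                          wrmsrl(MSR_IA32_TSC_ADJUST, 0);
          }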
      Reported-and-tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
      Cc: Kevin Stanton <kevin.b.stanton@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Allen Hung <allen_hung@dell.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: http://lkml.kernel.org/r/20161213131211.397588033@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/tsc: Validate TSC_ADJUST after resume · 6a369583
      Committed by Thomas Gleixner
      Some 'feature' BIOSes fiddle with the TSC_ADJUST register during
      suspend/resume, which renders the TSC unusable.
      
      Add sanity checks into the resume path and restore the
      original value if it was adjusted.
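      A minimal sketch of the resume-time check (hypothetical helper name;
      'saved' is assumed to be the sanitized value recorded before suspend):

          #include <linux/types.h>
          #include <asm/msr.h>

          static void verify_tsc_adjust_on_resume(s64 saved)
          {
                  s64 cur;

                  rdmsrl(MSR_IA32_TSC_ADJUST, cur);
                  if (cur != saved)
                          wrmsrl(MSR_IA32_TSC_ADJUST, saved); /* undo BIOS fiddling */
          }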
      Reported-and-tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
      Cc: Kevin Stanton <kevin.b.stanton@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Allen Hung <allen_hung@dell.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: http://lkml.kernel.org/r/20161213131211.317654500@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  11. 14 December 2016 (1 commit)
  12. 11 December 2016 (1 commit)
    • x86/paravirt: Fix bool return type for PVOP_CALL() · 11f254db
      Committed by Peter Zijlstra
      Commit:
      
        3cded417 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
      
      introduced a paravirt op with a bool return type [*].
      
      It turns out that the PVOP_CALL*() macros miscompile when rettype is
      bool. Code that looked like:
      
         83 ef 01                sub    $0x1,%edi
         ff 15 32 a0 d8 00       callq  *0xd8a032(%rip)        # ffffffff81e28120 <pv_lock_ops+0x20>
         84 c0                   test   %al,%al
      
      ended up looking like so after PVOP_CALL1() was applied:
      
         83 ef 01                sub    $0x1,%edi
         48 63 ff                movslq %edi,%rdi
         ff 14 25 20 81 e2 81    callq  *0xffffffff81e28120
         48 85 c0                test   %rax,%rax
      
      Note how it tests the whole of %rax, even though a typical bool return
      function only sets %al, like:
      
        0f 95 c0                setne  %al
        c3                      retq
      
      This is because ____PVOP_CALL() does:
      
      		__ret = (rettype)__eax;
      
      and while regular integer type casts truncate the result, a cast to
      bool tests for any !0 value. Fix this by explicitly truncating to
      sizeof(rettype) before casting.
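      A stand-alone illustration of the cast difference (plain user-space C,
      not the kernel macro itself):

          #include <stdbool.h>
          #include <stdio.h>

          int main(void)
          {
                  /* Pretend %rax came back as 0x100: %al is 0, upper bits are garbage. */
                  unsigned long rax = 0x100;

                  bool buggy = (bool)rax;                 /* any set bit -> true  */
                  bool fixed = (bool)(unsigned char)rax;  /* truncate first -> false */

                  printf("buggy=%d fixed=%d\n", buggy, fixed);  /* prints buggy=1 fixed=0 */
                  return 0;
          }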
      
      [*] The actual bug should've been exposed in commit:
            446f3dc8 ("locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests")
          but that didn't properly implement the paravirt call.
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 3cded417 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
      Link: http://lkml.kernel.org/r/20161208154349.346057680@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 10 December 2016 (5 commits)
  14. 09 December 2016 (2 commits)
    • tracing: Have the reg function allow to fail · 8cf868af
      Committed by Steven Rostedt (Red Hat)
      Some tracepoints have a registration function that gets called when the
      tracepoint is enabled. There may be cases where the registration function
      must fail (for example, when it can't allocate enough memory). In that
      case, the tracepoint should also fail to register; otherwise the user
      would not know why the tracepoint is not working.
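      A hypothetical registration callback showing the new contract (the names
      and buffer size below are made up for illustration): return 0 on success
      or a negative errno, which now propagates so that enabling the tracepoint
      fails visibly instead of silently doing nothing.

          #include <linux/slab.h>
          #include <linux/errno.h>

          #define MY_TRACE_BUF_SIZE 4096    /* hypothetical */

          static void *my_trace_buf;        /* hypothetical per-tracepoint state */

          static int my_trace_reg(void)
          {
                  my_trace_buf = kzalloc(MY_TRACE_BUF_SIZE, GFP_KERNEL);
                  if (!my_trace_buf)
                          return -ENOMEM;   /* tracepoint registration now fails */
                  return 0;
          }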
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • x86: Make E820_X_MAX unconditionally larger than E820MAX · 9d2f86c6
      Committed by Alex Thorlton
      It's really not necessary to limit E820_X_MAX to 128 in the non-EFI
      case.  This commit drops E820_X_MAX's dependency on CONFIG_EFI, so that
      E820_X_MAX is always at least slightly larger than E820MAX.
      
      The real motivation behind this is actually to prevent some issues in
      the Xen kernel, where the XENMEM_machine_memory_map hypercall can
      produce an e820 map larger than 128 entries, even on systems where the
      original e820 table was quite a bit smaller than that, depending on how
      many IOAPICs are installed on the system.
      Signed-off-by: Alex Thorlton <athorlton@sgi.com>
      Suggested-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Juergen Gross <jgross@suse.com>
  15. 08 December 2016 (5 commits)
    • KVM: nVMX: check host CR3 on vmentry and vmexit · 1dc35dac
      Committed by Ladi Prosek
      This commit adds missing host CR3 checks. Before entering guest mode, the value
      of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
      called to set the new CR3 value and check and load PDPTRs.
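      An illustrative version of the reserved-bit test (simplified; the real
      check uses KVM's internal helpers and also has to respect the low CR3
      flag bits):

          #include <linux/types.h>

          /* Bits at and above the guest's physical address width must be clear. */
          static bool cr3_has_reserved_bits(u64 cr3, unsigned int maxphyaddr)
          {
                  u64 rsvd = ~0ULL << maxphyaddr;  /* e.g. bits 46..63 for maxphyaddr=46 */

                  return (cr3 & rsvd) != 0;
          }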
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry · 9ed38ffa
      Committed by Ladi Prosek
      Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
      as implemented in kvm_set_cr3, in several ways.
      
      * different rules are followed to check CR3 and it is desirable for the caller
      to distinguish between the possible failures
      * PDPTRs are not loaded if PAE paging and nested EPT are both enabled
      * many MMU operations are not necessary
      
      This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
      nested vmentry and vmexit, and makes use of it on the nested vmentry path.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: support restore of VMX capability MSRs · 62cc6b9d
      Committed by David Matlack
      The VMX capability MSRs advertise the set of features the KVM virtual
      CPU can support. This set of features varies across different host CPUs
      and KVM versions. This patch aims to address both sources of
      difference, allowing VMs to be migrated across CPUs and KVM versions
      without guest-visible changes to these MSRs. Note that cross-KVM-
      version migration is only supported from this point forward.
      
      When the VMX capability MSRs are restored, they are audited to check
      that the set of features advertised is a subset of what KVM and the
      CPU support.
      
      Since the VMX capability MSRs are read-only, they do not need to be on
      the default MSR save/restore lists. The userspace hypervisor can set
      the values of these MSRs or read them from KVM at VCPU creation time,
      and restore the same value after every save/restore.
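      A simplified sketch of the audit idea for a plain feature-bit MSR
      (hypothetical helper; the real VMX control MSRs encode allowed-0 and
      allowed-1 halves and need a more careful comparison):

          #include <linux/types.h>
          #include <linux/errno.h>

          /* Reject a restored value that advertises anything the host-side
           * KVM/CPU combination does not itself support. */
          static int restore_cap_msr(u64 host_supported, u64 user_val, u64 *shadow)
          {
                  if (user_val & ~host_supported)
                          return -EINVAL;

                  *shadow = user_val;
                  return 0;
          }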
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Add kvm_skip_emulated_instruction and use it. · 6affcbed
      Committed by Kyle Huey
      kvm_skip_emulated_instruction calls both
      kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
      skipping the emulated instruction and generating a trap if necessary.
      
      Replacing skip_emulated_instruction calls with
      kvm_skip_emulated_instruction is straightforward, except for:
      
      - ICEBP, which is already inside a trap, so avoid triggering another trap.
      - Instructions that can trigger exits to userspace, such as the IO insns,
        MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
        KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
        IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
        take precedence. The singlestep will be triggered again on the next
        instruction, which is the current behavior.
      - Task switch instructions which would require additional handling (e.g.
        the task switch bit) and are instead left alone.
      - Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
        which do not trigger singlestep traps as mentioned previously.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Add a return value to kvm_emulate_cpuid · 6a908b62
      Committed by Kyle Huey
      Once skipping the emulated instruction can potentially trigger an exit to
      userspace (via KVM_GUESTDBG_SINGLESTEP), kvm_emulate_cpuid will need to
      propagate a return value.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  16. 06 December 2016 (1 commit)
  17. 02 December 2016 (1 commit)
  18. 30 November 2016 (2 commits)