1. 27 Nov, 2018 1 commit
    • W
      KVM: X86: Fix scan ioapic use-before-initialization · e97f852f
      Wanpeng Li committed
      Reported by syzkaller:
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8
       PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 7 PID: 5059 Comm: debug Tainted: G           OE     4.19.0-rc5 #16
       RIP: 0010:__lock_acquire+0x1a6/0x1990
       Call Trace:
        lock_acquire+0xdb/0x210
        _raw_spin_lock+0x38/0x70
        kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
        vcpu_enter_guest+0x167e/0x1910 [kvm]
        kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
        kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
        do_vfs_ioctl+0xa5/0x690
        ksys_ioctl+0x6d/0x80
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x83/0x6e0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the testcase writes the hyperv synic HV_X64_MSR_SINT6 msr
      and triggers the scan ioapic logic to load synic vectors into the EOI exit
      bitmap. However, this simple testcase does not initialize the irqchip, so the
      ioapic/apic objects should not be accessed.
      This can be triggered by the following program:
      
          #define _GNU_SOURCE
      
          #include <endian.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <sys/types.h>
          #include <unistd.h>
      
          uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
      
          int main(void)
          {
          	syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
          	long res = 0;
          	memcpy((void*)0x20000040, "/dev/kvm", 9);
          	res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0);
          	if (res != -1)
          		r[0] = res;
          	res = syscall(__NR_ioctl, r[0], 0xae01, 0); /* KVM_CREATE_VM */
          	if (res != -1)
          		r[1] = res;
          	res = syscall(__NR_ioctl, r[1], 0xae41, 0); /* KVM_CREATE_VCPU */
          	if (res != -1)
          		r[2] = res;
          	memcpy(
          			(void*)0x20000080,
          			"\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00"
          			"\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43"
          			"\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33"
          			"\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe"
          			"\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22"
          			"\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb",
          			106);
          	syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080); /* KVM_SET_MSRS: index 0x40000096 */
          	syscall(__NR_ioctl, r[2], 0xae80, 0); /* KVM_RUN */
          	return 0;
          }
      
      This patch fixes it by bailing out of the ioapic scan if the ioapic is not
      initialized in the kernel.
      Reported-by: Wei Wu <ww9210@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wei Wu <ww9210@gmail.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e97f852f
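The fix described above can be pictured as an early-return guard. The following is a minimal sketch, not the kernel code: `mock_vcpu`, `mock_lapic`, `mock_apic_present` and `mock_scan_ioapic` are hypothetical stand-ins invented for illustration, and the real patch may check a different condition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's vCPU and local APIC objects. */
struct mock_lapic {
    bool hw_enabled;
};

struct mock_vcpu {
    struct mock_lapic *apic; /* NULL when no in-kernel irqchip was created */
    int scans;               /* counts how often the scan body actually ran */
};

/* True only when an in-kernel APIC exists and is enabled. */
static bool mock_apic_present(const struct mock_vcpu *vcpu)
{
    return vcpu->apic != NULL && vcpu->apic->hw_enabled;
}

/* Bail out before touching ioapic/apic state that was never set up. */
static void mock_scan_ioapic(struct mock_vcpu *vcpu)
{
    if (!mock_apic_present(vcpu))
        return;
    vcpu->scans++; /* real code would rebuild the EOI exit bitmap here */
}
```

With `apic` left NULL, as in the reproducer, the scan body is never reached, so no uninitialized lock is taken.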
  2. 29 Oct, 2018 1 commit
  3. 23 Oct, 2018 1 commit
    • R
      Revert "kvm: x86: optimize dr6 restore" · f9dcf08e
      Radim Krčmář committed
      This reverts commit 0e0a53c5.
      
      As Christian Ehrhardt noted:
      
        The most common case is that vcpu->arch.dr6 and the host's %dr6 value
        are not related at all because ->switch_db_regs is zero. To do this
        all correctly, we must handle the case where the guest leaves an arbitrary
        unused value in vcpu->arch.dr6 before disabling breakpoints again.
      
        However, this means that vcpu->arch.dr6 is not suitable to detect the
        need for a %dr6 clear.
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      f9dcf08e
  4. 18 Oct, 2018 5 commits
    • J
      kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD · c4f55198
      Jim Mattson committed
      This is a per-VM capability which can be enabled by userspace so that
      the faulting linear address will be included with the information
      about a pending #PF in L2, and the "new DR6 bits" will be included
      with the information about a pending #DB in L2. With this capability
      enabled, the L1 hypervisor can now intercept #PF before CR2 is
      modified. Under VMX, the L1 hypervisor can now intercept #DB before
      DR6 and DR7 are modified.
      
      When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should
      generally provide an appropriate payload when injecting a #PF or #DB
      exception via KVM_SET_VCPU_EVENTS. However, to support restoring old
      checkpoints, this payload is not required.
      
      Note that bit 16 of the "new DR6 bits" is set to indicate that a debug
      exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM
      region while advanced debugging of RTM transactional regions was
      enabled. This is the reverse of DR6.RTM, which is cleared in this
      scenario.
      
      This capability also enables exception.pending in struct
      kvm_vcpu_events, which allows userspace to distinguish between pending
      and injected exceptions.
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c4f55198
    • J
      kvm: vmx: Defer setting of DR6 until #DB delivery · f10c729f
      Jim Mattson committed
      When exception payloads are enabled by userspace (which is not yet
      possible) and a #DB is raised in L2, defer the setting of DR6 until
      later. Under VMX, this allows the L1 hypervisor to intercept the fault
      before DR6 is modified. Under SVM, DR6 is modified before L1 can
      intercept the fault (as has always been the case with DR7).
      
      Note that the payload associated with a #DB exception includes only
      the "new DR6 bits." When the payload is delivered, DR6.B0-B3 will be
      cleared and DR6.RTM will be set prior to merging in the new DR6 bits.
      
      Also note that bit 16 in the "new DR6 bits" is set to indicate that a
      debug exception (#DB) or a breakpoint exception (#BP) occurred inside
      an RTM region while advanced debugging of RTM transactional regions
      was enabled. Though the reverse of DR6.RTM, this makes the #DB payload
      field compatible with both the pending debug exceptions field under
      VMX and the exit qualification for #DB exceptions under VMX.
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f10c729f
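The DR6 merge rule in this message (clear B0-B3, set DR6.RTM, merge the new bits, with bit 16 of the payload carrying the reverse meaning of DR6.RTM) can be sketched as plain bit arithmetic. This is an illustration under those stated rules only; the function name `merge_db_payload` and the macro names are assumptions, not kernel identifiers.

```c
#include <stdint.h>

#define DR6_TRAP_BITS 0xfULL       /* B0-B3 */
#define DR6_RTM       (1ULL << 16) /* reads as 0 when #DB hit inside an RTM region */

/* Apply a #DB payload ("new DR6 bits") to an existing DR6 value. */
static uint64_t merge_db_payload(uint64_t dr6, uint64_t payload)
{
    dr6 &= ~DR6_TRAP_BITS;     /* clear B0-B3 */
    dr6 |= DR6_RTM;            /* set DR6.RTM before merging */
    dr6 |= payload;            /* merge in the new DR6 bits */
    dr6 ^= payload & DR6_RTM;  /* payload bit 16 set means DR6.RTM cleared */
    return dr6;
}
```

For example, a payload of `0x1` only sets B0, while a payload with bit 16 set leaves DR6.RTM cleared in the result.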
    • J
      kvm: x86: Defer setting of CR2 until #PF delivery · da998b46
      Jim Mattson committed
      When exception payloads are enabled by userspace (which is not yet
      possible) and a #PF is raised in L2, defer the setting of CR2 until
      the #PF is delivered. This allows the L1 hypervisor to intercept the
      fault before CR2 is modified.
      
      For backwards compatibility, when exception payloads are not enabled
      by userspace, kvm_multiple_exception modifies CR2 when the #PF
      exception is raised.
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da998b46
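The two behaviours described in this message (CR2 written when the #PF is raised versus when it is delivered) can be contrasted in a small model. Everything here is hypothetical scaffolding for illustration: `mock_vcpu`, `mock_queue_page_fault` and `mock_deliver_page_fault` are not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the state involved in queuing a #PF. */
struct mock_vcpu {
    uint64_t cr2;
    bool     payload_enabled; /* models the per-VM opt-in */
    bool     pending;
    bool     has_payload;
    uint64_t payload;
};

/* Raise a #PF: either stash the address as a payload or write CR2 now. */
static void mock_queue_page_fault(struct mock_vcpu *v, uint64_t fault_addr)
{
    v->pending = true;
    if (v->payload_enabled) {
        v->has_payload = true;
        v->payload = fault_addr; /* CR2 stays untouched for now */
    } else {
        v->has_payload = false;
        v->cr2 = fault_addr;     /* backwards-compatible behaviour */
    }
}

/* Deliver the #PF: only now does a deferred payload reach CR2. */
static void mock_deliver_page_fault(struct mock_vcpu *v)
{
    if (v->pending && v->has_payload)
        v->cr2 = v->payload;
    v->pending = false;
    v->has_payload = false;
}
```

In the deferred mode, anything that inspects `cr2` between the two calls still sees the old value, which is the window in which L1 can intercept the fault.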
    • J
      kvm: x86: Add payload operands to kvm_multiple_exception · 91e86d22
      Jim Mattson committed
      kvm_multiple_exception now takes two additional operands: has_payload
      and payload, so that updates to CR2 (and DR6 under VMX) can be delayed
      until the exception is delivered. This is necessary to properly
      emulate VMX or SVM hardware behavior for nested virtualization.
      
      The new behavior is triggered by
      vcpu->kvm->arch.exception_payload_enabled, which will (later) be set
      by a new per-VM capability, KVM_CAP_EXCEPTION_PAYLOAD.
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      91e86d22
    • J
      kvm: x86: Add exception payload fields to kvm_vcpu_events · 59073aaf
      Jim Mattson committed
      The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a
      later commit) adds the following fields to struct kvm_vcpu_events:
      exception_has_payload, exception_payload, and exception.pending.
      
      With this capability set, all of the details of vcpu->arch.exception,
      including the payload for a pending exception, are reported to
      userspace in response to KVM_GET_VCPU_EVENTS.
      
      With this capability clear, the original ABI is preserved, and the
      exception.injected field is set for either pending or injected
      exceptions.
      
      When userspace calls KVM_SET_VCPU_EVENTS with
      KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer
      translated to exception.pending. KVM_SET_VCPU_EVENTS can now only
      establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set.
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      59073aaf
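The reporting rule for KVM_GET_VCPU_EVENTS described above can be modelled as follows. This is a simplified sketch: the struct layouts and the function `mock_get_vcpu_events` are invented for illustration and do not match the real `struct kvm_vcpu_events`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical internal exception state. */
struct mock_exception {
    bool     pending;
    bool     injected;
    bool     has_payload;
    uint64_t payload;
};

/* Hypothetical view of what userspace is told. */
struct mock_events {
    bool     pending;
    bool     injected;
    bool     has_payload;
    uint64_t payload;
};

static struct mock_events
mock_get_vcpu_events(const struct mock_exception *ex, bool cap_enabled)
{
    struct mock_events ev = { 0 };

    if (cap_enabled) {
        /* Full detail: pending and injected are reported separately. */
        ev.pending = ex->pending;
        ev.injected = ex->injected;
        ev.has_payload = ex->has_payload;
        ev.payload = ex->payload;
    } else {
        /* Original ABI: a pending exception shows up as injected. */
        ev.injected = ex->pending || ex->injected;
    }
    return ev;
}
```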
  5. 17 Oct, 2018 9 commits
  6. 01 Oct, 2018 1 commit
  7. 21 Sep, 2018 1 commit
  8. 20 Sep, 2018 6 commits
    • D
      KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Drew Schmitt committed
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling access to reads of this MSR gives userspace the control to "expose"
      this platform-dependent information to guests in a clear way. As it exists
      today, guests that read this MSR would get unpopulated information if userspace
      hadn't already set it (and prior to this patch series, only the CPUID faulting
      information could have been populated). This existing interface could be
      confusing if guests don't handle the potential for incorrect/incomplete
      information gracefully (e.g. zero reported for base frequency).
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6fbbde9a
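One way to picture the access control this capability adds is a per-VM flag consulted on guest reads. The sketch below is a model under assumptions: the names `mock_vm` and `mock_get_platform_info` are hypothetical, and the choice to reject a disabled guest read with a nonzero return (standing in for a fault) while still allowing host-initiated reads is assumed for illustration rather than taken from the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM state. */
struct mock_vm {
    bool     guest_can_read_platform_info; /* toggled by userspace */
    uint64_t platform_info;                /* value set by userspace */
};

/*
 * Returns 0 and stores the MSR value on success,
 * or 1 when the read should be rejected.
 */
static int mock_get_platform_info(const struct mock_vm *vm,
                                  bool host_initiated, uint64_t *data)
{
    if (!host_initiated && !vm->guest_can_read_platform_info)
        return 1;
    *data = vm->platform_info;
    return 0;
}
```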
    • D
      KVM: x86: Turbo bits in MSR_PLATFORM_INFO · d84f1cff
      Drew Schmitt committed
      Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously, only
      the CPUID faulting bit was settable. Now any bit in MSR_PLATFORM_INFO is
      settable. This can be used, for example, to convey frequency information
      about the platform on which the guest is running.
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d84f1cff
    • L
      KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv · e6c67d8c
      Liran Alon committed
      In case L1 does not intercept L2 HLT or enters L2 in HLT activity-state,
      it is possible for a vCPU to be blocked while it is in guest-mode.
      
      According to Intel SDM 26.6.5 Interrupt-Window Exiting and
      Virtual-Interrupt Delivery: "These events wake the logical processor
      if it just entered the HLT state because of a VM entry".
      Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
      deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
      should be woken from the HLT state and injected with the interrupt.
      
      In addition, if while the vCPU is blocked (while it is in guest-mode),
      it receives a nested posted-interrupt, then the vCPU should also be
      woken and injected with the posted interrupt.
      
      To handle these cases, this patch enhances kvm_vcpu_has_events() to also
      check if there is a pending interrupt in L2 virtual APICv provided by
      L1. That is, it evaluates if there is a pending virtual interrupt for L2
      by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
      Evaluation of Pending Interrupts.
      
      Note that this also handles the case of nested posted-interrupt by the
      fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
      called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
      kvm_vcpu_running() -> vmx_check_nested_events() ->
      vmx_complete_nested_posted_interrupt().
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e6c67d8c
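The pending-interrupt test quoted in this message, RVI[7:4] > VPPR[7:4], compares only the priority class of the two values. A minimal sketch follows; the function name `mock_virtual_interrupt_pending` is hypothetical, and reading RVI and VPPR from the VMCS and the virtual-APIC page is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare only the priority class (bits 7:4) of RVI and VPPR. */
static bool mock_virtual_interrupt_pending(uint8_t rvi, uint8_t vppr)
{
    return (rvi & 0xf0) > (vppr & 0xf0);
}
```

A requesting vector in the same priority class as VPPR is therefore not deliverable, no matter what its low four bits are.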
    • S
      kvm: x86: make kvm_{load|put}_guest_fpu() static · 822f312d
      Sebastian Andrzej Siewior committed
      The functions
      	kvm_load_guest_fpu()
      	kvm_put_guest_fpu()
      
      are only used locally, make them static. This requires also that both
      functions are moved because they are used before their implementation.
      Those functions were exported (via EXPORT_SYMBOL) before commit
      e5bb4025 ("KVM: Drop kvm_{load,put}_guest_fpu() exports").
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      822f312d
    • S
      KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Sean Christopherson committed
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d264ee0c
    • J
      kvm: mmu: Don't read PDPTEs when paging is not enabled · d35b34a9
      Junaid Shahid committed
      kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
      CR4.PAE = 1.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d35b34a9
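The condition in this message can be expressed as a predicate over the control registers. This is a sketch, not the patched kernel code: `mock_should_read_pdptes` is a hypothetical helper, and the extra long-mode check is an assumption added here because PDPTE registers belong to PAE paging rather than 4-level paging.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_CR0_PG   (1ULL << 31) /* CR0.PG: paging enabled */
#define MOCK_CR4_PAE  (1ULL << 5)  /* CR4.PAE */
#define MOCK_EFER_LMA (1ULL << 10) /* EFER.LMA: long mode active */

/* PDPTEs are only meaningful with paging on, PAE on, and long mode off. */
static bool mock_should_read_pdptes(uint64_t cr0, uint64_t cr4, uint64_t efer)
{
    bool paging = (cr0 & MOCK_CR0_PG) != 0;
    bool pae = (cr4 & MOCK_CR4_PAE) != 0;
    bool long_mode = (efer & MOCK_EFER_LMA) != 0;

    return paging && pae && !long_mode;
}
```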
  9. 30 Aug, 2018 5 commits
    • S
      KVM: x86: Unexport x86_emulate_instruction() · c60658d1
      Sean Christopherson committed
      Allowing x86_emulate_instruction() to be called directly has led to
      subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
      in the emulation type.  While most of the blame lies on re-execute
      being opt-out, exporting x86_emulate_instruction() also exposes its
      cr2 parameter, which may have contributed to commit d391f120
      ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
      when running nested") using x86_emulate_instruction() instead of
      emulate_instruction() because "hey, I have a cr2!", which in turn
      introduced its EMULTYPE_NO_REEXECUTE bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      c60658d1
    • S
      KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() · 0ce97a2b
      Sean Christopherson committed
      Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
      specific function, and there's no conflict that prevents adding the
      kvm_ prefix.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      0ce97a2b
    • S
      KVM: x86: Do not re-{try,execute} after failed emulation in L2 · 6c3dfeb6
      Sean Christopherson committed
      Commit a6f177ef ("KVM: Reenter guest after emulation failure if
      due to access to non-mmio address") added reexecute_instruction() to
      handle the scenario where two (or more) vCPUS race to write a shadowed
      page, i.e. reexecute_instruction() is intended to return true if and
      only if the instruction being emulated was accessing a shadowed page.
      As L0 is only explicitly shadowing L1 tables, an emulation failure of
      a nested VM instruction cannot be due to a race to write a shadowed
      page and so should never be re-executed.
      
      This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
      but guaranteed to result in an infinite loop when TDP is enabled.
      Because "cr2" is actually an L2 GPA when TDP is enabled, calling
      kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
      case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
      and cause reexecute_instruction() to immediately return true.  The
      !mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this
      case because mmio_info_in_cache() returns false for a nested MMU (the
      MMIO caching currently handles L1 only, e.g. to cache nested guests'
      GPAs we'd have to manually flush the cache when switching between
      VMs and when L1 updated its page tables controlling the nested guest).
      
      Way back when, commit 68be0803 ("KVM: x86: never re-execute
      instruction with enabled tdp") changed reexecute_instruction() to
      always return false when using TDP under the assumption that KVM would
      only get into the emulator for MMIO.  Commit 95b3cf69 ("KVM: x86:
      let reexecute_instruction work for tdp") effectively reverted that
      behavior in order to handle the scenario where emulation failed due to
      an access from L1 to the shadow page tables for L2, but it didn't
      account for the case where emulation failed in L2 with TDP enabled.
      
      All of the above logic also applies to retry_instruction(), added by
      commit 1cb3f3ae ("KVM: x86: retry non-page-table writing
      instructions").  An indefinite loop in retry_instruction() should be
      impossible as it protects against retrying the same instruction over
      and over, but it's still correct to not retry an L2 instruction in
      the first place.
      
      Fix the immediate issue by adding a check for a nested guest when
      determining whether or not to allow retry in kvm_mmu_page_fault().
      In addition to fixing the immediate bug, add WARN_ON_ONCE in the
      retry functions since they are not designed to handle nested cases,
      i.e. they need to be modified even if there is some scenario in the
      future where we want to allow retrying a nested guest.
      
      [1] This issue was encountered after commit 3a2936de ("kvm: mmu:
          Don't expose private memslots to L2") changed the page fault path
          to return KVM_PFN_NOSLOT when translating an L2 access to a
          private memslot.  Returning KVM_PFN_NOSLOT is semantically correct
          when we want to hide a memslot from L2, i.e. there effectively is
          no defined memory region for L2, but it has the unfortunate side
          effect of making KVM think the GFN is a MMIO page, thus triggering
          emulation.  The failure occurred with in-development code that
          deliberately exposed a private memslot to L2, which L2 accessed
          with an instruction that is not emulated by KVM.
      
      Fixes: 95b3cf69 ("KVM: x86: let reexecute_instruction work for tdp")
      Fixes: 1cb3f3ae ("KVM: x86: retry non-page-table writing instructions")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      6c3dfeb6
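The nested-guest check described for kvm_mmu_page_fault() amounts to granting the retry permission only when the faulting access is not cached MMIO and the vCPU is not running a nested guest. The sketch below models that decision; the function name and the numeric flag value are illustrative assumptions, not the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative flag value; the real constant lives in the kernel headers. */
#define MOCK_EMULTYPE_ALLOW_RETRY (1 << 3)

/*
 * Decide which emulation type to pass to the emulator for a page fault.
 * Retry/re-execute is opt-in and is never granted for a nested (L2) guest.
 */
static int mock_page_fault_emulation_type(bool mmio_info_cached,
                                          bool guest_mode)
{
    int emulation_type = 0;

    if (!mmio_info_cached && !guest_mode)
        emulation_type = MOCK_EMULTYPE_ALLOW_RETRY;
    return emulation_type;
}
```

Because the default is zero, a new caller that forgets to think about retry gets the safe behaviour.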
    • S
      KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE · 384bf221
      Sean Christopherson committed
      retry_instruction() and reexecute_instruction() are a package deal,
      i.e. there is no scenario where one is allowed and the other is not.
      Merge their controlling emulation type flags to enforce this in code.
      Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly
      clear that we are allowing re{try,execute} to occur, as opposed to
      explicitly requesting retry of a previously failed instruction.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      384bf221
    • S
      KVM: x86: Invert emulation re-execute behavior to make it opt-in · 8065dbd1
      Sean Christopherson committed
      Re-execution of an instruction after emulation decode failure is
      intended to be used only when emulating shadow page accesses.  Invert
      the flag to make allowing re-execution opt-in since that behavior is
      by far in the minority.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      8065dbd1
  10. 23 Aug, 2018 1 commit
    • M
      mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko committed
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee a forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover
      majority of notifiers only care about a portion of the address space and
      there is absolutely zero reason to fail when we are unmapping an unrelated
      range.  Many notifiers do really block and wait for HW which is harder to
      handle and we have to bail out though.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continue as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g.  after the test faults in all
      the mmu notifier managed memory and set the hard limit to something really
      small.  Then we are looking for a proper process tear down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93065ac7
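The trylock pattern described in this message can be illustrated in userspace C. This is a sketch under assumptions: `mock_notifier` and `mock_invalidate_range_start` are hypothetical, a C11 `atomic_flag` stands in for the callback's lock, and returning `-EAGAIN` to signal "could not make progress without blocking" is chosen here for illustration.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical notifier guarded by a simple test-and-set lock. */
struct mock_notifier {
    atomic_flag lock;
    int invalidations; /* counts completed invalidations */
};

/*
 * When blockable is false the callback must not wait for the lock:
 * it tries once and reports -EAGAIN so the caller can retry later.
 */
static int mock_invalidate_range_start(struct mock_notifier *n, bool blockable)
{
    if (blockable) {
        while (atomic_flag_test_and_set(&n->lock))
            ; /* wait until the lock is free */
    } else if (atomic_flag_test_and_set(&n->lock)) {
        return -EAGAIN; /* lock is held, do not wait */
    }

    n->invalidations++;
    atomic_flag_clear(&n->lock);
    return 0;
}
```

A caller in the position of the oom_reaper would treat `-EAGAIN` as a cue to retry instead of declaring the victim done.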
  11. 22 Aug, 2018 1 commit
    • A
      x86: kvm: avoid unused variable warning · 7288bde1
      Arnd Bergmann committed
      Removing one of the two accesses of the maxphyaddr variable led to
      a harmless warning:
      
      arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask':
      arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable]
      
      Removing the #ifdef seems to be the nicest workaround, as it
      makes the code look cleaner than adding another #ifdef.
      
      Fixes: 28a1f3ac ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: stable@vger.kernel.org # L1TF
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7288bde1
  12. 15 Aug, 2018 1 commit
  13. 06 Aug, 2018 7 commits