1. 13 11月, 2019 4 次提交
  2. 21 9月, 2019 1 次提交
  3. 16 9月, 2019 1 次提交
    • P
      kvm: Check irqchip mode before assign irqfd · d5f65393
      Peter Xu 提交于
      [ Upstream commit 654f1f13ea56b92bacade8ce2725aea0457f91c0 ]
      
      When assigning kvm irqfd we didn't check the irqchip mode but we allow
      KVM_IRQFD to succeed with all the irqchip modes.  However it does not
      make much sense to create irqfd even without the kernel chips.  Let's
      provide a arch-dependent helper to check whether a specific irqfd is
      allowed by the arch.  At least for x86, it should make sense to check:
      
      - when irqchip mode is NONE, all irqfds should be disallowed, and,
      
      - when irqchip mode is SPLIT, irqfds that are with resamplefd should
        be disallowed.
      
      For either of the case, previously we'll silently ignore the irq or
      the irq ack event if the irqchip mode is incorrect.  However that can
      cause misterious guest behaviors and it can be hard to triage.  Let's
      fail KVM_IRQFD even earlier to detect these incorrect configurations.
      
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Radim Krčmář <rkrcmar@redhat.com>
      CC: Alex Williamson <alex.williamson@redhat.com>
      CC: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d5f65393
  4. 10 9月, 2019 2 次提交
    • A
      KVM: arm/arm64: VGIC: Properly initialise private IRQ affinity · b8727dff
      Andre Przywara 提交于
      [ Upstream commit 2e16f3e926ed48373c98edea85c6ad0ef69425d1 ]
      
      At the moment we initialise the target *mask* of a virtual IRQ to the
      VCPU it belongs to, even though this mask is only defined for GICv2 and
      quickly runs out of bits for many GICv3 guests.
      This behaviour triggers an UBSAN complaint for more than 32 VCPUs:
      ------
      [ 5659.462377] UBSAN: Undefined behaviour in virt/kvm/arm/vgic/vgic-init.c:223:21
      [ 5659.471689] shift exponent 32 is too large for 32-bit type 'unsigned int'
      ------
      Also for GICv3 guests the reporting of TARGET in the "vgic-state" debugfs
      dump is wrong, due to this very same problem.
      
      Because there is no requirement to create the VGIC device before the
      VCPUs (and QEMU actually does it the other way round), we can't safely
      initialise mpidr or targets in kvm_vgic_vcpu_init(). But since we touch
      every private IRQ for each VCPU anyway later (in vgic_init()), we can
      just move the initialisation of those fields into there, where we
      definitely know the VGIC type.
      
      On the way make sure we really have either a VGICv2 or a VGICv3 device,
      since the existing code is just checking for "VGICv3 or not", silently
      ignoring the uninitialised case.
      Signed-off-by: NAndre Przywara <andre.przywara@arm.com>
      Reported-by: NDave Martin <dave.martin@arm.com>
      Tested-by: NJulien Grall <julien.grall@arm.com>
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b8727dff
    • A
      KVM: arm/arm64: Only skip MMIO insn once · 111d36b6
      Andrew Jones 提交于
      [ Upstream commit 2113c5f62b7423e4a72b890bd479704aa85c81ba ]
      
      If after an MMIO exit to userspace a VCPU is immediately run with an
      immediate_exit request, such as when a signal is delivered or an MMIO
      emulation completion is needed, then the VCPU completes the MMIO
      emulation and immediately returns to userspace. As the exit_reason
      does not get changed from KVM_EXIT_MMIO in these cases we have to
      be careful not to complete the MMIO emulation again, when the VCPU is
      eventually run again, because the emulation does an instruction skip
      (and doing too many skips would be a waste of guest code :-) We need
      to use additional VCPU state to track if the emulation is complete.
      As luck would have it, we already have 'mmio_needed', which even
      appears to be used in this way by other architectures already.
      
      Fixes: 0d640732dbeb ("arm64: KVM: Skip MMIO insn after emulation")
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      111d36b6
  5. 06 9月, 2019 2 次提交
  6. 25 8月, 2019 1 次提交
    • M
      KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block · 8c7053d1
      Marc Zyngier 提交于
      commit 5eeaf10eec394b28fad2c58f1f5c3a5da0e87d1c upstream.
      
      Since commit commit 328e5664 ("KVM: arm/arm64: vgic: Defer
      touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or
      its GICv2 equivalent) loaded as long as we can, only syncing it
      back when we're scheduled out.
      
      There is a small snag with that though: kvm_vgic_vcpu_pending_irq(),
      which is indirectly called from kvm_vcpu_check_block(), needs to
      evaluate the guest's view of ICC_PMR_EL1. At the point were we
      call kvm_vcpu_check_block(), the vcpu is still loaded, and whatever
      changes to PMR is not visible in memory until we do a vcpu_put().
      
      Things go really south if the guest does the following:
      
      	mov x0, #0	// or any small value masking interrupts
      	msr ICC_PMR_EL1, x0
      
      	[vcpu preempted, then rescheduled, VMCR sampled]
      
      	mov x0, #ff	// allow all interrupts
      	msr ICC_PMR_EL1, x0
      	wfi		// traps to EL2, so samping of VMCR
      
      	[interrupt arrives just after WFI]
      
      Here, the hypervisor's view of PMR is zero, while the guest has enabled
      its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no
      interrupts are pending (despite an interrupt being received) and we'll
      block for no reason. If the guest doesn't have a periodic interrupt
      firing once it has blocked, it will stay there forever.
      
      To avoid this unfortuante situation, let's resync VMCR from
      kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block()
      will observe the latest value of PMR.
      
      This has been found by booting an arm64 Linux guest with the pseudo NMI
      feature, and thus using interrupt priorities to mask interrupts instead
      of the usual PSTATE masking.
      
      Cc: stable@vger.kernel.org # 4.12
      Fixes: 328e5664 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put")
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      8c7053d1
  7. 16 8月, 2019 1 次提交
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 2bc73d91
      Wanpeng Li 提交于
      commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream.
      
      After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
      five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
      on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() sustains to be true before finish_swait() is called in
      kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
      by kvm_vcpu_on_spin() loop greatly increases the probability condition
      kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
      is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
      vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2bc73d91
  8. 14 7月, 2019 1 次提交
    • D
      KVM: arm/arm64: vgic: Fix kvm_device leak in vgic_its_destroy · 512bbb11
      Dave Martin 提交于
      [ Upstream commit 4729ec8c1e1145234aeeebad5d96d77f4ccbb00a ]
      
      kvm_device->destroy() seems to be supposed to free its kvm_device
      struct, but vgic_its_destroy() is not currently doing this,
      resulting in a memory leak, resulting in kmemleak reports such as
      the following:
      
      unreferenced object 0xffff800aeddfe280 (size 128):
        comm "qemu-system-aar", pid 13799, jiffies 4299827317 (age 1569.844s)
        [...]
        backtrace:
          [<00000000a08b80e2>] kmem_cache_alloc+0x178/0x208
          [<00000000dcad2bd3>] kvm_vm_ioctl+0x350/0xbc0
      
      Fix it.
      
      Cc: Andre Przywara <andre.przywara@arm.com>
      Fixes: 1085fdc6 ("KVM: arm64: vgic-its: Introduce new KVM ITS device")
      Signed-off-by: NDave Martin <Dave.Martin@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      512bbb11
  9. 19 6月, 2019 1 次提交
  10. 09 6月, 2019 1 次提交
    • T
      KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID · 6a2fbec7
      Thomas Huth 提交于
      commit a86cb413f4bf273a9d341a3ab2c2ca44e12eb317 upstream.
      
      KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
      architectures. However, on s390x, the amount of usable CPUs is determined
      during runtime - it is depending on the features of the machine the code
      is running on. Since we are using the vcpu_id as an index into the SCA
      structures that are defined by the hardware (see e.g. the sca_add_vcpu()
      function), it is not only the amount of CPUs that is limited by the hard-
      ware, but also the range of IDs that we can use.
      Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too.
      So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
      code into the architecture specific code, and on s390x we have to return
      the same value here as for KVM_CAP_MAX_VCPUS.
      This problem has been discovered with the kvm_create_max_vcpus selftest.
      With this change applied, the selftest now passes on s390x, too.
      Reviewed-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Message-Id: <20190523164309.13345-9-thuth@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      6a2fbec7
  11. 26 5月, 2019 1 次提交
  12. 17 5月, 2019 1 次提交
  13. 04 5月, 2019 2 次提交
    • M
      KVM: arm/arm64: vgic-its: Take the srcu lock when parsing the memslots · e4705ae7
      Marc Zyngier 提交于
      [ Upstream commit 7494cec6cb3ba7385a6a223b81906384f15aae34 ]
      
      Calling kvm_is_visible_gfn() implies that we're parsing the memslots,
      and doing this without the srcu lock is frown upon:
      
      [12704.164532] =============================
      [12704.164544] WARNING: suspicious RCU usage
      [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G        W
      [12704.164573] -----------------------------
      [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      [12704.164602] other info that might help us debug this:
      [12704.164616] rcu_scheduler_active = 2, debug_locks = 1
      [12704.164631] 6 locks held by qemu-system-aar/13968:
      [12704.164644]  #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
      [12704.164691]  #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
      [12704.164726]  #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164761]  #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164794]  #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164827]  #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164861] stack backtrace:
      [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G        W         5.1.0-rc1-00008-g600025238f51-dirty #16
      [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
      [12704.164896] Call trace:
      [12704.164910]  dump_backtrace+0x0/0x138
      [12704.164920]  show_stack+0x24/0x30
      [12704.164934]  dump_stack+0xbc/0x104
      [12704.164946]  lockdep_rcu_suspicious+0xcc/0x110
      [12704.164958]  gfn_to_memslot+0x174/0x190
      [12704.164969]  kvm_is_visible_gfn+0x28/0x70
      [12704.164980]  vgic_its_check_id.isra.0+0xec/0x1e8
      [12704.164991]  vgic_its_save_tables_v0+0x1ac/0x330
      [12704.165001]  vgic_its_set_attr+0x298/0x3a0
      [12704.165012]  kvm_device_ioctl_attr+0x9c/0xd8
      [12704.165022]  kvm_device_ioctl+0x8c/0xf8
      [12704.165035]  do_vfs_ioctl+0xc8/0x960
      [12704.165045]  ksys_ioctl+0x8c/0xa0
      [12704.165055]  __arm64_sys_ioctl+0x28/0x38
      [12704.165067]  el0_svc_common+0xd8/0x138
      [12704.165078]  el0_svc_handler+0x38/0x78
      [12704.165089]  el0_svc+0x8/0xc
      
      Make sure the lock is taken when doing this.
      
      Fixes: bf308242 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
      Reviewed-by: NEric Auger <eric.auger@redhat.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NSasha Levin (Microsoft) <sashal@kernel.org>
      e4705ae7
    • M
      KVM: arm/arm64: vgic-its: Take the srcu lock when writing to guest memory · 0371fa03
      Marc Zyngier 提交于
      [ Upstream commit a6ecfb11bf37743c1ac49b266595582b107b61d4 ]
      
      When halting a guest, QEMU flushes the virtual ITS caches, which
      amounts to writing to the various tables that the guest has allocated.
      
      When doing this, we fail to take the srcu lock, and the kernel
      shouts loudly if running a lockdep kernel:
      
      [   69.680416] =============================
      [   69.680819] WARNING: suspicious RCU usage
      [   69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted
      [   69.682096] -----------------------------
      [   69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      [   69.683225]
      [   69.683225] other info that might help us debug this:
      [   69.683225]
      [   69.683975]
      [   69.683975] rcu_scheduler_active = 2, debug_locks = 1
      [   69.684598] 6 locks held by qemu-system-aar/4097:
      [   69.685059]  #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
      [   69.686087]  #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
      [   69.686919]  #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.687698]  #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.688475]  #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.689978]  #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.690729]
      [   69.690729] stack backtrace:
      [   69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18
      [   69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
      [   69.692831] Call trace:
      [   69.694072]  lockdep_rcu_suspicious+0xcc/0x110
      [   69.694490]  gfn_to_memslot+0x174/0x190
      [   69.694853]  kvm_write_guest+0x50/0xb0
      [   69.695209]  vgic_its_save_tables_v0+0x248/0x330
      [   69.695639]  vgic_its_set_attr+0x298/0x3a0
      [   69.696024]  kvm_device_ioctl_attr+0x9c/0xd8
      [   69.696424]  kvm_device_ioctl+0x8c/0xf8
      [   69.696788]  do_vfs_ioctl+0xc8/0x960
      [   69.697128]  ksys_ioctl+0x8c/0xa0
      [   69.697445]  __arm64_sys_ioctl+0x28/0x38
      [   69.697817]  el0_svc_common+0xd8/0x138
      [   69.698173]  el0_svc_handler+0x38/0x78
      [   69.698528]  el0_svc+0x8/0xc
      
      The fix is to obviously take the srcu lock, just like we do on the
      read side of things since bf308242. One wonders why this wasn't
      fixed at the same time, but hey...
      
      Fixes: bf308242 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NSasha Levin (Microsoft) <sashal@kernel.org>
      0371fa03
  14. 03 4月, 2019 1 次提交
  15. 24 3月, 2019 4 次提交
  16. 13 2月, 2019 3 次提交
  17. 17 1月, 2019 1 次提交
    • C
      KVM: arm/arm64: Fix VMID alloc race by reverting to lock-less · 4f14f446
      Christoffer Dall 提交于
      commit fb544d1ca65a89f7a3895f7531221ceeed74ada7 upstream.
      
      We recently addressed a VMID generation race by introducing a read/write
      lock around accesses and updates to the vmid generation values.
      
      However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
      does so without taking the read lock.
      
      As far as I can tell, this can lead to the same kind of race:
      
        VM 0, VCPU 0			VM 0, VCPU 1
        ------------			------------
        update_vttbr (vmid 254)
        				update_vttbr (vmid 1) // roll over
      				read_lock(kvm_vmid_lock);
      				force_vm_exit()
        local_irq_disable
        need_new_vmid_gen == false //because vmid gen matches
      
        enter_guest (vmid 254)
        				kvm_arch.vttbr = <PGD>:<VMID 1>
      				read_unlock(kvm_vmid_lock);
      
        				enter_guest (vmid 1)
      
      Which results in running two VCPUs in the same VM with different VMIDs
      and (even worse) other VCPUs from other VMs could now allocate clashing
      VMID 254 from the new generation as long as VCPU 0 is not exiting.
      
      Attempt to solve this by making sure vttbr is updated before another CPU
      can observe the updated VMID generation.
      
      Cc: stable@vger.kernel.org
      Fixes: f0cf47d9 "KVM: arm/arm64: Close VMID generation race"
      Reviewed-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NChristoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f14f446
  18. 10 1月, 2019 5 次提交
  19. 14 11月, 2018 2 次提交
    • M
      KVM: arm64: Fix caching of host MDCR_EL2 value · 59571785
      Mark Rutland 提交于
      commit da5a3ce66b8bb51b0ea8a89f42aac153903f90fb upstream.
      
      At boot time, KVM stashes the host MDCR_EL2 value, but only does this
      when the kernel is not running in hyp mode (i.e. is non-VHE). In these
      cases, the stashed value of MDCR_EL2.HPMN happens to be zero, which can
      lead to CONSTRAINED UNPREDICTABLE behaviour.
      
      Since we use this value to derive the MDCR_EL2 value when switching
      to/from a guest, after a guest have been run, the performance counters
      do not behave as expected. This has been observed to result in accesses
      via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant
      counters, resulting in events not being counted. In these cases, only
      the fixed-purpose cycle counter appears to work as expected.
      
      Fix this by always stashing the host MDCR_EL2 value, regardless of VHE.
      
      Cc: Christopher Dall <christoffer.dall@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 1e947bad ("arm64: KVM: Skip HYP setup when already running in HYP")
      Tested-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      59571785
    • P
      KVM: arm/arm64: Ensure only THP is candidate for adjustment · 3e286d39
      Punit Agrawal 提交于
      commit fd2ef358 upstream.
      
      PageTransCompoundMap() returns true for hugetlbfs and THP
      hugepages. This behaviour incorrectly leads to stage 2 faults for
      unsupported hugepage sizes (e.g., 64K hugepage with 4K pages) to be
      treated as THP faults.
      
      Tighten the check to filter out hugetlbfs pages. This also leads to
      consistently mapping all unsupported hugepage sizes as PTE level
      entries at stage 2.
      Signed-off-by: NPunit Agrawal <punit.agrawal@arm.com>
      Reviewed-by: NSuzuki Poulose <suzuki.poulose@arm.com>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: stable@vger.kernel.org # v4.13+
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e286d39
  20. 07 9月, 2018 2 次提交
  21. 23 8月, 2018 1 次提交
    • M
      mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko 提交于
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee a forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover
      majority of notifiers only care about a portion of the address space and
      there is absolutely zero reason to fail when we are unmapping an unrelated
      range.  Many notifiers do really block and wait for HW which is harder to
      handle and we have to bail out though.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continue as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g.  after the test faults in all
      the mmu notifier managed memory and set the hard limit to something really
      small.  Then we are looking for a proper process tear down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: NDavid Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93065ac7
  22. 13 8月, 2018 2 次提交