1. 14 Nov, 2019 2 commits
    • KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast() · ed69a6cb
      Sean Christopherson committed
      Acquire the per-VM slots_lock when zapping all shadow pages as part of
      toggling nx_huge_pages.  The fast zap algorithm relies on exclusivity
      (via slots_lock) to identify obsolete vs. valid shadow pages, because it
      uses a single bit for its generation number. Holding slots_lock also
      obviates the need to acquire a read lock on the VM's srcu.
      
      Failing to take slots_lock when toggling nx_huge_pages allows multiple
      instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
      user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
      (kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
      temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
      to enforce exclusivity).
      
      Concurrent fast zap instances cause obsolete shadow pages to be
      incorrectly identified as valid due to the single bit generation number
      wrapping, which results in stale shadow pages being left in KVM's MMU
      and leads to all sorts of undesirable behavior.

      The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
      toggling nx_huge_pages via its module param.
      
      Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
      using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
      an ulong-sized generation instead of relying on exclusivity for
      correctness, but all callers except the recently added set_nx_huge_pages()
      needed to hold slots_lock anyways.  Therefore, this patch does not have
      to be backported to stable kernels.
      
      Given that toggling nx_huge_pages is by no means a fast path, force it
      to conform to the current approach instead of reintroducing the previous
      generation count.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
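The wrap described in the commit above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `toy_*` types and helpers are hypothetical and are not KVM's data structures. It shows why a one-bit generation is only safe when a single fast zap runs at a time.

```c
#include <stdbool.h>

/* Toy model only: these types and helpers are hypothetical, not KVM's. */
struct toy_mmu {
    unsigned int valid_gen : 1;   /* single-bit generation number */
};

struct toy_shadow_page {
    unsigned int gen : 1;         /* generation the page was created in */
};

/* A page is treated as obsolete when its generation differs from the current one. */
static bool toy_page_is_obsolete(const struct toy_mmu *mmu,
                                 const struct toy_shadow_page *sp)
{
    return sp->gen != mmu->valid_gen;
}

/* Starting a fast zap flips the generation; this assumes the caller is exclusive. */
static void toy_begin_fast_zap(struct toy_mmu *mmu)
{
    mmu->valid_gen ^= 1;
}
```

If a second zap begins before the first has finished removing obsolete pages, the bit flips back and a page from the old generation compares equal to the current one again, which is the misidentification the commit message describes.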
    • KVM: Forbid /dev/kvm being opened by a compat task when CONFIG_KVM_COMPAT=n · b9876e6d
      Marc Zyngier committed
      On a system without KVM_COMPAT, we prevent IOCTLs from being issued
      by a compat task. Although this prevents most silly things from
      happening, it can still confuse a 32bit userspace that is able
      to open the kvm device (the qemu test suite seems to be pretty
      mad with this behaviour).
      
      Take a more radical approach and return a -ENODEV to the compat
      task.
      Reported-by: Peter Maydell <peter.maydell@linaro.org>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 13 Nov, 2019 5 commits
    • KVM: X86: Reset the three MSR list number variables to 0 in kvm_init_msr_list() · 6cbee2b9
      Xiaoyao Li committed
      When commit 7a5ee6ed ("KVM: X86: Fix initialization of MSR
      lists") was applied, the now-useless conditionals were removed, but the
      three MSR list count variables were not reset to 0.
      
      Fixes: 7a5ee6ed (KVM: X86: Fix initialization of MSR lists)
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
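The effect of the missing reset can be sketched as follows. This is a minimal illustration, not the kernel's code: every `toy_*` name, the placeholder MSR values, and the support check are assumptions made for the example. A working list is rebuilt from a constant template, and the counter must start from 0 on every initialization.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_MSRS 4

/* Constant template list; the values are arbitrary placeholders. */
static const unsigned int toy_all_msrs[TOY_NUM_MSRS] = { 0x10, 0x20, 0x30, 0x40 };

/* Working list and its element count, rebuilt on every initialization. */
static unsigned int toy_msrs_to_save[TOY_NUM_MSRS];
static unsigned int toy_num_msrs_to_save;

/* Hypothetical capability probe: pretend 0x20 is unsupported. */
static bool toy_msr_supported(unsigned int msr)
{
    return msr != 0x20;
}

static void toy_init_msr_list(void)
{
    size_t i;

    /* Without this reset, a second call would append past the first result. */
    toy_num_msrs_to_save = 0;

    for (i = 0; i < TOY_NUM_MSRS; i++) {
        if (!toy_msr_supported(toy_all_msrs[i]))
            continue;
        toy_msrs_to_save[toy_num_msrs_to_save++] = toy_all_msrs[i];
    }
}
```

Calling `toy_init_msr_list()` twice, as would happen across a module unload and reload, leaves the same three entries each time because the counter is reset first.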
    • selftests: kvm: fix build with glibc >= 2.30 · e37f9f13
      Vitaly Kuznetsov committed
      glibc 2.30 gained a gettid() wrapper, so the selftests fail to compile:
      
      lib/assert.c:58:14: error: static declaration of ‘gettid’ follows non-static declaration
         58 | static pid_t gettid(void)
            |              ^~~~~~
      In file included from /usr/include/unistd.h:1170,
                       from include/test_util.h:18,
                       from lib/assert.c:10:
      /usr/include/bits/unistd_ext.h:34:16: note: previous declaration of ‘gettid’ was here
         34 | extern __pid_t gettid (void) __THROW;
            |                ^~~~~~
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
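One common way to avoid this kind of clash is to give the local helper a name that glibc does not declare and call the system call directly. The sketch below is Linux-only, and `example_gettid` is a hypothetical name chosen for illustration; it is not necessarily what the commit itself does.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* make sure syscall() is declared */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical name chosen so it cannot clash with glibc's gettid(). */
static pid_t example_gettid(void)
{
    /* Ask the kernel for the calling thread's ID directly. */
    return (pid_t)syscall(SYS_gettid);
}
```

Because the helper no longer shares a name with the libc declaration, it compiles the same way on glibc versions before and after 2.30.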
    • kvm: x86: disable shattered huge page recovery for PREEMPT_RT. · 13fb5927
      Paolo Bonzini committed
      If a huge page is recovered (and becomes non-executable) while another
      thread is executing it, the resulting contention on mmu_lock can cause
      latency spikes.  Disabling recovery for PREEMPT_RT kernels fixes this
      issue.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 8c5bd25b
      Linus Torvalds committed
      Pull kvm fixes from Paolo Bonzini:
       "Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts,
        DAX/ZONE_DEVICE, and module unload/reload"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
        KVM: VMX: Introduce pi_is_pir_empty() helper
        KVM: VMX: Do not change PID.NDST when loading a blocked vCPU
        KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts
        KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON
        KVM: X86: Fix initialization of MSR lists
        KVM: fix placement of refcount initialization
        KVM: Fix NULL-ptr deref after kvm_create_vm fails
    • Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · eb094f06
      Linus Torvalds committed
      Pull x86 TSX Async Abort and iTLB Multihit mitigations from Thomas Gleixner:
       "The performance deterioration department is not proud at all of
        presenting the seventh installment of speculation mitigations and
        hardware misfeature workarounds:
      
         1) TSX Async Abort (TAA) - 'The Annoying Affair'
      
            TAA is a hardware vulnerability that allows unprivileged
            speculative access to data which is available in various CPU
            internal buffers by using asynchronous aborts within an Intel TSX
            transactional region.
      
            The mitigation depends on a microcode update providing a new MSR
            which allows to disable TSX in the CPU. CPUs which have no
            microcode update can be mitigated by disabling TSX in the BIOS if
            the BIOS provides a tunable.
      
            Newer CPUs will have a bit set which indicates that the CPU is not
            vulnerable, but the MSR to disable TSX will be available
            nevertheless as it is an architected MSR. That means the kernel
            provides the ability to disable TSX on the kernel command line,
            which is useful as TSX is a truly useful mechanism to accelerate
            side channel attacks of all sorts.
      
         2) iTLB Multihit (NX) - 'No eXcuses'
      
            iTLB Multihit is an erratum where some Intel processors may incur
            a machine check error, possibly resulting in an unrecoverable CPU
            lockup, when an instruction fetch hits multiple entries in the
            instruction TLB. This can occur when the page size is changed
            along with either the physical address or cache type. A malicious
            guest running on a virtualized system can exploit this erratum to
            perform a denial of service attack.
      
            The workaround is that KVM marks huge pages in the extended page
            tables as not executable (NX). If the guest attempts to execute in
            such a page, the page is broken down into 4k pages which are
            marked executable. The workaround comes with a mechanism to
            recover these shattered huge pages over time.
      
        Both issues come with full documentation in the hardware
        vulnerabilities section of the Linux kernel user's and administrator's
        guide.
      
        Thanks to all patch authors and reviewers who had the extraordinary
        privilege to be exposed to this nuisance.
      
        Special thanks to Borislav Petkov for polishing the final TAA patch
        set and to Paolo Bonzini for shepherding the KVM iTLB workarounds and
        providing also the backports to stable kernels for those!"
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
        Documentation: Add ITLB_MULTIHIT documentation
        kvm: x86: mmu: Recovery of shattered NX large pages
        kvm: Add helper function for creating VM worker threads
        kvm: mmu: ITLB_MULTIHIT mitigation
        cpu/speculation: Uninline and export CPU mitigations helpers
        x86/cpu: Add Tremont to the cpu vulnerability whitelist
        x86/bugs: Add ITLB_MULTIHIT bug infrastructure
        x86/tsx: Add config options to set tsx=on|off|auto
        x86/speculation/taa: Add documentation for TSX Async Abort
        x86/tsx: Add "auto" option to the tsx= cmdline parameter
        kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
        x86/speculation/taa: Add sysfs reporting for TSX Async Abort
        x86/speculation/taa: Add mitigation for TSX Async Abort
        x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
        x86/cpu: Add a helper function x86_read_arch_cap_msr()
        x86/msr: Add the IA32_TSX_CTRL MSR
  3. 12 Nov, 2019 10 commits
    • KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved · a78986aa
      Sean Christopherson committed
      Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
      instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
      like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
      pages, e.g. put pages grabbed via gup().  But for flows such as setting
      A/D bits or shifting refcounts for transparent huge pages, KVM needs to
      avoid processing ZONE_DEVICE pages as the flows in question lack the
      underlying machinery for proper handling of ZONE_DEVICE pages.
      
      This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
      when running a KVM guest backed with /dev/dax memory, as KVM straight up
      doesn't put any references to ZONE_DEVICE pages acquired by gup().
      
      Note, Dan Williams proposed an alternative solution of doing put_page()
      on ZONE_DEVICE pages immediately after gup() in order to simplify the
      auditing needed to ensure is_zone_device_page() is called if and only if
      the backing device is pinned (via gup()).  But that approach would break
      kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
      unmap() when accessing guest memory, unlike KVM's secondary MMU, which
      coordinates with mmu_notifier invalidations to avoid creating stale
      page references, i.e. doesn't rely on pages being pinned.
      
      [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Analyzed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: stable@vger.kernel.org
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Introduce pi_is_pir_empty() helper · 29881b6e
      Joao Martins committed
      Streamline the PID.PIR check and change its call sites to use
      the newly added helper.
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Do not change PID.NDST when loading a blocked vCPU · 132194ff
      Joao Martins committed
      When a vCPU enters the block phase, pi_pre_block() inserts it into a
      per-pCPU linked list of all vCPUs that are blocked on this pCPU.
      Afterwards, it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR, whose handler
      (wakeup_handler()) is responsible for kicking (unblocking) any vCPU on
      that linked list that now has pending posted interrupts.

      While the vCPU is blocked (in kvm_vcpu_block()), it may be preempted,
      which causes vmx_vcpu_pi_put() to set PID.SN.  If the vCPU is later
      scheduled to run on a different pCPU, vmx_vcpu_pi_load() clears
      PID.SN but also *overwrites PID.NDST with this different pCPU*,
      instead of keeping the original pCPU on which the vCPU entered the
      block phase.

      This is a problem because, when a posted interrupt is delivered,
      wakeup_handler() runs and fails to find the blocked vCPU on its
      per-pCPU linked list of all vCPUs that are blocked on this pCPU.
      The vCPU was placed on a *different* per-pCPU linked list, i.e. that
      of the original pCPU on which it entered the block phase.
      
      The regression is introduced by commit c112b5f5 ("KVM: x86:
      Recompute PID.ON when clearing PID.SN"). Therefore, partially revert
      it and reintroduce the condition in vmx_vcpu_pi_load() responsible for
      avoiding changing PID.NDST when loading a blocked vCPU.
      
      Fixes: c112b5f5 ("KVM: x86: Recompute PID.ON when clearing PID.SN")
      Tested-by: Nathan Ni <nathan.ni@oracle.com>
      Co-developed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts · 9482ae45
      Joao Martins committed
      Commit 17e433b5 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
      introduced vmx_dy_apicv_has_pending_interrupt() in order to determine
      if a vCPU has a pending posted interrupt. This routine is used by
      kvm_vcpu_on_spin() when searching for a new runnable vCPU to schedule
      on a pCPU instead of a vCPU that is busy-looping.

      vmx_dy_apicv_has_pending_interrupt() determines if a
      vCPU has a pending posted interrupt solely based on PID.ON. However,
      when a vCPU is preempted, vmx_vcpu_pi_put() sets PID.SN, which causes
      raised posted interrupts to only set a bit in PID.PIR without setting
      PID.ON (and without sending a notification vector), as described in VT-d
      manual section 5.2.3 "Interrupt-Posting Hardware Operation".

      Therefore, checking PID.ON is insufficient to determine if a vCPU has
      pending posted interrupts; instead, we should also check whether
      any bit is set in PID.PIR if PID.SN=1.
      
      Fixes: 17e433b5 ("KVM: Fix leak vCPU's VMCS value into other pCPU")
      Reviewed-by: Jagannathan Raman <jag.raman@oracle.com>
      Co-developed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
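The condition described in the commit above can be sketched with a toy descriptor. The `toy_*` struct and helpers are hypothetical stand-ins for illustration and do not reproduce the real descriptor layout or KVM's accessors: an interrupt counts as pending if ON is set, or if SN is set and any PIR bit is set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a posted-interrupt descriptor; the layout is illustrative only. */
struct toy_pi_desc {
    uint32_t pir[8];   /* 256 pending-vector bits */
    bool on;           /* Outstanding Notification */
    bool sn;           /* Suppress Notification */
};

/* True when no vector bit is set in the PIR bitmap. */
static bool toy_pir_empty(const struct toy_pi_desc *pd)
{
    size_t i;

    for (i = 0; i < sizeof(pd->pir) / sizeof(pd->pir[0]); i++) {
        if (pd->pir[i])
            return false;
    }
    return true;
}

/* ON alone is not enough: with SN set, interrupts only show up in PIR. */
static bool toy_has_pending_interrupt(const struct toy_pi_desc *pd)
{
    return pd->on || (pd->sn && !toy_pir_empty(pd));
}
```

With SN set and a single PIR bit raised, the toy check reports a pending interrupt even though ON is clear, which is the case that a check based only on ON would miss.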
    • KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON · d9ff2744
      Liran Alon committed
      The Outstanding Notification (ON) bit is part of the Posted Interrupt
      Descriptor (PID) as opposed to the Posted Interrupts Register (PIR).
      The latter is a bitmap for pending vectors.
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix initialization of MSR lists · 7a5ee6ed
      Chenyi Qiang committed
      The three MSR lists (msrs_to_save[], emulated_msrs[] and
      msr_based_features[]) are global arrays of kvm.ko, which are
      adjusted (supported MSRs are copied forward to overwrite the unsupported
      MSRs) when kvm-{intel,amd}.ko is loaded with insmod, but these three
      arrays are not reset to their initial values when kvm-{intel,amd}.ko is
      removed with rmmod. Thus, at the next load, kvm-{intel,amd}.ko operates
      on the modified arrays, with some MSRs lost and some MSRs duplicated.
      
      So define three constant arrays to hold the initial MSR lists and
      initialize msrs_to_save[], emulated_msrs[] and msr_based_features[]
      based on the constant arrays.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      [Remove now useless conditionals. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge Intel Gen8/Gen9 graphics fixes from Jon Bloomfield. · 100d46bd
      Linus Torvalds committed
      This fixes two different classes of bugs in the Intel graphics hardware:
      
      MMIO register read hang:
       "On Intel's Gen8 and Gen9 Graphics hardware, a read of specific graphics
        MMIO registers when the product is in certain low power states causes
        a system hang.
      
        There are two potential triggers for DoS:
          a) H/W corruption of the RC6 save/restore vector
          b) Hard hang within the MIPI hardware
      
        This prevents the DoS in two areas of the hardware:
          1) Detect corruption of RC6 address on exit from low-power state,
             and if we find it corrupted, disable RC6 and RPM
          2) Permanently lower the MIPI MMIO timeout"
      
      Blitter command streamer unrestricted memory accesses:
       "On Intel's Gen9 Graphics hardware the Blitter Command Streamer (BCS)
        allows writing to Memory Mapped Input Output (MMIO) that should be
        blocked. With modifications of page tables, this can lead to privilege
        escalation. This exposure is limited to the Guest Physical Address
        space and does not allow for access outside of the graphics virtual
        machine.
      
        This series establishes a software parser into the Blitter command
        stream to scan for, and prevent, reads or writes to MMIO's that should
        not be accessible to non-privileged contexts.
      
        Much of the command parser infrastructure has existed for some time,
        and is used on Ivybridge/Haswell/Valleyview derived products to allow
        the use of features normally blocked by hardware. In this legacy
        context, the command parser is employed to allow normally unprivileged
        submissions to be run with elevated privileges in order to grant
        access to a limited set of extra capabilities. In this mode the parser
        is optional; in the event that the parser finds any construct that it
        cannot properly validate (e.g. nested command buffers), it simply
        aborts the scan and submits the buffer in non-privileged mode.
      
        For Gen9 Graphics, this series makes the parser mandatory for all
        Blitter submissions. The incoming user buffer is first copied to a
        kernel owned buffer, and parsed. If all checks are successful the
        kernel owned buffer is mapped READ-ONLY and submitted on behalf of the
        user. If any checks fail, or the parser is unable to complete the scan
        (nested buffers), it is forcibly rejected. The successfully scanned
        buffer is executed with NORMAL user privileges (key difference from
        legacy usage).
      
        Modern usermode does not use the Blitter on later hardware, having
        switched over to using the 3D engine instead for performance reasons.
        There are however some legacy usermode apps that rely on Blitter,
        notably the SNA X-Server. There are no known usermode applications
        that require nested command buffers on the Blitter, so the forcible
        rejection of such buffers in this patch series is considered an
        acceptable limitation"
      
      * Intel graphics fixes in emailed bundle from Jon Bloomfield <jon.bloomfield@intel.com>:
        drm/i915/cmdparser: Fix jump whitelist clearing
        drm/i915/gen8+: Add RC6 CTX corruption WA
        drm/i915: Lower RM timeout to avoid DSI hard hangs
        drm/i915/cmdparser: Ignore Length operands during command matching
        drm/i915/cmdparser: Add support for backward jumps
        drm/i915/cmdparser: Use explicit goto for error paths
        drm/i915: Add gen9 BCS cmdparsing
        drm/i915: Allow parsing of unsized batches
        drm/i915: Support ro ppgtt mapped cmdparser shadow buffers
        drm/i915: Add support for mandatory cmdparsing
        drm/i915: Remove Master tables from cmdparser
        drm/i915: Disable Secure Batches for gen6+
        drm/i915: Rename gen7 cmdparser tables
    • Merge branch 'for-5.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · de620fb9
      Linus Torvalds committed
      Pull cgroup fix from Tejun Heo:
       "There's an inadvertent preemption point in ptrace_stop() which was
        reliably triggering for a test scenario significantly slowing it down.
      
        This contains Oleg's fix to remove the unwanted preemption point"
      
      * 'for-5.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ptrace_stop()
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 72d5ac67
      Linus Torvalds committed
      Pull SCSI fixes from James Bottomley:
       "Three small changes: two in the core and one in the qla2xxx driver.
      
        The sg_tablesize fix affects a thinko in the migration to blk-mq of
        certain legacy drivers which could cause an oops and the sd core
        change should only affect zoned block devices which were wrongly
        suppressing error messages for reset all zones"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: core: Handle drivers which set sg_tablesize to zero
        scsi: qla2xxx: fix NPIV tear down process
        scsi: sd_zbc: Fix sd_zbc_complete()
    • drm/i915/cmdparser: Fix jump whitelist clearing · ea0b163b
      Ben Hutchings committed
      When a jump_whitelist bitmap is reused, it needs to be cleared.
      Currently this is done with memset() and the size calculation assumes
      bitmaps are made of 32-bit words, not longs.  So on 64-bit
      architectures, only the first half of the bitmap is cleared.
      
      If some whitelist bits are carried over between successive batches
      submitted on the same context, this will presumably allow embedding
      the rogue instructions that we're trying to reject.
      
      Use bitmap_zero() instead, which gets the calculation right.
      
      Fixes: f8c08d8f ("drm/i915/cmdparser: Add support for backward jumps")
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Jon Bloomfield <jon.bloomfield@intel.com>
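The size mismatch described in the commit above can be sketched in user space. This is an illustration under assumptions: the `toy_*` helpers and macros are hypothetical, and the "buggy" expression is only one way such a mismatch can arise, not a quote of the driver code. The bitmap is stored in `unsigned long` words, so a byte count based on 32-bit words covers only half of it where `long` is 64 bits wide.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITS_TO_LONGS(n) \
    (((n) + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* Mistaken size: counts words as longs but sizes each one as 32 bits. */
static size_t toy_buggy_clear_bytes(size_t nbits)
{
    return TOY_BITS_TO_LONGS(nbits) * sizeof(uint32_t);
}

/* Correct size: every word of the bitmap is an unsigned long. */
static size_t toy_correct_clear_bytes(size_t nbits)
{
    return TOY_BITS_TO_LONGS(nbits) * sizeof(unsigned long);
}

/* Clear the whole bitmap, in the spirit of the kernel's bitmap_zero(). */
static void toy_bitmap_zero(unsigned long *map, size_t nbits)
{
    memset(map, 0, toy_correct_clear_bytes(nbits));
}
```

On a platform where `unsigned long` is 8 bytes, `toy_buggy_clear_bytes()` returns half of what `toy_correct_clear_bytes()` returns, so stale bits in the upper half would survive a reuse of the bitmap.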
  4. 11 Nov, 2019 15 commits
    • KVM: fix placement of refcount initialization · e2d3fcaf
      Paolo Bonzini committed
      Reported by syzkaller:
      
         =============================
         WARNING: suspicious RCU usage
         -----------------------------
         ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!
      
         other info that might help us debug this:
      
         rcu_scheduler_active = 2, debug_locks = 1
         no locks held by repro_11/12688.
      
         stack backtrace:
         Call Trace:
          dump_stack+0x7d/0xc5
          lockdep_rcu_suspicious+0x123/0x170
          kvm_dev_ioctl+0x9a9/0x1260 [kvm]
          do_vfs_ioctl+0x1a1/0xfb0
          ksys_ioctl+0x6d/0x80
          __x64_sys_ioctl+0x73/0xb0
          do_syscall_64+0x108/0xaa0
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Commit a97b0e77 (kvm: call kvm_arch_destroy_vm if vm creation fails)
      sets users_count to 1 before kvm_arch_init_vm(). However, if kvm_arch_init_vm()
      fails, we need to decrease this count.  By moving it earlier, we can push
      the decrease to out_err_no_arch_destroy_vm without introducing yet another
      error label.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=15209b84e00000
      
      Reported-by: syzbot+75475908cd0910f141ee@syzkaller.appspotmail.com
      Fixes: a97b0e77 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
      Cc: Jim Mattson <jmattson@google.com>
      Analyzed-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fix NULL-ptr deref after kvm_create_vm fails · 8a44119a
      Paolo Bonzini committed
      Reported by syzkaller:
      
          kasan: CONFIG_KASAN_INLINE enabled
          kasan: GPF could be caused by NULL-ptr deref or user memory access
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0
          RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121
          Call Trace:
           kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline]
           kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494
           vfs_ioctl fs/ioctl.c:46 [inline]
           file_ioctl fs/ioctl.c:509 [inline]
           do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696
           ksys_ioctl+0x62/0x90 fs/ioctl.c:713
           __do_sys_ioctl fs/ioctl.c:720 [inline]
           __se_sys_ioctl fs/ioctl.c:718 [inline]
           __x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718
           do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Commit 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
      moved the memslots and buses allocations around. However, if initialization of
      kvm->srcu or kvm->irq_srcu fails, NULL is returned instead of an error code. The
      NULL is not caught in kvm_dev_ioctl_create_vm() and is then dereferenced by
      kvm_coalesced_mmio_init(). This patch fixes that.
      
      Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that
      was also reported by syzkaller:
      
       wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
       __synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921
       synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline]
       synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997
       kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212
       kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828
       kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579
       kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline]
      
      so do it.
      
      Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com
      Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com
      Fixes: 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Linux 5.4-rc7 · 31f4f5b4
      Linus Torvalds committed
    • Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 44866956
      Linus Torvalds committed
      Pull ARM SoC fixes from Olof Johansson:
       "A set of fixes that have trickled in over the last couple of weeks:
      
         - MAINTAINER update for Cavium/Marvell ThunderX2
      
         - stm32 tweaks to pinmux for Joystick/Camera, and RAM allocation for
           CAN interfaces
      
         - i.MX fixes for voltage regulator GPIO mappings, fixes voltage
           scaling issues
      
         - More i.MX fixes for various issues on i.MX eval boards: interrupt
           storm due to u-boot leaving pins in new states, fixing power button
           config, a couple of compatible-string corrections.
      
         - Powerdown and Suspend/Resume fixes for Allwinner A83-based tablets
      
         - A few documentation tweaks and a fix of a memory leak in the reset
           subsystem"
      
      * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        MAINTAINERS: update Cavium ThunderX2 maintainers
        ARM: dts: stm32: change joystick pinctrl definition on stm32mp157c-ev1
        ARM: dts: stm32: remove OV5640 pinctrl definition on stm32mp157c-ev1
        ARM: dts: stm32: Fix CAN RAM mapping on stm32mp157c
        ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157
        arm64: dts: zii-ultra: fix ARM regulator GPIO handle
        ARM: sunxi: Fix CPU powerdown on A83T
        ARM: dts: sun8i-a83t-tbs-a711: Fix WiFi resume from suspend
        arm64: dts: imx8mn: fix compatible string for sdma
        arm64: dts: imx8mm: fix compatible string for sdma
        reset: fix reset_control_ops kerneldoc comment
        ARM: dts: imx6-logicpd: Re-enable SNVS power key
        soc: imx: gpc: fix initialiser format
        ARM: dts: imx6qdl-sabreauto: Fix storm of accelerometer interrupts
        arm64: dts: ls1028a: fix a compatible issue
        reset: fix reset_control_get_exclusive kerneldoc comment
        reset: fix reset_control_lookup kerneldoc comment
        reset: fix of_reset_control_get_count kerneldoc comment
        reset: fix of_reset_simple_xlate kerneldoc comment
        reset: Fix memory leak in reset_control_array_put()
    • Merge tag 'staging-5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · dd892625
      Linus Torvalds committed
      Pull IIO fixes and staging driver from Greg KH:
       "Here is a mix of a number of IIO driver fixes for 5.4-rc7, and a whole
        new staging driver.
      
        The IIO fixes resolve some reported issues, all are tiny.
      
        The staging driver addition is the vboxsf filesystem, which is the
        VirtualBox guest shared folder code. Hans has been trying to get
        filesystem reviewers to review the code for many months now, and
        Christoph finally said to just merge it in staging now as it is
        stand-alone and the filesystem people can review it easier over time
        that way.
      
        I know it's late for this big of an addition, but it is stand-alone.
      
        The code has been in linux-next for a while, long enough to pick up a
        few tiny fixes for it already so people are looking at it.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'staging-5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: Fix error return code in vboxsf_fill_super()
        staging: vboxsf: fix dereference of pointer dentry before it is null checked
        staging: vboxsf: Remove unused including <linux/version.h>
        staging: Add VirtualBox guest shared folder (vboxsf) support
        iio: adc: stm32-adc: fix stopping dma
        iio: imu: inv_mpu6050: fix no data on MPU6050
        iio: srf04: fix wrong limitation in distance measuring
        iio: imu: adis16480: make sure provided frequency is positive
    • Merge tag 'char-misc-5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 3de2a3e9
      Linus Torvalds committed
      Pull char/misc driver fixes from Greg KH:
       "Here are a number of late-arrival driver fixes for issues reported for
        some char/misc drivers for 5.4-rc7
      
        These all come from the different subsystem/driver maintainers as
        things that they had reports for and wanted to see fixed.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'char-misc-5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        intel_th: pci: Add Jasper Lake PCH support
        intel_th: pci: Add Comet Lake PCH support
        intel_th: msu: Fix possible memory leak in mode_store()
        intel_th: msu: Fix overflow in shift of an unsigned int
        intel_th: msu: Fix missing allocation failure check on a kstrndup
        intel_th: msu: Fix an uninitialized mutex
        intel_th: gth: Fix the window switching sequence
        soundwire: slave: fix scanf format
        soundwire: intel: fix intel_register_dai PDI offsets and numbers
        interconnect: Add locking in icc_set_tag()
        interconnect: qcom: Fix icc_onecell_data allocation
        soundwire: depend on ACPI || OF
        soundwire: depend on ACPI
        thunderbolt: Drop unnecessary read when writing LC command in Ice Lake
        thunderbolt: Fix lockdep circular locking depedency warning
        thunderbolt: Read DP IN adapter first two dwords in one go
    • L
      Merge tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs · a5871fcb
      Linus Torvalds committed
      Pull configfs regression fix from Christoph Hellwig:
       "Fix a regression from this merge window in the configfs symlink
        handling (Honggang Li)"
      
      * tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs:
        configfs: calculate the depth of parent item
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9805a683
      Linus Torvalds committed
      Pull x86 fixes from Thomas Gleixner:
       "A small set of fixes for x86:
      
         - Make the tsc=reliable/nowatchdog command line parameter work again.
           It was broken with the introduction of the early TSC clocksource.
      
         - Prevent the evaluation of exception stacks before they are set up.
           This causes a crash in dumpstack because the stack walk termination
           gets screwed up.
      
         - Prevent a NULL pointer dereference in the resource control file
           system.
      
         - Avoid bogus warnings about APIC id mismatch related to the LDR
           which can happen when the LDR is not in use and therefore not
           initialized. Only evaluate that when the APIC is in logical
           destination mode"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/tsc: Respect tsc command line paraemeter for clocksource_tsc_early
        x86/dumpstack/64: Don't evaluate exception stacks before setup
        x86/apic/32: Avoid bogus LDR warnings
        x86/resctrl: Prevent NULL pointer dereference when reading mondata
    • L
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 621084cd
      Linus Torvalds committed
      Pull timer fixes from Thomas Gleixner:
       "A small set of fixes for timekeeping and clocksource drivers:
      
         - VDSO data was updated conditional on the availability of a VDSO
           capable clocksource. This causes the VDSO functions which do not
           depend on a VDSO capable clocksource to operate on stale data.
           Always update unconditionally.
      
         - Prevent a double free in the mediatek driver
      
         - Use the proper helper in the sh_mtu2 driver so it won't attempt to
           initialize non-existing interrupts"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        timekeeping/vsyscall: Update VDSO data unconditionally
        clocksource/drivers/sh_mtu2: Do not loop using platform_get_irq_by_name()
        clocksource/drivers/mediatek: Fix error handling
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 81388c2b
      Linus Torvalds committed
      Pull scheduler fixes from Thomas Gleixner:
       "Two fixes for scheduler regressions:
      
         - Plug a subtle race condition which was introduced with the rework
           of the next task selection functionality. The change of task
           properties became unprotected which can be observed inconsistently
           causing state corruption.
      
         - A trivial compile fix for CONFIG_CGROUPS=n"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched: Fix pick_next_task() vs 'change' pattern race
        sched/core: Fix compilation error when cgroup not selected
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b584a176
      Linus Torvalds committed
      Pull perf tooling fixes from Thomas Gleixner:
      
       - Fix the time sorting algorithm which was broken due to truncation of
         big numbers
      
       - Fix the python script generator failure caused by a broken tracepoint
         array iterator
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf tools: Fix time sorting
        perf tools: Remove unused trace_find_next_event()
        perf scripting engines: Iterate on tep event arrays directly
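
The pull message above says only that time sorting broke because big numbers were truncated. As a hedged illustration of that general class of bug (not the actual perf code), the C sketch below compares 64-bit timestamps in two ways. The function names `cmp_truncating` and `cmp_explicit` are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical comparator that subtracts two 64-bit timestamps and keeps
 * only the low 32 bits of the difference. Any difference that is a
 * multiple of 2^32 collapses to 0, so distinct timestamps compare equal.
 */
static int cmp_truncating(uint64_t a, uint64_t b)
{
    return (int)(uint32_t)(a - b);
}

/*
 * Comparator that never narrows the operands: it returns -1, 0 or 1
 * based on explicit comparisons of the full 64-bit values.
 */
static int cmp_explicit(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}
```

With timestamps `1ULL << 32` and `0`, the truncating version reports equality while the explicit version orders them correctly, which is the kind of mis-sort a truncated comparison can produce.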
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ffba65ea
      Linus Torvalds committed
      Pull irq fixlet from Thomas Gleixner:
       "A trivial fix for a kernel doc regression where an argument change was
        not reflected in the documentation"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irq/irqdomain: Update __irq_domain_alloc_fwnode() function documentation
    • L
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 20c7e296
      Linus Torvalds committed
      Pull stacktrace fix from Thomas Gleixner:
       "A small fix for a stacktrace regression.
      
        Saving a stacktrace for a foreign task skipped an extra entry which
        makes e.g. the output of /proc/$PID/stack incomplete"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        stacktrace: Don't skip first entry on noncurrent tasks
    • L
      Merge tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6 · 79a64063
      Linus Torvalds committed
      Pull cifs fix from Steve French:
       "Small fix for an smb3 reconnect bug (also marked for stable)"
      
      * tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
        SMB3: Fix persistent handles reconnect
    • C
      lib: Remove select of inexistant GENERIC_IO · 820b7c71
      Corentin Labbe committed
      The config option GENERIC_IO was removed but is still selected by
      lib/Kconfig. This patch finishes the cleanup.
      
      Fixes: 9de8da47 ("kconfig: kill off GENERIC_IO option")
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Corentin Labbe <clabbe@baylibre.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 10 Nov, 2019 3 commits
    • L
      Merge tag 'pinctrl-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 4763c089
      Linus Torvalds committed
      Pull pin control fixes from Linus Walleij:
      
       - Fix glitch risks in the Intel GPIO
      
       - Fix the Intel Cherryview valid irq mask calculation.
      
       - Allocate the Intel Cherryview irqchip dynamically.
      
       - Fix the valid mask init sequence on the ST STMFX driver.
      
      * tag 'pinctrl-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: stmfx: fix valid_mask init sequence
        pinctrl: cherryview: Allocate IRQ chip dynamic
        pinctrl: cherryview: Fix irq_valid_mask calculation
        pinctrl: intel: Avoid potential glitches if pin is in GPIO mode
    • L
      Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 00aff683
      Linus Torvalds committed
      Pull btrfs fixes from David Sterba:
       "A few regressions and fixes for stable.
      
        Regressions:
      
         - fix a race leading to metadata space leak after task received a
           signal
      
         - un-deprecate 2 ioctls, marked as deprecated by mistake
      
        Fixes:
      
         - fix limit check for number of devices during chunk allocation
      
         - fix a race due to double evaluation of i_size_read inside max()
           macro, can cause a crash
      
         - remove wrong device id check in tree-checker"
      
      * tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
        btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
        Btrfs: fix race leading to metadata space leak after task received signal
        btrfs: tree-checker: Fix wrong check on max devid
        btrfs: Consider system chunk array size for new SYSTEM chunks
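
The "double evaluation of i_size_read inside max()" item can be illustrated with a small stand-alone C sketch. This is not the btrfs code and not the kernel's own `max()` definition: `NAIVE_MAX`, `struct fake_inode`, `fake_size_read`, `racy_max` and `snapshot_max` are all hypothetical names, and the changing size is simulated by returning a different value on each read.

```c
#include <stddef.h>

/* Naive macro: whichever argument wins the comparison is evaluated twice. */
#define NAIVE_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Stand-in for an inode whose size changes between reads. */
struct fake_inode {
    const long *values; /* successive sizes a reader would observe */
    size_t n;           /* number of entries in values */
    size_t reads;       /* how many times the size has been read */
};

/* Each call returns the next simulated size (the last one repeats). */
static long fake_size_read(struct fake_inode *in)
{
    size_t idx = in->reads < in->n ? in->reads : in->n - 1;

    in->reads++;
    return in->values[idx];
}

/* Racy: the size is read once for the comparison and again for the result. */
static long racy_max(struct fake_inode *in, long floor)
{
    return NAIVE_MAX(fake_size_read(in), floor);
}

/* Safer: read the size once into a local, then compare the saved copy. */
static long snapshot_max(struct fake_inode *in, long floor)
{
    long size = fake_size_read(in);

    return NAIVE_MAX(size, floor);
}
```

If the simulated size shrinks from 100 to 10 between reads and the floor is 50, `racy_max` returns 10, a value below the floor that neither operand justified at comparison time, while `snapshot_max` returns 100. Saving the value first, as the commit title describes, removes the second read.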
    • L
      Merge tag 'linux-watchdog-5.4-rc7' of git://www.linux-watchdog.org/linux-watchdog · 4aba1a7e
      Linus Torvalds committed
      Pull watchdog fixes from Wim Van Sebroeck:
      
       - cpwd: fix build regression
      
       - pm8916_wdt: fix pretimeout registration flow
      
       - meson: Fix the wrong value of left time
      
       - imx_sc_wdt: Pretimeout should follow SCU firmware format
      
       - bd70528: Add MODULE_ALIAS to allow module auto loading
      
      * tag 'linux-watchdog-5.4-rc7' of git://www.linux-watchdog.org/linux-watchdog:
        watchdog: bd70528: Add MODULE_ALIAS to allow module auto loading
        watchdog: imx_sc_wdt: Pretimeout should follow SCU firmware format
        watchdog: meson: Fix the wrong value of left time
        watchdog: pm8916_wdt: fix pretimeout registration flow
        watchdog: cpwd: fix build regression
  6. 09 Nov, 2019 5 commits
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 0058b0a5
      Linus Torvalds committed
      Pull networking fixes from David Miller:
      
       1) BPF sample build fixes from Björn Töpel
      
       2) Fix powerpc bpf tail call implementation, from Eric Dumazet.
      
       3) DCCP leaks jiffies on the wire, fix also from Eric Dumazet.
      
       4) Fix crash in ebtables when using dnat target, from Florian Westphal.
      
       5) Fix port disable handling when removing bcm_sf2 driver, from Florian
          Fainelli.
      
       6) Fix kTLS sk_msg trim on fallback to copy mode, from Jakub Kicinski.
      
       7) Various KCSAN fixes all over the networking, from Eric Dumazet.
      
       8) Memory leaks in mlx5 driver, from Alex Vesker.
      
       9) SMC interface refcounting fix, from Ursula Braun.
      
      10) TSO descriptor handling fixes in stmmac driver, from Jose Abreu.
      
      11) Add a TX lock to synchronize the kTLS TX path properly with crypto
          operations. From Jakub Kicinski.
      
      12) Sock refcount during shutdown fix in vsock/virtio code, from Stefano
          Garzarella.
      
      13) Infinite loop in Intel ice driver, from Colin Ian King.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits)
        ixgbe: need_wakeup flag might not be set for Tx
        i40e: need_wakeup flag might not be set for Tx
        igb/igc: use ktime accessors for skb->tstamp
        i40e: Fix for ethtool -m issue on X722 NIC
        iavf: initialize ITRN registers with correct values
        ice: fix potential infinite loop because loop counter being too small
        qede: fix NULL pointer deref in __qede_remove()
        net: fix data-race in neigh_event_send()
        vsock/virtio: fix sock refcnt holding during the shutdown
        net: ethernet: octeon_mgmt: Account for second possible VLAN header
        mac80211: fix station inactive_time shortly after boot
        net/fq_impl: Switch to kvmalloc() for memory allocation
        mac80211: fix ieee80211_txq_setup_flows() failure path
        ipv4: Fix table id reference in fib_sync_down_addr
        ipv6: fixes rt6_probe() and fib6_nh->last_probe init
        net: hns: Fix the stray netpoll locks causing deadlock in NAPI path
        net: usb: qmi_wwan: add support for DW5821e with eSIM support
        CDC-NCM: handle incomplete transfer of MTU
        nfc: netlink: fix double device reference drop
        NFC: st21nfca: fix double free
        ...
    • L
      Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block · 5cb8418c
      Linus Torvalds committed
      Pull block fixes from Jens Axboe:
      
       - Two NVMe device removal crash fixes, and a compat fixup for an
         ioctl that was introduced in this release (Anton, Charles, Max - via
         Keith)
      
       - Missing error path mutex unlock for drbd (Dan)
      
       - cgroup writeback fixup on dead memcg (Tejun)
      
       - blkcg online stats print fix (Tejun)
      
      * tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
        cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
        block: drbd: remove a stray unlock in __drbd_send_protocol()
        blkcg: make blkcg_print_stat() print stats only for online blkgs
        nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
        nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
        nvme-rdma: fix a segmentation fault during module unload
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · a2582cdc
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Fixes 2019-11-08
      
      This series contains fixes to igb, igc, ixgbe, i40e, iavf and ice
      drivers.
      
      Colin Ian King fixes a potential counter wrap-around in a for-loop.
      
      Nick fixes the default ITR values for the iavf driver to 50 usecs
      interval.
      
      Arkadiusz fixes 'ethtool -m' for X722 devices where the correct value
      cannot be obtained from the firmware, so add X722 to the check to ensure
      the wrong value is not returned.
      
      Jake fixes igb and igc drivers in their implementation of launch time
      support by declaring skb->tstamp value as ktime_t instead of s64.
      
      Magnus fixes ixgbe and i40e where the need_wakeup flag for transmit may
      not be set for AF_XDP sockets that are only used to send packets.
      ====================
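
The wrap-around counter mentioned in the series description ("ice: fix potential infinite loop because loop counter being too small") is a generic C pitfall. The sketch below is not the ice driver code: `iterations_u8` and `iterations_u16` are hypothetical helpers, and the `guard` parameter exists only so the demonstration terminates.

```c
#include <stdint.h>

/*
 * Loop with an 8-bit counter. If limit exceeds 255 the counter wraps
 * back to 0 before it can reach the limit, so the loop never ends on
 * its own. Returns the iteration count, or -1 once guard is exceeded.
 */
static long iterations_u8(unsigned int limit, long guard)
{
    long iters = 0;

    for (uint8_t i = 0; i < limit; i++) {
        if (++iters > guard)
            return -1; /* would have spun forever without the guard */
    }
    return iters;
}

/* Same loop with a 16-bit counter, wide enough for limits up to 65535. */
static long iterations_u16(unsigned int limit, long guard)
{
    long iters = 0;

    for (uint16_t i = 0; i < limit; i++) {
        if (++iters > guard)
            return -1;
    }
    return iters;
}
```

With a limit of 300 and a guard of 1000, the 8-bit version trips the guard while the 16-bit version finishes after exactly 300 iterations. Widening the counter type is one way to fix such a loop; whether the driver fix took exactly that form is not stated in the text above.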
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      ixgbe: need_wakeup flag might not be set for Tx · 0843aa8f
      Magnus Karlsson committed
      The need_wakeup flag for Tx might not be set for AF_XDP sockets that
      are only used to send packets. This happens if there is at least one
      outstanding packet that has not been completed by the hardware and we
      get that corresponding completion (which will not generate an
      interrupt since interrupts are disabled in the napi poll loop) between
      the time we stopped processing the Tx completions and interrupts are
      enabled again. In this case, the need_wakeup flag will have been
      cleared at the end of the Tx completion processing as we believe we
      will get an interrupt from the outstanding completion at a later point
      in time. But if this completion interrupt occurs before interrupts
      are enabled, we lose it and should at that point really have set the
      need_wakeup flag since there are no more outstanding completions that
      can generate an interrupt to continue the processing. When this
      happens, user space will see a Tx queue need_wakeup of 0 and skip
      issuing a syscall, which means we will never get into the Tx processing
      again and we have a deadlock.
      
      This patch introduces a quick fix for this issue by just setting the
      need_wakeup flag for Tx to 1 all the time. I am working on a proper
      fix for this that will toggle the flag appropriately, but it is more
      challenging than I anticipated and I am afraid that this patch will
      not be completed before the merge window closes, therefore this easier
      fix for now. This fix has a negative performance impact in the range
      of 0% to 4%. Towards the higher end of the scale if you have driver
      and application on the same core and issue a lot of packets, and
      towards no negative impact if you use two cores, lower transmission
      speeds and/or a workload that also receives packets.
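
The user-space side of this behavior can be modeled in a few lines of C. This is a minimal sketch under stated assumptions, not driver or libbpf code: `struct tx_model`, `tx_needs_kick` and `tx_submit` are hypothetical, `RING_NEED_WAKEUP` is assumed to mirror the `XDP_RING_NEED_WAKEUP` bit from `<linux/if_xdp.h>`, and a counter stands in for the real syscall.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror XDP_RING_NEED_WAKEUP; redefined so the sketch builds
 * without kernel headers. */
#define RING_NEED_WAKEUP (1u << 0)

/* Hypothetical model of what user space sees for one Tx ring. */
struct tx_model {
    uint32_t flags;     /* ring flags word written by the driver */
    unsigned int kicks; /* number of simulated wakeup syscalls issued */
};

/* User space only enters the kernel when the driver asks for a wakeup. */
static bool tx_needs_kick(const struct tx_model *tx)
{
    return (tx->flags & RING_NEED_WAKEUP) != 0;
}

/* Simulated submit path: the counter stands in for a real syscall. */
static void tx_submit(struct tx_model *tx)
{
    if (tx_needs_kick(tx))
        tx->kicks++;
}
```

If the flag reads 0 while no completion interrupt is pending, `tx_submit` never kicks and Tx processing stalls, which is the deadlock described above. With the quick fix the driver keeps the flag at 1, so every submit kicks, at the cost of the extra syscalls behind the quoted 0% to 4% overhead.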
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • M
      i40e: need_wakeup flag might not be set for Tx · 70563957
      Magnus Karlsson committed
      The need_wakeup flag for Tx might not be set for AF_XDP sockets that
      are only used to send packets. This happens if there is at least one
      outstanding packet that has not been completed by the hardware and we
      get that corresponding completion (which will not generate an
      interrupt since interrupts are disabled in the napi poll loop) between
      the time we stopped processing the Tx completions and interrupts are
      enabled again. In this case, the need_wakeup flag will have been
      cleared at the end of the Tx completion processing as we believe we
      will get an interrupt from the outstanding completion at a later point
      in time. But if this completion interrupt occurs before interrupts
      are enabled, we lose it and should at that point really have set the
      need_wakeup flag since there are no more outstanding completions that
      can generate an interrupt to continue the processing. When this
      happens, user space will see a Tx queue need_wakeup of 0 and skip
      issuing a syscall, which means we will never get into the Tx processing
      again and we have a deadlock.
      
      This patch introduces a quick fix for this issue by just setting the
      need_wakeup flag for Tx to 1 all the time. I am working on a proper
      fix for this that will toggle the flag appropriately, but it is more
      challenging than I anticipated and I am afraid that this patch will
      not be completed before the merge window closes, therefore this easier
      fix for now. This fix has a negative performance impact in the range
      of 0% to 4%. Towards the higher end of the scale if you have driver
      and application on the same core and issue a lot of packets, and
      towards no negative impact if you use two cores, lower transmission
      speeds and/or a workload that also receives packets.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>