1. 12 9月, 2020 11 次提交
    • W
      KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit · 99b82a14
      Wanpeng Li 提交于
      According to SDM 27.2.4, Event delivery causes an APIC-access VM exit.
      Don't report internal error and freeze guest when event delivery causes
      an APIC-access exit, it is handleable and the event will be re-injected
      during the next vmentry.
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      99b82a14
    • W
      KVM: SVM: avoid emulation with stale next_rip · e42c6828
      Wanpeng Li 提交于
      svm->next_rip is reset in svm_vcpu_run() only after calling
      svm_exit_handlers_fastpath(), which will cause SVM's
      skip_emulated_instruction() to write a stale RIP.
      
      We can move svm_exit_handlers_fastpath towards the end of
      svm_vcpu_run().  To align VMX with SVM, keep svm_complete_interrupts()
      close as well.
      Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Cc: Paul K. <kronenpj@kronenpj.dyndns.org>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      [Also move vmcb_mark_all_clean before any possible write to the VMCB.
       - Paolo]
      e42c6828
    • V
      KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN · d831de17
      Vitaly Kuznetsov 提交于
      Even without in-kernel LAPIC we should allow writing '0' to
      MSR_KVM_ASYNC_PF_EN as we're not enabling the mechanism. In
      particular, QEMU with 'kernel-irqchip=off' fails to start
      a guest with
      
      qemu-system-x86_64: error: failed to set MSR 0x4b564d02 to 0x0
      
      Fixes: 9d3c447c ("KVM: X86: Fix async pf caused null-ptr-deref")
      Reported-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200911093147.484565-1-vkuznets@redhat.com>
      [Actually commit the version proposed by Sean Christopherson. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d831de17
    • D
      KVM: SVM: Periodically schedule when unregistering regions on destroy · 7be74942
      David Rientjes 提交于
      There may be many encrypted regions that need to be unregistered when a
      SEV VM is destroyed.  This can lead to soft lockups.  For example, on a
      host running 4.15:
      
      watchdog: BUG: soft lockup - CPU#206 stuck for 11s! [t_virtual_machi:194348]
      CPU: 206 PID: 194348 Comm: t_virtual_machi
      RIP: 0010:free_unref_page_list+0x105/0x170
      ...
      Call Trace:
       [<0>] release_pages+0x159/0x3d0
       [<0>] sev_unpin_memory+0x2c/0x50 [kvm_amd]
       [<0>] __unregister_enc_region_locked+0x2f/0x70 [kvm_amd]
       [<0>] svm_vm_destroy+0xa9/0x200 [kvm_amd]
       [<0>] kvm_arch_destroy_vm+0x47/0x200
       [<0>] kvm_put_kvm+0x1a8/0x2f0
       [<0>] kvm_vm_release+0x25/0x30
       [<0>] do_exit+0x335/0xc10
       [<0>] do_group_exit+0x3f/0xa0
       [<0>] get_signal+0x1bc/0x670
       [<0>] do_signal+0x31/0x130
      
      Although the CLFLUSH is no longer issued on every encrypted region to be
      unregistered, there are no other changes that can prevent soft lockups for
      very large SEV VMs in the latest kernel.
      
      Periodically schedule if necessary.  This still holds kvm->lock across the
      resched, but since this only happens when the VM is destroyed this is
      assumed to be acceptable.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Message-Id: <alpine.DEB.2.23.453.2008251255240.2987727@chino.kir.corp.google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7be74942
    • H
      KVM: MIPS: Change the definition of kvm type · 15e9e35c
      Huacai Chen 提交于
      MIPS defines two kvm types:
      
       #define KVM_VM_MIPS_TE          0
       #define KVM_VM_MIPS_VZ          1
      
      In Documentation/virt/kvm/api.rst it is said that "You probably want to
      use 0 as machine type", which implies that type 0 be the "automatic" or
      "default" type. And, in user-space libvirt use the null-machine (with
      type 0) to detect the kvm capability, which returns "KVM not supported"
      on a VZ platform.
      
      I try to fix it in QEMU but it is ugly:
      https://lists.nongnu.org/archive/html/qemu-devel/2020-08/msg05629.html
      
      And Thomas Huth suggests me to change the definition of kvm type:
      https://lists.nongnu.org/archive/html/qemu-devel/2020-09/msg03281.html
      
      So I define like this:
      
       #define KVM_VM_MIPS_AUTO        0
       #define KVM_VM_MIPS_VZ          1
       #define KVM_VM_MIPS_TE          2
      
      Since VZ and TE cannot co-exists, using type 0 on a TE platform will
      still return success (so old user-space tools have no problems on new
      kernels); the advantage is that using type 0 on a VZ platform will not
      return failure. So, the only problem is "new user-space tools use type
      2 on old kernels", but if we treat this as a kernel bug, we can backport
      this patch to old stable kernels.
      Signed-off-by: NHuacai Chen <chenhc@lemote.com>
      Message-Id: <1599734031-28746-1-git-send-email-chenhc@lemote.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      15e9e35c
    • L
      kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed · f6f6195b
      Lai Jiangshan 提交于
      When kvm_mmu_get_page() gets a page with unsynced children, the spt
      pagetable is unsynchronized with the guest pagetable. But the
      guest might not issue a "flush" operation on it when the pagetable
      entry is changed from zero or other cases. The hypervisor has the
      responsibility to synchronize the pagetables.
      
      KVM behaved as above for many years, But commit 8c8560b8
      ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      inadvertently included a line of code to change it without giving any
      reason in the changelog. It is clear that the commit's intention was to
      change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't
      needlessly flush other contexts; however, one of the hunks changed
      a nearby KVM_REQ_MMU_SYNC instead.  This patch changes it back.
      
      Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com>
      fixes: 8c8560b8 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f6f6195b
    • C
      KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control · c6b177a3
      Chenyi Qiang 提交于
      A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
      in exit_ctls_high.
      
      Fixes: 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
      VM-{Entry,Exit} control")
      Signed-off-by: NChenyi Qiang <chenyi.qiang@intel.com>
      Reviewed-by: NXiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c6b177a3
    • R
      KVM: fix memory leak in kvm_io_bus_unregister_dev() · f6588660
      Rustam Kovhaev 提交于
      when kmalloc() fails in kvm_io_bus_unregister_dev(), before removing
      the bus, we should iterate over all other devices linked to it and call
      kvm_iodevice_destructor() for them
      
      Fixes: 90db1043 ("KVM: kvm_io_bus_unregister_dev() should never fail")
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
      Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707Signed-off-by: NRustam Kovhaev <rkovhaev@gmail.com>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200907185535.233114-1-rkovhaev@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f6588660
    • H
      KVM: Check the allocation of pv cpu mask · 0f990222
      Haiwei Li 提交于
      check the allocation of per-cpu __pv_cpu_mask. Initialize ops only when
      successful.
      Signed-off-by: NHaiwei Li <lihaiwei@tencent.com>
      Message-Id: <d59f05df-e6d3-3d31-a036-cc25a2b2f33f@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0f990222
    • P
      KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected · 43fea4e4
      Peter Shier 提交于
      When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
      load_pdptrs to read the possibly updated PDPTEs from the guest
      physical address referenced by CR3.  It loads them into
      vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
      vcpu->arch.regs_dirty.
      
      At the subsequent assumed reentry into L2, the mmu will call
      vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
      VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
      VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
      if the L2 CRn write intercept always resumes L2.
      
      The resume path calls vmx_check_nested_events which checks for
      exceptions, MTF, and expired VMX preemption timers. If
      vmx_check_nested_events finds any of these conditions pending it will
      reflect the corresponding exit into L1. Live migration at this point
      would also cause a missed immediate reentry into L2.
      
      After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
      clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty.  When L2 next
      resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
      vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
      vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
      VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
      values stored at last L2 exit. A repro of this bug showed L2 entering
      triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
      
      When L2 is in PAE paging mode add a call to ept_load_pdptrs before
      leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
      vcpu->arch.walk_mmu->pdptrs[].
      
      Tested:
      kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
      Verified that test fails without the fix.
      
      Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
      custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
      would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
      failures.
      Signed-off-by: NPeter Shier <pshier@google.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Message-Id: <20200820230545.2411347-1-pshier@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      43fea4e4
    • P
      Merge tag 'kvmarm-fixes-5.9-1' of... · 1b67fd08
      Paolo Bonzini 提交于
      Merge tag 'kvmarm-fixes-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for Linux 5.9, take #1
      
      - Multiple stolen time fixes, with a new capability to match x86
      - Fix for hugetlbfs mappings when PUD and PMD are the same level
      - Fix for hugetlbfs mappings when PTE mappings are enforced
        (dirty logging, for example)
      - Fix tracing output of 64bit values
      1b67fd08
  2. 04 9月, 2020 3 次提交
  3. 22 8月, 2020 2 次提交
    • W
      KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set · b5331379
      Will Deacon 提交于
      When an MMU notifier call results in unmapping a range that spans multiple
      PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
      since this avoids running into RCU stalls during VM teardown. Unfortunately,
      if the VM is destroyed as a result of OOM, then blocking is not permitted
      and the call to the scheduler triggers the following BUG():
      
       | BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394
       | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
       | INFO: lockdep is turned off.
       | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
       | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
       | Call trace:
       |  dump_backtrace+0x0/0x284
       |  show_stack+0x1c/0x28
       |  dump_stack+0xf0/0x1a4
       |  ___might_sleep+0x2bc/0x2cc
       |  unmap_stage2_range+0x160/0x1ac
       |  kvm_unmap_hva_range+0x1a0/0x1c8
       |  kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
       |  __mmu_notifier_invalidate_range_start+0x218/0x31c
       |  mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
       |  __oom_reap_task_mm+0x128/0x268
       |  oom_reap_task+0xac/0x298
       |  oom_reaper+0x178/0x17c
       |  kthread+0x1e4/0x1fc
       |  ret_from_fork+0x10/0x30
      
      Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
      only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
      flags.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 8b3405e3 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: NWill Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-3-will@kernel.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b5331379
    • W
      KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Will Deacon 提交于
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: NWill Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fdfe7cbd
  4. 21 8月, 2020 6 次提交
  5. 18 8月, 2020 4 次提交
    • J
      kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode · cb957adb
      Jim Mattson 提交于
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Cc: Huaitong Han <huaitong.han@intel.com>
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NPeter Shier <pshier@google.com>
      Reviewed-by: NOliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-1-jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cb957adb
    • J
      kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode · 427890af
      Jim Mattson 提交于
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: 0be0226f ("KVM: MMU: fix SMAP virtualization")
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NPeter Shier <pshier@google.com>
      Reviewed-by: NOliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-2-jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      427890af
    • P
      KVM: x86: fix access code passed to gva_to_gpa · 19cf4b7e
      Paolo Bonzini 提交于
      The PK bit of the error code is computed dynamically in permission_fault
      and therefore need not be passed to gva_to_gpa: only the access bits
      (fetch, user, write) need to be passed down.
      
      Not doing so causes a splat in the pku test:
      
         WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm]
         Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020
         RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm]
         Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89
         RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202
         RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007
         RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000
         RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7
         R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
         R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000
         FS:  00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0
         DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         PKRU: 55555554
         Call Trace:
          paging64_gva_to_gpa+0x3f/0xb0 [kvm]
          kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm]
          handle_exception_nmi+0x4fc/0x5b0 [kvm_intel]
          kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm]
          kvm_vcpu_ioctl+0x23e/0x5d0 [kvm]
          ksys_ioctl+0x92/0xb0
          __x64_sys_ioctl+0x16/0x20
          do_syscall_64+0x3e/0xb0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
         ---[ end trace d17eb998aee991da ]---
      Reported-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Fixes: 89786147 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      19cf4b7e
    • Y
      selftests: kvm: Use a shorter encoding to clear RAX · 98b0bf02
      Yang Weijiang 提交于
      If debug_regs.c is built with newer binutils, the resulting binary is "optimized"
      by the assembler:
      
      asm volatile("ss_start: "
                   "xor %%rax,%%rax\n\t"
                   "cpuid\n\t"
                   "movl $0x1a0,%%ecx\n\t"
                   "rdmsr\n\t"
                   : : : "rax", "ecx");
      
      is translated to :
      
        000000000040194e <ss_start>:
        40194e:       31 c0                   xor    %eax,%eax     <----- rax->eax?
        401950:       0f a2                   cpuid
        401952:       b9 a0 01 00 00          mov    $0x1a0,%ecx
        401957:       0f 32                   rdmsr
      
      As you can see rax is replaced with eax in target binary code.
      This causes a difference is the length of xor instruction (2 Byte vs 3 Byte),
      and makes the hard-coded instruction length check fail:
      
              /* Instruction lengths starting at ss_start */
              int ss_size[4] = {
                      3,              /* xor */   <-------- 2 or 3?
                      2,              /* cpuid */
                      5,              /* mov */
                      2,              /* rdmsr */
              };
      
      Encode the shorter version directly and, while at it, fix the "clobbers"
      of the asm.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NYang Weijiang <weijiang.yang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      98b0bf02
  6. 17 8月, 2020 3 次提交
    • L
      Linux 5.9-rc1 · 9123e3a7
      Linus Torvalds 提交于
      9123e3a7
    • L
      Merge tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block · 2cc3c4b3
      Linus Torvalds 提交于
      Pull io_uring fixes from Jens Axboe:
       "A few differerent things in here.
      
        Seems like syzbot got some more io_uring bits wired up, and we got a
        handful of reports and the associated fixes are in here.
      
        General fixes too, and a lot of them marked for stable.
      
        Lastly, a bit of fallout from the async buffered reads, where we now
        more easily trigger short reads. Some applications don't really like
        that, so the io_read() code now handles short reads internally, and
        got a cleanup along the way so that it's now easier to read (and
        documented). We're now passing tests that failed before"
      
      * tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
        io_uring: short circuit -EAGAIN for blocking read attempt
        io_uring: sanitize double poll handling
        io_uring: internally retry short reads
        io_uring: retain iov_iter state over io_read/io_write calls
        task_work: only grab task signal lock when needed
        io_uring: enable lookup of links holding inflight files
        io_uring: fail poll arm on queue proc failure
        io_uring: hold 'ctx' reference around task_work queue + execute
        fs: RWF_NOWAIT should imply IOCB_NOIO
        io_uring: defer file table grabbing request cleanup for locked requests
        io_uring: add missing REQ_F_COMP_LOCKED for nested requests
        io_uring: fix recursive completion locking on oveflow flush
        io_uring: use TWA_SIGNAL for task_work uncondtionally
        io_uring: account locked memory before potential error case
        io_uring: set ctx sq/cq entry count earlier
        io_uring: Fix NULL pointer dereference in loop_rw_iter()
        io_uring: add comments on how the async buffered read retry works
        io_uring: io_async_buf_func() need not test page bit
      2cc3c4b3
    • M
      parisc: fix PMD pages allocation by restoring pmd_alloc_one() · 6f6aea7e
      Mike Rapoport 提交于
      Commit 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one()
      and pmd_free_one()") converted parisc to use generic version of
      pmd_alloc_one() but it missed the fact that parisc uses order-1 pages for
      PMD.
      
      Restore the original version of pmd_alloc_one() for parisc, just use
      GFP_PGTABLE_KERNEL that implies __GFP_ZERO instead of GFP_KERNEL and
      memset.
      
      Fixes: 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one()")
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Tested-by: NMeelis Roos <mroos@linux.ee>
      Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Link: https://lkml.kernel.org/r/9f2b5ebd-e4a4-0fa1-6cd3-4b9f6892d1ad@linux.eeSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f6aea7e
  7. 16 8月, 2020 10 次提交
    • L
      Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block · 4b6c093e
      Linus Torvalds 提交于
      Pull block fixes from Jens Axboe:
       "A few fixes on the block side of things:
      
         - Discard granularity fix (Coly)
      
         - rnbd cleanups (Guoqing)
      
         - md error handling fix (Dan)
      
         - md sysfs fix (Junxiao)
      
         - Fix flush request accounting, which caused an IO slowdown for some
           configurations (Ming)
      
         - Properly propagate loop flag for partition scanning (Lennart)"
      
      * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
        block: fix double account of flush request's driver tag
        loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
        rnbd: no need to set bi_end_io in rnbd_bio_map_kern
        rnbd: remove rnbd_dev_submit_io
        md-cluster: Fix potential error pointer dereference in resize_bitmaps()
        block: check queue's limits.discard_granularity in __blkdev_issue_discard()
        md: get sysfs entry after redundancy attr group create
      4b6c093e
    • L
      Merge tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · d84835b1
      Linus Torvalds 提交于
      Pull RISC-V fix from Palmer Dabbelt:
       "I collected a single fix during the merge window: we managed to break
        the early trap setup on !MMU, this fixes it"
      
      * tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Setup exception vector for nommu platform
      d84835b1
    • L
      Merge tag 'sh-for-5.9' of git://git.libc.org/linux-sh · 5bbec3cf
      Linus Torvalds 提交于
      Pull arch/sh updates from Rich Felker:
       "Cleanup, SECCOMP_FILTER support, message printing fixes, and other
        changes to arch/sh"
      
      * tag 'sh-for-5.9' of git://git.libc.org/linux-sh: (34 commits)
        sh: landisk: Add missing initialization of sh_io_port_base
        sh: bring syscall_set_return_value in line with other architectures
        sh: Add SECCOMP_FILTER
        sh: Rearrange blocks in entry-common.S
        sh: switch to copy_thread_tls()
        sh: use the generic dma coherent remap allocator
        sh: don't allow non-coherent DMA for NOMMU
        dma-mapping: consolidate the NO_DMA definition in kernel/dma/Kconfig
        sh: unexport register_trapped_io and match_trapped_io_handler
        sh: don't include <asm/io_trapped.h> in <asm/io.h>
        sh: move the ioremap implementation out of line
        sh: move ioremap_fixed details out of <asm/io.h>
        sh: remove __KERNEL__ ifdefs from non-UAPI headers
        sh: sort the selects for SUPERH alphabetically
        sh: remove -Werror from Makefiles
        sh: Replace HTTP links with HTTPS ones
        arch/sh/configs: remove obsolete CONFIG_SOC_CAMERA*
        sh: stacktrace: Remove stacktrace_ops.stack()
        sh: machvec: Modernize printing of kernel messages
        sh: pci: Modernize printing of kernel messages
        ...
      5bbec3cf
    • J
      io_uring: short circuit -EAGAIN for blocking read attempt · f91daf56
      Jens Axboe 提交于
      One case was missed in the short IO retry handling, and that's hitting
      -EAGAIN on a blocking attempt read (eg from io-wq context). This is a
      problem on sockets that are marked as non-blocking when created, they
      don't carry any REQ_F_NOWAIT information to help us terminate them
      instead of perpetually retrying.
      
      Fixes: 227c0c96 ("io_uring: internally retry short reads")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f91daf56
    • J
      io_uring: sanitize double poll handling · d4e7cd36
      Jens Axboe 提交于
      There's a bit of confusion on the matching pairs of poll vs double poll,
      depending on if the request is a pure poll (IORING_OP_POLL_ADD) or
      poll driven retry.
      
      Add io_poll_get_double() that returns the double poll waitqueue, if any,
      and io_poll_get_single() that returns the original poll waitqueue. With
      that, remove the argument to io_poll_remove_double().
      
      Finally ensure that wait->private is cleared once the double poll handler
      has run, so that remove knows it's already been seen.
      
      Cc: stable@vger.kernel.org # v5.8
      Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d4e7cd36
    • L
      Merge tag 'perf-tools-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux · 713eee84
      Linus Torvalds 提交于
      Pull more perf tools updates from Arnaldo Carvalho de Melo:
       "Fixes:
         - Fixes for 'perf bench numa'.
      
         - Always memset source before memcpy in 'perf bench mem'.
      
         - Quote CC and CXX for their arguments to fix build in environments
           using those variables to pass more than just the compiler names.
      
         - Fix module symbol processing, addressing regression detected via
           "perf test".
      
         - Allow multiple probes in record+script_probe_vfs_getname.sh 'perf
           test' entry.
      
        Improvements:
         - Add script to autogenerate socket family name id->string table from
           copy of kernel header, used so far in 'perf trace'.
      
         - 'perf ftrace' improvements to provide similar options for this
           utility so that one can go from 'perf record', 'perf trace', etc to
           'perf ftrace' just by changing the name of the subcommand.
      
         - Prefer new "sched:sched_waking" trace event when it exists in 'perf
           sched' post processing.
      
         - Update POWER9 metrics to utilize other metrics.
      
         - Fall back to querying debuginfod if debuginfo not found locally.
      
        Miscellaneous:
         - Sync various kvm headers with kernel sources"
      
      * tag 'perf-tools-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (40 commits)
        perf ftrace: Make option description initials all capital letters
        perf build-ids: Fall back to debuginfod query if debuginfo not found
        perf bench numa: Remove dead code in parse_nodes_opt()
        perf stat: Update POWER9 metrics to utilize other metrics
        perf ftrace: Add change log
        perf: ftrace: Add set_tracing_options() to set all trace options
        perf ftrace: Add option --tid to filter by thread id
        perf ftrace: Add option -D/--delay to delay tracing
        perf: ftrace: Allow set graph depth by '--graph-opts'
        perf ftrace: Add support for trace option tracing_thresh
        perf ftrace: Add option 'verbose' to show more info for graph tracer
        perf ftrace: Add support for tracing option 'irq-info'
        perf ftrace: Add support for trace option funcgraph-irqs
        perf ftrace: Add support for trace option sleep-time
        perf ftrace: Add support for tracing option 'func_stack_trace'
        perf tools: Add general function to parse sublevel options
        perf ftrace: Add option '--inherit' to trace children processes
        perf ftrace: Show trace column header
        perf ftrace: Add option '-m/--buffer-size' to set per-cpu buffer size
        perf ftrace: Factor out function write_tracing_file_int()
        ...
      713eee84
    • L
      Merge tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 50f6c7db
      Linus Torvalds 提交于
      Pull x86 fixes from Ingo Molnar:
       "Misc fixes and small updates all around the place:
      
         - Fix mitigation state sysfs output
      
         - Fix an FPU xstate/sxave code assumption bug triggered by
           Architectural LBR support
      
         - Fix Lightning Mountain SoC TSC frequency enumeration bug
      
         - Fix kexec debug output
      
         - Fix kexec memory range assumption bug
      
         - Fix a boundary condition in the crash kernel code
      
         - Optimize porgatory.ro generation a bit
      
         - Enable ACRN guests to use X2APIC mode
      
         - Reduce a __text_poke() IRQs-off critical section for the benefit of
           PREEMPT_RT"
      
      * tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/alternatives: Acquire pte lock with interrupts enabled
        x86/bugs/multihit: Fix mitigation reporting when VMX is not in use
        x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs
        x86/purgatory: Don't generate debug info for purgatory.ro
        x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC
        kexec_file: Correctly output debugging information for the PT_LOAD ELF header
        kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges
        x86/crash: Correct the address boundary of function parameters
        x86/acrn: Remove redundant chars from ACRN signature
        x86/acrn: Allow ACRN guest to use X2APIC mode
      50f6c7db
    • L
      Merge tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1195d58f
      Linus Torvalds 提交于
      Pull scheduler fixes from Ingo Molnar:
       "Two fixes: fix a new tracepoint's output value, and fix the formatting
        of show-state syslog printouts"
      
      * tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/debug: Fix the alignment of the show-state debug output
        sched: Fix use of count for nr_running tracepoint
      1195d58f
    • L
      Merge tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7f5faaaa
      Linus Torvalds 提交于
      Pull perf fixes from Ingo Molnar:
       "Misc fixes, an expansion of perf syscall access to CAP_PERFMON
        privileged tools, plus a RAPL HW-enablement for Intel SPR platforms"
      
      * tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/rapl: Add support for Intel SPR platform
        perf/x86/rapl: Support multiple RAPL unit quirks
        perf/x86/rapl: Fix missing psys sysfs attributes
        hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration
        kprobes: Remove show_registers() function prototype
        perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
      7f5faaaa
    • L
      Merge tag 'locking-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · eb1319af
      Linus Torvalds 提交于
      Pull locking fixlets from Ingo Molnar:
       "A documentation fix and a 'fallthrough' macro update"
      
      * tag 'locking-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        futex: Convert to use the preferred 'fallthrough' macro
        Documentation/locking/locktypes: Fix a typo
      eb1319af
  8. 15 8月, 2020 1 次提交
    • L
      Merge tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux · 410520d0
      Linus Torvalds 提交于
      Pull 9p updates from Dominique Martinet:
      
       - some code cleanup
      
       - a couple of static analysis fixes
      
       - setattr: try to pick a fid associated with the file rather than the
         dentry, which might sometimes matter
      
      * tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux:
        9p: Remove unneeded cast from memory allocation
        9p: remove unused code in 9p
        net/9p: Fix sparse endian warning in trans_fd.c
        9p: Fix memory leak in v9fs_mount
        9p: retrieve fid from file when file instance exist.
      410520d0