1. 29 3月, 2019 19 次提交
    • S
      KVM: doc: Document the life cycle of a VM and its resources · 919f6cd8
      Sean Christopherson 提交于
      The series to add memcg accounting to KVM allocations[1] states:
      
        There are many KVM kernel memory allocations which are tied to the
        life of the VM process and should be charged to the VM process's
        cgroup.
      
      While it is correct to account KVM kernel allocations to the cgroup of
      the process that created the VM, it's technically incorrect to state
      that the KVM kernel memory allocations are tied to the life of the VM
      process.  This is because the VM itself, i.e. struct kvm, is not tied to
      the life of the process which created it, rather it is tied to the life
      of its associated file descriptor.  In other words, kvm_destroy_vm() is
      not invoked until fput() decrements its associated file's refcount to
      zero.  A simple example is to fork() in Qemu and have the child sleep
      indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
      descriptor *and* the rogue child is killed.
      
      The allocations are guaranteed to be *accounted* to the process which
      created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
      ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
      the VM "alive" but can't do anything useful with its reference.
      
      Note that because 'struct kvm' also holds a reference to the mm_struct
      of its owner, the above behavior also applies to userspace allocations.
      
      Given that mucking with a VM's file descriptor can lead to subtle and
      undesirable behavior, e.g. memcg charges persisting after a VM is shut
      down, explicitly document a VM's lifecycle and its impact on the VM's
      resources.
      
      Alternatively, KVM could aggressively free resources when the creating
      process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
      isn't guaranteed to be available, and freeing resources when the creator
      exits is likely to be error prone and fragile as KVM would need to
      ensure that it only freed resources that are truly out of reach. In
      practice, the existing behavior shouldn't be problematic as a properly
      configured system will prevent a child process from being moved out of
      the appropriate cgroup hierarchy, i.e. prevent hiding the process from
      the OOM killer, and will prevent an unprivileged user from being able to
      to hold a reference to struct kvm via another method, e.g. debugfs.
      
      [1]https://patchwork.kernel.org/patch/10806707/Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      919f6cd8
    • S
      KVM: selftests: complete IO before migrating guest state · 0f73bbc8
      Sean Christopherson 提交于
      Documentation/virtual/kvm/api.txt states:
      
        NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
              KVM_EXIT_EPR the corresponding operations are complete (and guest
              state is consistent) only after userspace has re-entered the
              kernel with KVM_RUN.  The kernel side will first finish incomplete
              operations and then check for pending signals.  Userspace can
              re-enter the guest with an unmasked signal pending to complete
              pending operations.
      
      Because guest state may be inconsistent, starting state migration after
      an IO exit without first completing IO may result in test failures, e.g.
      a proposed change to KVM's handling of %rip in its fast PIO handling[1]
      will cause the new VM, i.e. the post-migration VM, to have its %rip set
      to the IN instruction that triggered KVM_EXIT_IO, leading to a test
      assertion due to a stage mismatch.
      
      For simplicitly, require KVM_CAP_IMMEDIATE_EXIT to complete IO and skip
      the test if it's not available.  The addition of KVM_CAP_IMMEDIATE_EXIT
      predates the state selftest by more than a year.
      
      [1] https://patchwork.kernel.org/patch/10848545/
      
      Fixes: fa3899ad ("kvm: selftests: add basic test for state save and restore")
      Reported-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0f73bbc8
    • S
      KVM: selftests: disable stack protector for all KVM tests · ffac839d
      Sean Christopherson 提交于
      Since 4.8.3, gcc has enabled -fstack-protector by default.  This is
      problematic for the KVM selftests as they do not configure fs or gs
      segments (the stack canary is pulled from fs:0x28).  With the default
      behavior, gcc will insert a stack canary on any function that creates
      buffers of 8 bytes or more.  As a result, ucall() will hit a triple
      fault shutdown due to reading a bad fs segment when inserting its
      stack canary, i.e. every test fails with an unexpected SHUTDOWN.
      
      Fixes: 14c47b75 ("kvm: selftests: introduce ucall")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ffac839d
    • S
      KVM: selftests: explicitly disable PIE for tests · 0a3f29b5
      Sean Christopherson 提交于
      KVM selftests embed the guest "image" as a function in the test itself
      and extract the guest code at runtime by manually parsing the elf
      headers.  The parsing is very simple and doesn't supporting fancy things
      like position independent executables.  Recent versions of gcc enable
      pie by default, which results in triple fault shutdowns in the guest due
      to the virtual address in the headers not matching up with the virtual
      address retrieved from the function pointer.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0a3f29b5
    • S
      KVM: selftests: assert on exit reason in CR4/cpuid sync test · 8df98ae0
      Sean Christopherson 提交于
      ...so that the test doesn't end up in an infinite loop if it fails for
      whatever reason, e.g. SHUTDOWN due to gcc inserting stack canary code
      into ucall() and attempting to derefence a null segment.
      
      Fixes: ca359066 ("kvm: selftests: add cr4_cpuid_sync_test")
      Cc: Wei Huang <wei@redhat.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8df98ae0
    • S
      KVM: x86: update %rip after emulating IO · 45def77e
      Sean Christopherson 提交于
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks becase re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to executing to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single step exits to userspace for are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcenment suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      45def77e
    • V
      x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init · 013cc6eb
      Vitaly Kuznetsov 提交于
      When userspace initializes guest vCPUs it may want to zero all supported
      MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/
      HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5 ("kvm/x86: Update SynIC
      timers on guest entry only") we began doing stimer_mark_pending()
      unconditionally on every config change.
      
      The issue I'm observing manifests itself as following:
      - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
        pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
      - kvm_hv_has_stimer_pending() starts returning true;
      - kvm_vcpu_has_events() starts returning true;
      - kvm_arch_vcpu_runnable() starts returning true;
      - when kvm_arch_vcpu_ioctl_run() gets into
        (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
        - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
          immediately, avoiding normal wait path;
        - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
          userspace to retry.
      
      So instead of normal wait path we get a busy loop on all secondary vCPUs
      before they get INIT signal. This seems to be undesirable, especially given
      that this happens even when Hyper-V extensions are not used.
      
      Generally, it seems to be pointless to mark an stimer as pending in
      stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing
      kvm_hv_process_stimers() will do is clear the corresponding bit. We may
      just not mark disabled timers as pending instead.
      
      Fixes: f3b138c5 ("kvm/x86: Update SynIC timers on guest entry only")
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      013cc6eb
    • X
      kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs · 2bdb76c0
      Xiaoyao Li 提交于
      Since MSR_IA32_ARCH_CAPABILITIES is emualted unconditionally even if
      host doesn't suppot it. We should move it to array emulated_msrs from
      arry msrs_to_save, to report to userspace that guest support this msr.
      Signed-off-by: NXiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2bdb76c0
    • S
      KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b
      Sean Christopherson 提交于
      The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: NXiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0cf9135b
    • S
      kvm: don't redefine flags as something else · ca0488aa
      Sebastian Andrzej Siewior 提交于
      The function irqfd_wakeup() has flags defined as __poll_t and then it
      has additional flags which is used for irqflags.
      
      Redefine the inner flags variable as iflags so it does not shadow the
      outer flags.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ca0488aa
    • B
      kvm: mmu: Used range based flushing in slot_handle_level_range · f285c633
      Ben Gardon 提交于
      Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
      in slot_handle_level_range. When range based flushes are not enabled
      kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
      
      This changes the behavior of many functions that indirectly use
      slot_handle_level_range, iff the range based flushes are enabled. The
      only potential problem I see with this is that kvm->tlbs_dirty will be
      cleared less often, however the only caller of slot_handle_level_range that
      checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
      checks it and does a kvm_flush_remote_tlbs after calling
      kvm_unmap_hva_range anyway.
      
      Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
      	without this patch. The patch introduced no new failures.
      Signed-off-by: NBen Gardon <bgardon@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f285c633
    • M
      KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported · 3d9683cf
      Masahiro Yamada 提交于
      I do not see any consistency about headers_install of <linux/kvm_para.h>
      and <asm/kvm_para.h>.
      
      According to my analysis of Linux 5.1-rc1, there are 3 groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          alpha, arm, hexagon, mips, powerpc, s390, sparc, x86
      
       [2] <asm/kvm_para.h> is exported, but <linux/kvm_para.h> is not
      
          arc, arm64, c6x, h8300, ia64, m68k, microblaze, nios2, openrisc,
          parisc, sh, unicore32, xtensa
      
       [3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          csky, nds32, riscv
      
      This does not match to the actual KVM support. At least, [2] is
      half-baked.
      
      Nor do arch maintainers look like they care about this. For example,
      commit 0add5371 ("microblaze: Add missing kvm_para.h to Kbuild")
      exported <asm/kvm_para.h> to user-space in order to fix an in-kernel
      build error.
      
      We have two ways to make this consistent:
      
       [A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all
           architectures, irrespective of the KVM support
      
       [B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h>
           to the KVM support
      
      My first attempt was [A] because the code looks cleaner, but Paolo
      suggested [B].
      
      So, this commit goes with [B].
      
      For most architectures, <asm/kvm_para.h> was moved to the kernel-space.
      I changed include/uapi/linux/Kbuild so that it checks generated
      asm/kvm_para.h as well as check-in ones.
      
      After this commit, there will be two groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          arm, arm64, mips, powerpc, s390, x86
      
       [2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          alpha, arc, c6x, csky, h8300, hexagon, ia64, m68k, microblaze,
          nds32, nios2, openrisc, parisc, riscv, sh, sparc, unicore32, xtensa
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: NCornelia Huck <cohuck@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3d9683cf
    • W
      KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c
      Wei Yang 提交于
      * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is
        non-zero.
      
      * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
        never return zero.
      
      Based on these two reasons, we can merge the two *if* clause and use the
      return value from kvm_mmu_calculate_mmu_pages() directly. This simplify
      the code and also eliminate the possibility for reader to believe
      nr_mmu_pages would be zero.
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4d66623c
    • K
      kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields · 711eff3a
      Krish Sadhukhan 提交于
      According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
      following check is performed on vmentry of L2 guests:
      
          On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP
          field and the IA32_SYSENTER_EIP field must each contain a canonical
          address.
      Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: NMihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      711eff3a
    • S
      KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · 05d5a486
      Singh, Brijesh 提交于
      Errata#1096:
      
      On a nested data page fault when CR.SMAP=1 and the guest data read
      generates a SMAP violation, GuestInstrBytes field of the VMCB on a
      VMEXIT will incorrectly return 0h instead the correct guest
      instruction bytes .
      
      Recommend Workaround:
      
      To determine what instruction the guest was executing the hypervisor
      will have to decode the instruction at the instruction pointer.
      
      The recommended workaround can not be implemented for the SEV
      guest because guest memory is encrypted with the guest specific key,
      and instruction decoder will not be able to decode the instruction
      bytes. If we hit this errata in the SEV guest then log the message
      and request a guest shutdown.
      Reported-by: NVenkatesh Srinivas <venkateshs@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      05d5a486
    • S
      KVM: Reject device ioctls from processes other than the VM's creator · ddba9180
      Sean Christopherson 提交于
      KVM's API requires thats ioctls must be issued from the same process
      that created the VM.  In other words, userspace can play games with a
      VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
      creator can do anything useful.  Explicitly reject device ioctls that
      are issued by a process other than the VM's creator, and update KVM's
      API documentation to extend its requirements to device ioctls.
      
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ddba9180
    • S
      KVM: doc: Fix incorrect word ordering regarding supported use of APIs · 5e124900
      Sean Christopherson 提交于
      Per Paolo[1], instantiating multiple VMs in a single process is legal;
      but this conflicts with KVM's API documentation, which states:
      
        The only supported use is one virtual machine per process, and one
        vcpu per thread.
      
      However, an earlier section in the documentation states:
      
         Only run VM ioctls from the same process (address space) that was used
         to create the VM.
      
      and:
      
         Only run vcpu ioctls from the same thread that was used to create the
         vcpu.
      
      This suggests that the conflicting documentation is simply an incorrect
      ordering of of words, i.e. what's really meant is that a virtual machine
      can't be shared across multiple processes and a vCPU can't be shared
      across multiple threads.
      
      Tweak the blurb on issuing ioctls to use a more assertive tone, and
      rewrite the "supported use" sentence to reference said blurb instead of
      poorly restating it in different terms.
      
      Opportunistically add missing punctuation.
      
      [1] https://lkml.kernel.org/r/f23265d4-528e-3bd4-011f-4d7b8f3281db@redhat.com
      
      Fixes: 9c1b96e3 ("KVM: Document basic API")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      [Improve notes on asynchronous ioctl]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5e124900
    • S
      KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' · 47c42e6b
      Sean Christopherson 提交于
      The cr4_pae flag is a bit of a misnomer, its purpose is really to track
      whether the guest PTE that is being shadowed is a 4-byte entry or an
      8-byte entry.  Prior to supporting nested EPT, the size of the gpte was
      reflected purely by CR4.PAE.  KVM fudged things a bit for direct sptes,
      but it was mostly harmless since the size of the gpte never mattered.
      Now that a spte may be tracking an indirect EPT entry, relying on
      CR4.PAE is wrong and ill-named.
      
      For direct shadow pages, force the gpte_size to '1' as they are always
      8-byte entries; EPT entries can only be 8-bytes and KVM always uses
      8-byte entries for NPT and its identity map (when running with EPT but
      not unrestricted guest).
      
      Likewise, nested EPT entries are always 8-bytes.  Nested EPT presents a
      unique scenario as the size of the entries are not dictated by CR4.PAE,
      but neither is the shadow page a direct map.  To handle this scenario,
      set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
      to denote a nested EPT shadow page.  Use the information to avoid
      incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
      
      Providing a consistent and accurate gpte_size fixes a bug reported by
      Vitaly where fast_cr3_switch() always fails when switching from L2 to
      L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
      whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Reported-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      47c42e6b
    • S
      KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT · 552c69b1
      Sean Christopherson 提交于
      Explicitly zero out quadrant and invalid instead of inheriting them from
      the root_mmu.  Functionally, this patch is a nop as we (should) never
      set quadrant for a direct mapped (EPT) root_mmu and nested EPT is only
      allowed if EPT is used for L1, and the root_mmu will never be invalid at
      this point.
      
      Explicitly setting flags sets the stage for repurposing the legacy
      paging bits in role, e.g. nxe, cr0_wp, and sm{a,e}p_andnot_wp, at which
      point 'smm' would be the only flag to be inherited from root_mmu.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      552c69b1
  2. 25 3月, 2019 13 次提交
    • L
      Linux 5.1-rc2 · 8c2ffd91
      Linus Torvalds 提交于
      8c2ffd91
    • L
      Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 17403fa2
      Linus Torvalds 提交于
      Pull ext4 fixes from Ted Ts'o:
       "Miscellaneous ext4 bug fixes for 5.1"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: prohibit fstrim in norecovery mode
        ext4: cleanup bh release code in ext4_ind_remove_space()
        ext4: brelse all indirect buffer in ext4_ind_remove_space()
        ext4: report real fs size after failed resize
        ext4: add missing brelse() in add_new_gdb_meta_bg()
        ext4: remove useless ext4_pin_inode()
        ext4: avoid panic during forced reboot
        ext4: fix data corruption caused by unaligned direct AIO
        ext4: fix NULL pointer dereference while journal is aborted
      17403fa2
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 231c807a
      Linus Torvalds 提交于
      Pull scheduler updates from Thomas Gleixner:
       "Third more careful attempt for this set of fixes:
      
         - Prevent a 32bit math overflow in the cpufreq code
      
         - Fix a buffer overflow when scanning the cgroup2 cpu.max property
      
         - A set of fixes for the NOHZ scheduler logic to prevent waking up
           CPUs even if the capacity of the busy CPUs is sufficient along with
           other tweaks optimizing the behaviour for asymmetric systems
           (big/little)"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/fair: Skip LLC NOHZ logic for asymmetric systems
        sched/fair: Tune down misfit NOHZ kicks
        sched/fair: Comment some nohz_balancer_kick() kick conditions
        sched/core: Fix buffer overflow in cgroup2 property cpu.max
        sched/cpufreq: Fix 32-bit math overflow
      231c807a
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 49ef0156
      Linus Torvalds 提交于
      Pull perf updates from Thomas Gleixner:
       "A larger set of perf updates.
      
        Not all of them are strictly fixes, but that's solely the tip
        maintainers fault as they let the timely -rc1 pull request fall
        through the cracks for various reasons including travel. So I'm
        sending this nevertheless because rebasing and distangling fixes and
        updates would be a mess and risky as well. As of tomorrow, a strict
        fixes separation is happening again. Sorry for the slip-up.
      
        Kernel:
      
         - Handle RECORD_MMAP vs. RECORD_MMAP2 correctly so different
           consumers of the mmap event get what they requested.
      
        Tools:
      
         - A larger set of updates to perf record/report/scripts vs. time
           stamp handling
      
         - More Python3 fixups
      
         - A pile of memory leak plumbing
      
         - perf BPF improvements and fixes
      
         - Finalize the perf.data directory storage"
      
      [ Note: the kernel part is strictly a fix, the updates are purely to
        tooling       - Linus ]
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
        perf bpf: Show more BPF program info in print_bpf_prog_info()
        perf bpf: Extract logic to create program names from perf_event__synthesize_one_bpf_prog()
        perf tools: Save bpf_prog_info and BTF of new BPF programs
        perf evlist: Introduce side band thread
        perf annotate: Enable annotation of BPF programs
        perf build: Check what binutils's 'disassembler()' signature to use
        perf bpf: Process PERF_BPF_EVENT_PROG_LOAD for annotation
        perf symbols: Introduce DSO_BINARY_TYPE__BPF_PROG_INFO
        perf feature detection: Add -lopcodes to feature-libbfd
        perf top: Add option --no-bpf-event
        perf bpf: Save BTF information as headers to perf.data
        perf bpf: Save BTF in a rbtree in perf_env
        perf bpf: Save bpf_prog_info information as headers to perf.data
        perf bpf: Save bpf_prog_info in a rbtree in perf_env
        perf bpf: Make synthesize_bpf_events() receive perf_session pointer instead of perf_tool
        perf bpf: Synthesize bpf events with bpf_program__get_prog_info_linear()
        bpftool: use bpf_program__get_prog_info_linear() in prog.c:do_dump()
        tools lib bpf: Introduce bpf_program__get_prog_info_linear()
        perf record: Replace option --bpf-event with --no-bpf-event
        perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
        ...
      49ef0156
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 19caf581
      Linus Torvalds 提交于
      Pull x86 fixes from Thomas Gleixner:
       "A set of x86 fixes:
      
         - Prevent potential NULL pointer dereferences in the HPET and HyperV
           code
      
         - Exclude the GART aperture from /proc/kcore to prevent kernel
           crashes on access
      
         - Use the correct macros for Cyrix I/O on Geode processors
      
         - Remove yet another kernel address printk leak
      
         - Announce microcode reload completion as requested by quite some
           people. Microcode loading has become popular recently.
      
         - Some 'Make Clang' happy fixlets
      
         - A few cleanups for recently added code"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/gart: Exclude GART aperture from kcore
        x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
        x86/mm/pti: Make local symbols static
        x86/cpu/cyrix: Remove {get,set}Cx86_old macros used for Cyrix processors
        x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
        x86/microcode: Announce reload operation's completion
        x86/hyperv: Prevent potential NULL pointer dereference
        x86/hpet: Prevent potential NULL pointer dereference
        x86/lib: Fix indentation issue, remove extra tab
        x86/boot: Restrict header scope to make Clang happy
        x86/mm: Don't leak kernel addresses
        x86/cpufeature: Fix various quality problems in the <asm/cpu_device_hd.h> header
      19caf581
    • L
      Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a75eda7b
      Linus Torvalds 提交于
      Pull timer fixes from Thomas Gleixner:
       "A set of small fixes plus the removal of stale board support code:
      
         - Remove the board support code from the clpx711x clocksource driver.
           This change had fallen through the cracks and I'm sending it now
           rather than dealing with people who want to improve that stale code
           for 3 month.
      
         - Use the proper clocksource mask on RICSV
      
         - Make local scope functions and variables static"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        clocksource/drivers/clps711x: Remove board support
        clocksource/drivers/riscv: Fix clocksource mask
        clocksource/drivers/mips-gic-timer: Make gic_compare_irqaction static
        clocksource/drivers/timer-ti-dm: Make omap_dm_timer_set_load_start() static
        clocksource/drivers/tcb_clksrc: Make tc_clksrc_suspend/resume() static
        clocksource/drivers/clps711x: Make clps711x_clksrc_init() static
        time/jiffies: Make refined_jiffies static
      a75eda7b
    • L
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · f6cc519b
      Linus Torvalds 提交于
      Pull locking fixes from Thomas Gleixner:
       "Two small fixes:
      
         - Cure a recently introduces error path hickup which tries to
           unregister a not registered lockdep key in te workqueue code
      
         - Prevent unaligned cmpxchg() crashes in the robust list handling
           code by sanity checking the user space supplied futex pointer"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        futex: Ensure that futex address is aligned in handle_futex_death()
        workqueue: Only unregister a registered lockdep key
      f6cc519b
    • L
      Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e08fef88
      Linus Torvalds 提交于
      Pull irq fixes from Thomas Gleixner:
       "A set of fixes for the interrupt subsystem:
      
         - Remove secondary GIC support on systems w/o device-tree support
      
         - A set of small fixlets in various irqchip drivers
      
         - static and fall-through annotations
      
         - Kernel doc and typo fixes"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Mark expected switch case fall-through
        genirq/devres: Remove excess parameter from kernel doc
        irqchip/irq-mvebu-sei: Make mvebu_sei_ap806_caps static
        irqchip/mbigen: Don't clear eventid when freeing an MSI
        irqchip/stm32: Don't set rising configuration registers at init
        irqchip/stm32: Don't clear rising/falling config registers at init
        dt-bindings: irqchip: renesas-irqc: Document r8a774c0 support
        irqchip/mmp: Make mmp_irq_domain_ops static
        irqchip/brcmstb-l2: Make two init functions static
        genirq: Fix typo in comment of IRQD_MOVE_PCNTXT
        irqchip/gic-v3-its: Fix comparison logic in lpi_range_cmp
        irqchip/gic: Drop support for secondary GIC in non-DT systems
        irqchip/imx-irqsteer: Fix of_property_read_u32() error handling
      e08fef88
    • L
      Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1ebf5afb
      Linus Torvalds 提交于
      Pull core fixes from Thomas Gleixner:
       "Two small fixes:
      
         - Move the large objtool_file struct off the stack so objtool works
           in setups with a tight stack limit.
      
         - Make a few variables static in the watchdog core code"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        watchdog/core: Make variables static
        objtool: Move objtool_file struct off the stack
      1ebf5afb
    • L
      Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux · 9fc13bbd
      Linus Torvalds 提交于
      Pull thermal management fixes from Zhang Rui:
      
       - Fix a wrong __percpu structure declaration in intel_powerclamp driver
         (Luc Van Oostenryck)
      
       - Fix truncated name of the idle injection kthreads created by
         intel_powerclamp driver (Zhang Rui)
      
       - Fix the missing UUID supports in int3400 thermal driver (Matthew
         Garrett)
      
       - Fix a crash when accessing the debugfs of bcm2835 SoC thermal driver
         (Phil Elwell)
      
       - A couple of trivial fixes/cleanups in some SoC thermal drivers
      
      * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
        thermal/intel_powerclamp: fix truncated kthread name
        thermal: mtk: Allocate enough space for mtk_thermal.
        thermal/int340x_thermal: fix mode setting
        thermal/int340x_thermal: Add additional UUIDs
        thermal: cpu_cooling: Remove unused cur_freq variable
        thermal: bcm2835: Fix crash in bcm2835_thermal_debugfs
        thermal: samsung: Fix incorrect check after code merge
        thermal/intel_powerclamp: fix __percpu declaration of worker_data
      9fc13bbd
    • L
      Merge tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 38104c00
      Linus Torvalds 提交于
      Pull smb3 fixes from Steve French:
      
       - two fixes for stable for guest mount problems with smb3.1.1
      
       - two fixes for crediting (SMB3 flow control) on resent requests
      
       - a byte range lock leak fix
      
       - two fixes for incorrect rc mappings
      
      * tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        cifs: update internal module version number
        SMB3: Fix SMB3.1.1 guest mounts to Samba
        cifs: Fix slab-out-of-bounds when tracing SMB tcon
        cifs: allow guest mounts to work for smb3.11
        fix incorrect error code mapping for OBJECTID_NOT_FOUND
        cifs: fix that return -EINVAL when do dedupe operation
        CIFS: Fix an issue with re-sending rdata when transport returning -EAGAIN
        CIFS: Fix an issue with re-sending wdata when transport returning -EAGAIN
      38104c00
    • L
      Merge tag 'auxdisplay-for-linus-v5.1-rc2' of git://github.com/ojeda/linux · e0046bb3
      Linus Torvalds 提交于
      Pull auxdisplay updates from Miguel Ojeda:
       "A few fixes and improvements for auxdisplay:
      
         - Series to fix a memory leak in hd44780 while introducing
           charlcd_free(). From Andy Shevchenko
      
         - Series to clean up the Kconfig menus and a couple of improvements
           for charlcd. From Mans Rullgard"
      
      * tag 'auxdisplay-for-linus-v5.1-rc2' of git://github.com/ojeda/linux:
        auxdisplay: charlcd: make backlight initial state configurable
        auxdisplay: charlcd: simplify init message display
        auxdisplay: deconfuse configuration
        auxdisplay: hd44780: Convert to use charlcd_free()
        auxdisplay: panel: Convert to use charlcd_free()
        auxdisplay: charlcd: Introduce charlcd_free() helper
        auxdisplay: charlcd: Move to_priv() to charlcd namespace
        auxdisplay: hd44780: Fix memory leak on ->remove()
      e0046bb3
    • L
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 1fa8109f
      Linus Torvalds 提交于
      Pull SCSI fixes from James Bottomley:
       "Six fixes to four drivers and two core fixes.
      
        One core fix simply corrects a missed destroy_rcu_head() but the other
        is hopefully the end of an ongoing effort to make suspend/resume play
        nicely with scsi quiesce"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: ibmvscsi: Fix empty event pool access during host removal
        scsi: ibmvscsi: Protect ibmvscsi_head from concurrent modificaiton
        scsi: hisi_sas: Add softreset in hisi_sas_I_T_nexus_reset()
        scsi: qla2xxx: Fix NULL pointer crash due to stale CPUID
        scsi: qla2xxx: Fix FC-AL connection target discovery
        scsi: core: Avoid that a kernel warning appears during system resume
        scsi: core: Also call destroy_rcu_head() for passthrough requests
        scsi: iscsi: flush running unbind operations when removing a session
      1fa8109f
  3. 24 3月, 2019 5 次提交
    • A
      clocksource/drivers/clps711x: Remove board support · 2a6a8e2d
      Alexander Shiyan 提交于
      Since board support for the CLPS711X platform was removed,
      remove the board support from the clps711x-timer driver.
      Signed-off-by: NAlexander Shiyan <shc_work@mail.ru>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Link: https://lkml.kernel.org/r/20181220111626.17140-1-shc_work@mail.ru
      2a6a8e2d
    • L
      Merge tag 'io_uring-20190323' of git://git.kernel.dk/linux-block · 1bdd3dbf
      Linus Torvalds 提交于
      Pull io_uring fixes and improvements from Jens Axboe:
       "The first five in this series are heavily inspired by the work Al did
        on the aio side to fix the races there.
      
        The last two re-introduce a feature that was in io_uring before it got
        merged, but which I pulled since we didn't have a good way to have
        BVEC iters that already have a stable reference. These aren't
        necessarily related to block, it's just how io_uring pins fixed
        buffers"
      
      * tag 'io_uring-20190323' of git://git.kernel.dk/linux-block:
        block: add BIO_NO_PAGE_REF flag
        iov_iter: add ITER_BVEC_FLAG_NO_REF flag
        io_uring: mark me as the maintainer
        io_uring: retry bulk slab allocs as single allocs
        io_uring: fix poll races
        io_uring: fix fget/fput handling
        io_uring: add prepped flag
        io_uring: make io_read/write return an integer
        io_uring: use regular request ref counts
      1bdd3dbf
    • L
      Merge tag 'for-linus-20190323' of git://git.kernel.dk/linux-block · 2335cbe6
      Linus Torvalds 提交于
      Pull block fixes from Jens Axboe:
       "A set of fixes/changes that should go into this series. This contains:
      
         - Kernel doc / comment updates (Bart, Shenghui)
      
         - Un-export of core-only used function (Bart)
      
         - Fix race on loop file access (Dongli)
      
         - pf/pcd queue cleanup fixes (me)
      
         - Use appropriate helper for RESTART bit set (Yufen)
      
         - Use named identifier for classic poll (Yufen)"
      
      * tag 'for-linus-20190323' of git://git.kernel.dk/linux-block:
        sbitmap: trivial - update comment for sbitmap_deferred_clear_bit
        blkcg: Fix kernel-doc warnings
        blk-iolatency: #include "blk.h"
        block: Unexport blk_mq_add_to_requeue_list()
        block: add BLK_MQ_POLL_CLASSIC for hybrid poll and return EINVAL for unexpected value
        blk-mq: remove unused 'nr_expired' from blk_mq_hw_ctx
        loop: access lo_backing_file only when the loop device is Lo_bound
        blk-mq: use blk_mq_sched_mark_restart_hctx to set RESTART
        paride/pcd: cleanup queues when detection fails
        paride/pf: cleanup queues when detection fails
      2335cbe6
    • L
      Merge tag 'ceph-for-5.1-rc2' of git://github.com/ceph/ceph-client · 9a1050ad
      Linus Torvalds 提交于
      Pull ceph fixes from Ilya Dryomov:
       "A follow up for the new alloc_size logic and a blacklisting fix,
        marked for stable"
      
      * tag 'ceph-for-5.1-rc2' of git://github.com/ceph/ceph-client:
        rbd: drop wait_for_latest_osdmap()
        libceph: wait for latest osdmap in ceph_monc_blacklist_add()
        rbd: set io_min, io_opt and discard_granularity to alloc_size
      9a1050ad
    • D
      ext4: prohibit fstrim in norecovery mode · 18915b58
      Darrick J. Wong 提交于
      The ext4 fstrim implementation uses the block bitmaps to find free space
      that can be discarded.  If we haven't replayed the journal, the bitmaps
      will be stale and we absolutely *cannot* use stale metadata to zap the
      underlying storage.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      18915b58
  4. 23 3月, 2019 3 次提交
    • Z
      ext4: cleanup bh release code in ext4_ind_remove_space() · 5e86bdda
      zhangyi (F) 提交于
      Currently, we are releasing the indirect buffer where we are done with
      it in ext4_ind_remove_space(), so we can see the brelse() and
      BUFFER_TRACE() everywhere.  It seems fragile and hard to read, and we
      may probably forget to release the buffer some day.  This patch cleans
      up the code by putting of the code which releases the buffers to the
      end of the function.
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      5e86bdda
    • Z
      ext4: brelse all indirect buffer in ext4_ind_remove_space() · 674a2b27
      zhangyi (F) 提交于
      All indirect buffers get by ext4_find_shared() should be released no
      mater the branch should be freed or not. But now, we forget to release
      the lower depth indirect buffers when removing space from the same
      higher depth indirect block. It will lead to buffer leak and futher
      more, it may lead to quota information corruption when using old quota,
      consider the following case.
      
       - Create and mount an empty ext4 filesystem without extent and quota
         features,
       - quotacheck and enable the user & group quota,
       - Create some files and write some data to them, and then punch hole
         to some files of them, it may trigger the buffer leak problem
         mentioned above.
       - Disable quota and run quotacheck again, it will create two new
         aquota files and write the checked quota information to them, which
         probably may reuse the freed indirect block(the buffer and page
         cache was not freed) as data block.
       - Enable quota again, it will invoke
         vfs_load_quota_inode()->invalidate_bdev() to try to clean unused
         buffers and pagecache. Unfortunately, because of the buffer of quota
         data block is still referenced, quota code cannot read the up to date
         quota info from the device and lead to quota information corruption.
      
      This problem can be reproduced by xfstests generic/231 on ext3 file
      system or ext4 file system without extent and quota features.
      
      This patch fix this problem by releasing the missing indirect buffers,
      in ext4_ind_remove_space().
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      674a2b27
    • G
      genirq: Mark expected switch case fall-through · 93417a3f
      Gustavo A. R. Silva 提交于
      In preparation to enabling -Wimplicit-fallthrough, mark switch
      cases where we are expecting to fall through.
      
      With -Wimplicit-fallthrough added to CFLAGS:
      
       kernel/irq/manage.c: In function ‘irq_do_set_affinity’:
       kernel/irq/manage.c:198:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
         cpumask_copy(desc->irq_common_data.affinity, mask);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       kernel/irq/manage.c:199:2: note: here
         case IRQ_SET_MASK_OK_NOCOPY:
         ^~~~
      
      Annotate it.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/20190228213714.GA9246@embeddedor
      93417a3f