1. 21 Feb 2019, 4 commits
    • KVM: Explicitly define the "memslot update in-progress" bit · 361209e0
      Sean Christopherson committed
      KVM uses bit 0 of the memslots generation as an "update in-progress"
      flag, which is used by x86 to prevent caching MMIO access while the
      memslots are changing.  Although the intended behavior is flag-like,
      e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
      caching data from in-flux memslots, the implementation oftentimes treats
      the bit as part of the generation number itself, e.g. a memslot update
      increments the generation twice, once to set the flag and once to clear it.
      
      Prior to commit 4bd518f1 ("KVM: use separate generations for
      each address space"), incorporating the "update in-progress" bit into
      the generation number largely made sense, e.g. "real" generations are
      even, "bogus" generations are odd, most code doesn't need to be aware of
      the bit, etc...
      
      Now that unique memslots generation numbers are assigned to each address
      space, stealthing the in-progress status into the generation number
      results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
      over bit 0 when initializing the memslots generation without any hint as
      to why.
      
      Explicitly define the flag and convert as much code as possible (which
      isn't much) to actually treat it like a flag.  This paves the way for
      eventually using a different bit for "update in-progress" so that it can
      be a flag in truth instead of an awkward extension to the generation
      number.
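
      As a rough sketch of the scheme (the macro name below is made up for the
      example and need not match the one the patch introduces):

          /* Bit 0 of the memslots generation doubles as an update flag. */
          #define MEMSLOT_GEN_UPDATE_IN_PROGRESS  BIT_ULL(0)

          /* Set the flag while the memslots are being rewritten... */
          slots->generation |= MEMSLOT_GEN_UPDATE_IN_PROGRESS;
          ...
          /* ...then clear it and bump the "real" (even) generation when done. */
          slots->generation = (slots->generation & ~MEMSLOT_GEN_UPDATE_IN_PROGRESS) + 2;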
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Sean Christopherson committed
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
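
      Conceptually the fix boils down to reordering the update path along these
      lines (a simplified sketch of install_new_memslots(), not the literal diff):

          gen = old_memslots->generation;              /* old generation          */
          rcu_assign_pointer(kvm->memslots[as_id], slots);
          synchronize_srcu_expedited(&kvm->srcu);

          gen += 2;                                    /* new "real" generation   */
          kvm_arch_memslots_updated(kvm, gen);         /* zap stale MMIO sptes    */
          slots->generation = gen;                     /* only then publish it    */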
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Add memcg accounting to KVM allocations · b12ce36a
      Ben Gardon committed
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
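
      The conversion is mostly mechanical: allocations that live as long as the VM
      move from plain GFP_KERNEL to GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT),
      e.g. (illustrative, not an exact hunk from the patch):

          /* before: charged to the kernel, invisible to the VM's memcg */
          page = alloc_page(GFP_KERNEL | __GFP_ZERO);

          /* after: charged to the cgroup of the process that owns the VM */
          page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);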
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. They include:
              vcpu->run
              kvm->coalesced_mmio_ring
      These allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Use struct_size() in kmalloc() · 90952cd3
      Gustavo A. R. Silva committed
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 08 Feb 2019, 1 commit
    • kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) · cfa39381
      Jann Horn committed
      kvm_ioctl_create_device() does the following:
      
      1. creates a device that holds a reference to the VM object (with a borrowed
         reference, the VM's refcount has not been bumped yet)
      2. initializes the device
      3. transfers the reference to the device to the caller's file descriptor table
      4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
         reference
      
      The ownership transfer in step 3 must not happen before the reference to the VM
      becomes a proper, non-borrowed reference, which only happens in step 4.
      After step 3, an attacker can close the file descriptor and drop the borrowed
      reference, which can cause the refcount of the kvm object to drop to zero.
      
      This means that we need to grab a reference for the device before
      anon_inode_getfd(), otherwise the VM can disappear from under us.
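
      In other words, take the reference for the device before publishing the fd,
      and drop it again if installing the fd fails. A sketch of the idea (not the
      exact patch):

          kvm_get_kvm(kvm);                /* reference for the device, taken early */
          ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev,
                                 O_RDWR | O_CLOEXEC);
          if (ret < 0) {
                  kvm_put_kvm(kvm);        /* undo on failure */
                  ops->destroy(dev);
                  return ret;
          }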
      
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: stable@kernel.org
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 12 Jan 2019, 1 commit
  4. 04 Jan 2019, 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds committed
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
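
      The conversion itself is a one-liner at each call site, e.g.:

          /* before */
          if (!access_ok(VERIFY_WRITE, buf, len))
                  return -EFAULT;

          /* after */
          if (!access_ok(buf, len))
                  return -EFAULT;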
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 29 Dec 2018, 1 commit
    • mm/mmu_notifier: use structure for invalidate_range_start/end callback · 5d6527a7
      Jérôme Glisse committed
      Patch series "mmu notifier contextual informations", v2.
      
      This patchset adds contextual information, i.e. why an invalidation is
      happening, to the mmu notifier callbacks.  This is necessary for users of
      mmu notifiers that wish to maintain their own data structures without
      having to add new fields to struct vm_area_struct (vma).
      
      For instance, a device can have its own page table that mirrors the
      process address space.  When a vma is unmapped (munmap() syscall), the
      device driver can free the device page table for the range.

      Today we do not have any information on why an mmu notifier callback is
      happening, and thus the device driver has to assume that it is always a
      munmap().  This is inefficient, as it means the driver needs to
      re-allocate the device page table on the next page fault and rebuild the
      whole device driver data structure for the range.
      
      Other use cases besides munmap() also exist; for instance, it is
      pointless for the device driver to invalidate the device page table when
      the invalidation is only for soft-dirty tracking.  Or the device driver
      can optimize away an mprotect() that changes the page table access
      permissions for the range.

      This patchset enables all these optimizations for device drivers.  I do
      not include any of them in this series, but another patchset I am posting
      will leverage this.
      
      The patchset is pretty simple from a code point of view.  The first two
      patches consolidate all mmu notifier arguments into a struct so that it is
      easier to add/change arguments.  The last patch adds the contextual
      information (munmap, protection, soft dirty, clear, ...).
      
      This patch (of 3):
      
      To avoid having to change many callback definitions every time we want
      to add a parameter, use a structure to group all parameters for the
      mmu_notifier invalidate_range_start/end callbacks.  No functional changes
      with this patch.
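
      A minimal sketch of the shape of the change (the exact layout of the struct
      may differ in detail):

          struct mmu_notifier_range {
                  struct mm_struct *mm;
                  unsigned long start;
                  unsigned long end;
                  bool blockable;
          };

          /* the callbacks take the struct instead of a long argument list */
          int (*invalidate_range_start)(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range);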
      
      [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
      Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 21 Dec 2018, 3 commits
  7. 14 Dec 2018, 3 commits
    • kvm: introduce manual dirty log reprotect · 2a31b9db
      Paolo Bonzini committed
      There are two problems with KVM_GET_DIRTY_LOG.  First, and less important,
      it can take kvm->mmu_lock for an extended period of time.  Second, its user
      can actually see many false positives in some cases.  The latter is due
      to a benign race like this:
      
        1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
           them.
        2. The guest modifies the pages, causing them to be marked dirty.
        3. Userspace actually copies the pages.
        4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
           they were not written to since (3).
      
      This is especially a problem for large guests, where the time between
      (1) and (3) can be substantial.  This patch introduces a new
      capability which, when enabled, makes KVM_GET_DIRTY_LOG not
      write-protect the pages it returns.  Instead, userspace has to
      explicitly clear the dirty log bits just before using the content
      of the page.  The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
      64-page granularity rather than requiring to sync a full memslot;
      this way, the mmu_lock is taken for small amounts of time, and
      only a small amount of time will pass between write protection
      of pages and the sending of their content.
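
      From userspace the flow looks roughly like this (a hedged sketch; see the KVM
      API documentation for the exact capability name and structure layout):

          struct kvm_enable_cap cap = { .cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT };
          ioctl(vm_fd, KVM_ENABLE_CAP, &cap);            /* opt in once            */

          ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);         /* no write-protection    */
          /* ... copy out the pages reported as dirty ...                          */

          struct kvm_clear_dirty_log clear = {
                  .slot         = slot,
                  .first_page   = first_page,            /* 64-page aligned        */
                  .num_pages    = num_pages,
                  .dirty_bitmap = bitmap,
          };
          ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);     /* re-protect this range  */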
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: rename last argument to kvm_get_dirty_log_protect · 8fe65a82
      Paolo Bonzini committed
      When manual dirty log reprotect will be enabled, kvm_get_dirty_log_protect's
      pointer argument will always be false on exit, because no TLB flush is needed
      until the manual re-protection operation.  Rename it from "is_dirty" to "flush",
      which more accurately tells the caller what they have to do with it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic · e5d83c74
      Paolo Bonzini committed
      The first such capability to be handled in virt/kvm/ will be manual
      dirty page reprotection.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 27 Oct 2018, 1 commit
  9. 17 Oct 2018, 4 commits
  10. 23 Aug 2018, 1 commit
    • mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko committed
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee a forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover,
      the majority of notifiers only care about a portion of the address space and
      there is absolutely zero reason to fail when we are unmapping an unrelated
      range.  Many notifiers do really block and wait for HW which is harder to
      handle and we have to bail out though.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continue as long as we do not block down the call chain.
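
      The typical pattern in a converted notifier is (simplified sketch; "ctx"
      stands for whatever per-notifier state the driver protects):

          static int my_invalidate_range_start(struct mmu_notifier *mn,
                                               struct mm_struct *mm,
                                               unsigned long start, unsigned long end,
                                               bool blockable)
          {
                  if (blockable)
                          mutex_lock(&ctx->lock);
                  else if (!mutex_trylock(&ctx->lock))
                          return -EAGAIN;       /* let the oom_reaper retry later */
                  ...
          }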
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g.  after the test faults in all
      the mmu notifier managed memory and set the hard limit to something really
      small.  Then we are looking for a proper process tear down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 06 Aug 2018, 3 commits
  12. 21 Jul 2018, 1 commit
  13. 13 Jul 2018, 1 commit
  14. 22 Jun 2018, 1 commit
  15. 20 Jun 2018, 1 commit
  16. 13 Jun 2018, 1 commit
    • treewide: Use array_size() in vmalloc() · 42bc47b3
      Kees Cook committed
      The vmalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vmalloc(a * b)
      
      with:
              vmalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vmalloc(a * b * c)
      
      with:
      
              vmalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vmalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vmalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vmalloc(C1 * C2 * C3, ...)
      |
        vmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vmalloc(C1 * C2, ...)
      |
        vmalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
  17. 02 Jun 2018, 2 commits
    • kvm: no need to check return value of debugfs_create functions · 929f45e3
      Greg Kroah-Hartman committed
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      This cleans up the error handling a lot, as this code will never get
      hit.
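
      That is, callers can now simply do the following and carry on, whether or not
      debugfs is available:

          kvm_debugfs_dir = debugfs_create_dir("kvm", NULL);
          /* no NULL/error check, no bail-out path */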
      
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Cc: Eric Auger <eric.auger@redhat.com>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Cc: kvm-ppc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: kvmarm@lists.cs.columbia.edu
      Cc: kvm@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: Change return type to vm_fault_t · 1499fa80
      Souptick Joarder committed
      Use new return type vm_fault_t for fault handler. For
      now, this is just documenting that the function returns
      a VM_FAULT value rather than an errno. Once all instances
      are converted, vm_fault_t will become a distinct type.
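
      The change is purely a type substitution in the fault handler signatures,
      e.g.:

          /* before */
          static int kvm_vcpu_fault(struct vm_fault *vmf);

          /* after */
          static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf);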
      
      commit 1c8f4220 ("mm: change return type to vm_fault_t")
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  18. 26 May 2018, 1 commit
  19. 25 May 2018, 1 commit
    • KVM: arm/arm64: Introduce kvm_arch_vcpu_run_pid_change · bd2a6394
      Christoffer Dall committed
      KVM/ARM differs from other architectures in having to maintain an
      additional virtual address space from that of the host and the
      guest, because we split the execution of KVM across both EL1 and
      EL2.
      
      This results in a need to explicitly map data structures into EL2
      (hyp) which are accessed from the hyp code.  As we are about to be
      more clever with our FPSIMD handling on arm64, which stores data in
      the task struct and uses thread_info flags, we will have to map
      parts of the currently executing task struct into the EL2 virtual
      address space.
      
      However, we don't want to do this on every KVM_RUN, because it is a
      fairly expensive operation to walk the page tables, and the common
      execution mode is to map a single thread to a VCPU.  By introducing
      a hook that architectures can select with
      HAVE_KVM_VCPU_RUN_PID_CHANGE, we do not introduce overhead for
      other architectures, but have a simple way to only map the data we
      need when required for arm64.
      
      This patch introduces the framework only, and wires it up in the
      arm/arm64 KVM common code.
      
      No functional change.
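
      The framework boils down to a Kconfig-gated hook that defaults to a no-op,
      called from the KVM_RUN path when the vcpu is run from a new task (simplified
      sketch):

          #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
          int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
          #else
          static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
          {
                  return 0;
          }
          #endif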
      Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
  20. 07 Mar 2018, 1 commit
  21. 24 Feb 2018, 1 commit
    • KVM: mmu: Fix overlap between public and private memslots · b28676bb
      Wanpeng Li committed
      Reported by syzkaller:
      
          pte_list_remove: ffff9714eb1f8078 0->BUG
          ------------[ cut here ]------------
          kernel BUG at arch/x86/kvm/mmu.c:1157!
          invalid opcode: 0000 [#1] SMP
          RIP: 0010:pte_list_remove+0x11b/0x120 [kvm]
          Call Trace:
           drop_spte+0x83/0xb0 [kvm]
           mmu_page_zap_pte+0xcc/0xe0 [kvm]
           kvm_mmu_prepare_zap_page+0x81/0x4a0 [kvm]
           kvm_mmu_invalidate_zap_all_pages+0x159/0x220 [kvm]
           kvm_arch_flush_shadow_all+0xe/0x10 [kvm]
           kvm_mmu_notifier_release+0x6c/0xa0 [kvm]
           ? kvm_mmu_notifier_release+0x5/0xa0 [kvm]
           __mmu_notifier_release+0x79/0x110
           ? __mmu_notifier_release+0x5/0x110
           exit_mmap+0x15a/0x170
           ? do_exit+0x281/0xcb0
           mmput+0x66/0x160
           do_exit+0x2c9/0xcb0
           ? __context_tracking_exit.part.5+0x4a/0x150
           do_group_exit+0x50/0xd0
           SyS_exit_group+0x14/0x20
           do_syscall_64+0x73/0x1f0
           entry_SYSCALL64_slow_path+0x25/0x25
      
      The reason is that when creating a new memslot, there is no guarantee that
      the new memslot does not overlap with private memslots. This can be
      triggered by the following program:
      
         #include <fcntl.h>
         #include <pthread.h>
         #include <setjmp.h>
         #include <signal.h>
         #include <stddef.h>
         #include <stdint.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <sys/ioctl.h>
         #include <sys/stat.h>
         #include <sys/syscall.h>
         #include <sys/types.h>
         #include <unistd.h>
         #include <linux/kvm.h>
      
         long r[16];
      
         int main()
         {
      	void *p = valloc(0x4000);
      
      	r[2] = open("/dev/kvm", 0);
      	r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);
      
      	uint64_t addr = 0xf000;
      	ioctl(r[3], KVM_SET_IDENTITY_MAP_ADDR, &addr);
      	r[6] = ioctl(r[3], KVM_CREATE_VCPU, 0x0ul);
      	ioctl(r[3], KVM_SET_TSS_ADDR, 0x0ul);
      	ioctl(r[6], KVM_RUN, 0);
      	ioctl(r[6], KVM_RUN, 0);
      
      	struct kvm_userspace_memory_region mr = {
      		.slot = 0,
      		.flags = KVM_MEM_LOG_DIRTY_PAGES,
      		.guest_phys_addr = 0xf000,
      		.memory_size = 0x4000,
      		.userspace_addr = (uintptr_t) p
      	};
      	ioctl(r[3], KVM_SET_USER_MEMORY_REGION, &mr);
      	return 0;
         }
      
      This patch fixes the bug by not adding a new memslot even if it
      overlaps with private memslots.
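
      Conceptually, the overlap check in __kvm_set_memory_region() simply stops
      skipping the private (id >= KVM_USER_MEM_SLOTS) slots (a sketch of the idea,
      not the literal hunk):

          kvm_for_each_memslot(slot, __kvm_memslots(kvm, as_id)) {
                  if (slot->id == id)      /* only skip the slot being replaced */
                          continue;
                  if (!((new.base_gfn + new.npages <= slot->base_gfn) ||
                        (new.base_gfn >= slot->base_gfn + slot->npages)))
                          return -EEXIST;  /* reject the overlapping request */
          }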
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      ---
       virt/kvm/kvm_main.c | 3 +--
       1 file changed, 1 insertion(+), 2 deletions(-)
  22. 01 Feb 2018, 3 commits
    • mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks · 5ff7091f
      David Rientjes committed
      Commit 4d4bbd85 ("mm, oom_reaper: skip mm structs with mmu
      notifiers") prevented the oom reaper from unmapping private anonymous
      memory with the oom reaper when the oom victim mm had mmu notifiers
      registered.
      
      The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
      around the unmap_page_range(), which is needed, can block and the oom
      killer will stall forever waiting for the victim to exit, which may not
      be possible without reaping.
      
      That concern is real, but only true for mmu notifiers that have
      blockable invalidate_range_{start,end}() callbacks.  This patch adds a
      "flags" field to mmu notifier ops that can set a bit to indicate that
      these callbacks do not block.
      
      The implementation is steered toward an expensive slowpath, such as
      after the oom reaper has grabbed mm->mmap_sem of a still alive oom
      victim.
      
      [rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
      [akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Dimitri Sivanich <sivanich@hpe.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kvm: embed vcpu id to dentry of vcpu anon inode · e46b4692
      Masatake YAMATO committed
      All d-entries for vcpus have the same name, "anon_inode:kvm-vcpu". That
      means it is impossible to know the mapping between vcpu file descriptors
      and vcpus from userland.
      
          # LC_ALL=C ls -l /proc/617/fd | grep vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu
      
      It is also impossible to know the mapping between the vma for a kvm_run
      structure and its vcpu from userland.
      
          # LC_ALL=C grep vcpu /proc/617/maps
          7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu
          7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu
      
      This change adds the vcpu id to the d-entries for vcpus. With this change
      you can get the following output:
      
          # LC_ALL=C ls -l /proc/617/fd | grep vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu:0
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu:1
      
          # LC_ALL=C grep vcpu /proc/617/maps
          7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:0
          7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:1
      
      With the mappings known from the output, a tool like strace can report more details
      of qemu-kvm process activities. Here is the strace output of my local prototype:
      
          # ./strace -KK -f -p 617 2>&1 | grep 'KVM_RUN\| K'
          ...
          [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
           K ready_for_interrupt_injection=1, if_flag=0, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
           K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
          [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
           K ready_for_interrupt_injection=1, if_flag=1, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
           K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
          ...
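
      The change itself is small: build the d-entry name from the vcpu id before
      creating the anon inode fd (sketch; the buffer sizing is illustrative):

          char name[32];

          snprintf(name, sizeof(name), "kvm-vcpu:%d", vcpu->vcpu_id);
          fd = anon_inode_getfd(name, &kvm_vcpu_fops, vcpu, O_RDWR | O_CLOEXEC);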
      Signed-off-by: Masatake YAMATO <yamato@redhat.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: Map PFN-type memory regions as writable (if possible) · a340b3e2
      KarimAllah Ahmed committed
      For EPT-violations that are triggered by a read, the pages are also mapped with
      write permissions (if their memory region is also writable). That would avoid
      getting yet another fault on the same page when a write occurs.
      
      This optimization currently only happens when there is a "struct page" backing
      the memory region, so also enable it for memory regions that do not have a
      "struct page".
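
      Roughly, when the pfn is resolved through the vma (no struct page), the
      mapping is now reported writable whenever the vma itself permits writes
      (hedged sketch of the idea):

          /* hva_to_pfn_remapped()-style path, simplified */
          if (writable)
                  *writable = vma->vm_flags & VM_WRITE;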
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  23. 16 Jan 2018, 1 commit
  24. 14 Dec 2017, 2 commits
    • KVM: introduce kvm_arch_vcpu_async_ioctl · 5cb0944c
      Paolo Bonzini committed
      After the vcpu_load/vcpu_put pushdown, the handling of asynchronous VCPU
      ioctl is already much clearer in that it is obvious that they bypass
      vcpu_load and vcpu_put.
      
      However, it is still not perfect in that the different state of the VCPU
      mutex is still hidden in the caller.  Separate those ioctls into a new
      function kvm_arch_vcpu_async_ioctl that returns -ENOIOCTLCMD for more
      "traditional" synchronous ioctls.
      
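      The resulting dispatch in the common code looks roughly like this (sketch):

          static long kvm_vcpu_ioctl(struct file *filp, unsigned int ioctl,
                                     unsigned long arg)
          {
                  ...
                  /* asynchronous ioctls are handled without taking vcpu->mutex */
                  r = kvm_arch_vcpu_async_ioctl(filp, ioctl, arg);
                  if (r != -ENOIOCTLCMD)
                          return r;

                  if (mutex_lock_killable(&vcpu->mutex))
                          return -EINTR;
                  /* "traditional" synchronous ioctls continue from here */
                  ...
          }
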
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Suggested-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl · 9b062471
      Christoffer Dall committed
      Move the calls to vcpu_load() and vcpu_put() in to the architecture
      specific implementations of kvm_arch_vcpu_ioctl() which dispatches
      further architecture-specific ioctls on to other functions.
      
      Some architectures support asynchronous vcpu ioctls which cannot call
      vcpu_load() or take the vcpu->mutex, because that would prevent
      concurrent execution with a running VCPU, which is the intended purpose
      of these ioctls, for example because they inject interrupts.
      
      We repeat the separate checks for these specifics in the architecture
      code for MIPS, S390 and PPC, and avoid taking the vcpu->mutex and
      calling vcpu_load for these ioctls.
      Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>