1. 27 Aug, 2013 1 commit
    • kvm: optimize away THP checks in kvm_is_mmio_pfn() · 11feeb49
      Andrea Arcangeli committed
      The checks on PG_reserved in the page structure on head and tail pages
      aren't necessary because split_huge_page wouldn't transfer the
      PG_reserved bit from head to tail anyway.
      
      This was a forward-thinking check done in the case PageReserved was
      set by a driver-owned page mapped in userland with something like
      remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not
      possible right now). It was meant to be very safe, but it's overkill
      as it's unlikely split_huge_page could ever run without the driver
      noticing and tearing down the hugepage itself.
      
      And if a driver in the future really wants to map a reserved
      hugepage in userland using a huge pmd, it should simply take care of
      marking all subpages reserved too, to keep KVM safe. This of course
      would require such a hypothetical driver to tear down the huge pmd
      and split the hugepage itself, instead of relying on
      split_huge_page, but that sounds very reasonable, especially
      considering split_huge_page wouldn't currently transfer the reserved
      bit anyway.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
  2. 26 Aug, 2013 1 commit
  3. 29 Jul, 2013 1 commit
  4. 18 Jul, 2013 2 commits
  5. 04 Jun, 2013 1 commit
    • kvm: exclude ioeventfd from counting kvm_io_range limit · 6ea34c9b
      Amos Kong committed
      We can easily reach the 1000 limit by starting a VM with a couple
      hundred I/O devices (multifunction=on). The hardcoded limit has
      already been adjusted 3 times (6 -> 200 -> 300 -> 1000).
      
      In userspace, the maximum number of file descriptors already
      limits the ioeventfd count. But kvm_io_bus devices are also used
      for pit, pic, ioapic, coalesced_mmio. Those cannot be limited
      by the maximum number of file descriptors.
      
      Currently only ioeventfds take up too many kvm_io_bus devices,
      so just exclude them from counting against the kvm_io_range limit.
      
      Also fix one indentation issue in kvm_host.h.
      Signed-off-by: Amos Kong <akong@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
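The counting rule described in this commit can be sketched with a small model. This is not the kernel code: `IoBus`, `can_register` and `register` are hypothetical names, and the only value taken from the commit message is the limit of 1000. The idea shown is that ioeventfds are subtracted from the device count before it is compared with the limit.

```python
from dataclasses import dataclass

NR_IOBUS_DEVS = 1000  # the limit cited in the commit message


@dataclass
class IoBus:
    """Toy stand-in for a kvm_io_bus: only the two counters matter here."""
    dev_count: int = 0
    ioeventfd_count: int = 0


def can_register(bus: IoBus) -> bool:
    """Return True if one more device may be added to the bus.

    ioeventfds are subtracted first, so only the remaining devices
    (pit, pic, ioapic, coalesced_mmio, ...) count against the limit.
    """
    return bus.dev_count - bus.ioeventfd_count < NR_IOBUS_DEVS


def register(bus: IoBus, is_ioeventfd: bool = False) -> bool:
    """Try to add a device; return False when the limit is reached."""
    if not can_register(bus):
        return False
    bus.dev_count += 1
    if is_ioeventfd:
        bus.ioeventfd_count += 1
    return True
```

In this model, any number of ioeventfds can be registered without consuming the budget, while the non-ioeventfd devices are still capped at `NR_IOBUS_DEVS`.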
  6. 12 May, 2013 1 commit
  7. 09 May, 2013 1 commit
  8. 08 May, 2013 1 commit
    • KVM: Fix kvm_irqfd_init initialization · 7dac16c3
      Asias He committed
      Since commit a0f155e9 'KVM: Initialize irqfd from kvm_init()', when
      kvm_init() is called a second time (e.g. kvm-amd.ko and kvm-intel.ko),
      kvm_arch_init() fails with -EEXIST, and kvm_irqfd_exit() is then
      called on the error handling path. As a result, the kvm_irqfd system
      is no longer ready.
      
      This patch fixes the following:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30
      PGD 0
      Oops: 0002 [#1] SMP
      Modules linked in: vhost_net
      CPU 6
      Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. OptiPlex 790/0V5HMK
      RIP: 0010:[<ffffffff81c0721e>]  [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30
      RSP: 0018:ffff880221721cc8  EFLAGS: 00010046
      RAX: 0000000000000100 RBX: ffff88022dcc003f RCX: ffff880221734950
      RDX: ffff8802208f6ca8 RSI: 000000007fffffff RDI: 0000000000000000
      RBP: ffff880221721cc8 R08: 0000000000000002 R09: 0000000000000002
      R10: 00007f7fd01087e0 R11: 0000000000000246 R12: ffff8802208f6ca8
      R13: 0000000000000080 R14: ffff880223e2a900 R15: 0000000000000000
      FS:  00007f7fd38488e0(0000) GS:ffff88022dcc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 000000022309f000 CR4: 00000000000427e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process qemu-system-x86 (pid: 4257, threadinfo ffff880221720000, task ffff880222bd5640)
      Stack:
       ffff880221721d08 ffffffff810ac5c5 ffff88022431dc00 0000000000000086
       0000000000000080 ffff880223e2a900 ffff8802208f6ca8 0000000000000000
       ffff880221721d48 ffffffff810ac8fe 0000000000000000 ffff880221734000
      Call Trace:
       [<ffffffff810ac5c5>] __queue_work+0x45/0x2d0
       [<ffffffff810ac8fe>] queue_work_on+0x8e/0xa0
       [<ffffffff810ac949>] queue_work+0x19/0x20
       [<ffffffff81009b6b>] irqfd_deactivate+0x4b/0x60
       [<ffffffff8100a69d>] kvm_irqfd+0x39d/0x580
       [<ffffffff81007a27>] kvm_vm_ioctl+0x207/0x5b0
       [<ffffffff810c9545>] ? update_curr+0xf5/0x180
       [<ffffffff811b66e8>] do_vfs_ioctl+0x98/0x550
       [<ffffffff810c1f5e>] ? finish_task_switch+0x4e/0xe0
       [<ffffffff81c054aa>] ? __schedule+0x2ea/0x710
       [<ffffffff811b6bf7>] sys_ioctl+0x57/0x90
       [<ffffffff8140ae9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
       [<ffffffff81c0f602>] system_call_fastpath+0x16/0x1b
      Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00 <f0> 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f
      RIP  [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30
      RSP <ffff880221721cc8>
      CR2: 0000000000000000
      ---[ end trace 13fb1e4b6e5ab21f ]---
      Signed-off-by: Asias He <asias@redhat.com>
      Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
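The failure mode in this commit message comes down to initialization order, which a toy model can illustrate. Everything below is hypothetical (`KvmModel` and its methods are not kernel symbols), and the "arch check first" ordering is an assumed remedy for illustration; the commit message itself only describes the symptom.

```python
import errno


class KvmModel:
    """Toy model of module-global state shared by kvm-intel.ko and kvm-amd.ko."""

    def __init__(self):
        self.arch_owner = None      # which implementation claimed the arch slot
        self.irqfd_ready = False    # stands in for the shared irqfd state

    def arch_init(self, owner):
        if self.arch_owner is not None:
            return -errno.EEXIST    # a second implementation is rejected
        self.arch_owner = owner
        return 0

    def init_irqfd_first(self, owner):
        """Ordering described as broken: irqfd is set up before the arch check."""
        self.irqfd_ready = True
        r = self.arch_init(owner)
        if r:
            self.irqfd_ready = False  # error path tears down the shared state
        return r

    def init_arch_first(self, owner):
        """Assumed remedy: touch irqfd state only after the arch check passes."""
        r = self.arch_init(owner)
        if r:
            return r
        self.irqfd_ready = True
        return 0
```

With `init_irqfd_first`, a second, failing call leaves `irqfd_ready` false even though the first implementation is still loaded. With `init_arch_first`, the failing call returns before touching the shared state.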
  9. 05 May, 2013 1 commit
  10. 02 May, 2013 1 commit
    • KVM: PPC: Book3S: Add API for in-kernel XICS emulation · 5975a2e0
      Paul Mackerras committed
      This adds the API for userspace to instantiate an XICS device in a VM
      and connect VCPUs to it.  The API consists of a new device type for
      the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
      functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
      which is used to assert and deassert interrupt inputs of the XICS.
      
      The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
      Each attribute within this group corresponds to the state of one
      interrupt source.  The attribute number is the same as the interrupt
      source number.
      
      This does not support irq routing or irqfd yet.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  11. 27 Apr, 2013 6 commits
  12. 17 Apr, 2013 2 commits
  13. 16 Apr, 2013 1 commit
  14. 08 Apr, 2013 2 commits
  15. 07 Apr, 2013 1 commit
  16. 11 Mar, 2013 2 commits
  17. 06 Mar, 2013 1 commit
  18. 05 Mar, 2013 5 commits
  19. 11 Feb, 2013 1 commit
  20. 05 Feb, 2013 2 commits
    • KVM: set_memory_region: Disallow changing read-only attribute later · 75d61fbc
      Takuya Yoshikawa committed
      As Xiao pointed out, there are a few problems with changing the
      read-only attribute of an existing slot:
       - kvm_arch_commit_memory_region() write protects the memory slot only
         for GET_DIRTY_LOG when modifying the flags.
       - FNAME(sync_page) uses the old spte value to set a new one without
         checking KVM_MEM_READONLY flag.
      
      Since we flush all shadow pages when creating a new slot, the simplest
      fix is to disallow such problematic flag changes: this is safe because
      no one is doing such things.
      Reviewed-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
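Detecting the "problematic flag change" named in this commit amounts to checking whether the read-only bit differs between the old and new flag words. A minimal sketch follows; `readonly_changed` is a hypothetical helper, and the bit positions are assumed to match the kernel's uapi header rather than taken from this page.

```python
# Assumed bit positions for the memory slot flags (illustrative).
KVM_MEM_LOG_DIRTY_PAGES = 1 << 0
KVM_MEM_READONLY = 1 << 1


def readonly_changed(old_flags: int, new_flags: int) -> bool:
    """True if the two flag words differ in the read-only bit.

    XOR leaves a 1 in every bit position that changed; masking with
    KVM_MEM_READONLY keeps only the bit this commit cares about.
    """
    return bool((old_flags ^ new_flags) & KVM_MEM_READONLY)
```

A caller modelling the new behaviour would reject the request whenever `readonly_changed` returns true for an existing slot, while still allowing changes that only touch dirty logging.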
    • KVM: set_memory_region: Identify the requested change explicitly · f64c0398
      Takuya Yoshikawa committed
      KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
      what kind of change is being requested by checking the arguments.  The
      current code does this checking at various points in code and each
      condition being used there is not easy to understand at first glance.
      
      This patch consolidates these checks and introduces an enum to name the
      possible changes to clean up the code.
      
      Although this does not introduce any functional changes, there is one
      change which optimizes the code a bit: if we have nothing to change, the
      new code returns 0 immediately.
      
      Note that the return value for this case cannot be changed since QEMU
      relies on it: we noticed this when we changed it to -EINVAL and got a
      section mismatch error at the final stage of live migration.
      Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
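The consolidation described in this commit, one place that names the requested change, can be sketched as follows. The enum members, the `Slot` fields and the `classify` function are illustrative names, not the kernel's, and the set of rejected requests is a simplifying assumption of this model.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Change(Enum):
    CREATE = auto()
    DELETE = auto()
    MOVE = auto()
    FLAGS_ONLY = auto()


@dataclass(frozen=True)
class Slot:
    base_gfn: int = 0
    npages: int = 0   # npages == 0 means "no slot"
    flags: int = 0


def classify(old: Slot, new: Slot) -> Optional[Change]:
    """Name the requested change, or return None when nothing changes.

    Raises ValueError for requests this sketch treats as invalid.
    """
    if new.npages:
        if not old.npages:
            return Change.CREATE
        if new.npages != old.npages:
            raise ValueError("resizing an existing slot is not modelled")
        if new.base_gfn != old.base_gfn:
            return Change.MOVE
        if new.flags != old.flags:
            return Change.FLAGS_ONLY
        return None  # nothing to change: the caller can return 0 right away
    if old.npages:
        return Change.DELETE
    raise ValueError("cannot delete a slot that does not exist")
```

Returning `None` for the no-op case mirrors the early "return 0" the commit message mentions: the caller reports success without doing any work.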
  21. 29 Jan, 2013 2 commits
    • kvm: Handle yield_to failure return code for potential undercommit case · c45c528e
      Raghavendra K T committed
      yield_to returns -ESRCH when the run queue length of both the source
      and the target of yield_to is one. When we see three successive
      failures of yield_to, we assume we are in a potential undercommit
      case and abort from the PLE handler.
      The assumption is backed by the low probability of a wrong decision
      even in worst-case scenarios such as an average runqueue length
      between 1 and 2.
      
      More detail on rationale behind using three tries:
      if p is the probability of finding rq length one on a particular cpu,
      and if we do n tries, then probability of exiting ple handler is:
      
       p^(n+1) [ because we would have come across one source with rq length
      1 and n target cpu rqs  with length 1 ]
      
      so
      num tries:         probability of aborting ple handler (1.5x overcommit)
       1                 1/4
       2                 1/8
       3                 1/16
      
      We can lower this probability further with more tries, but the problem
      is the overhead.
      Also, if we have tried three times, we have already iterated over
      3 good eligible vcpus along with many non-eligible candidates. In the
      worst case, if we iterate over all the vcpus, 1x performance drops and
      overcommit performance takes a hit.
      
      Note that we do not update the last boosted vcpu in failure cases.
      Thanks to Avi for raising the question of aborting after the first
      yield_to failure.
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Tested-by: Chegu Vinod <chegu_vinod@hp.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
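The table in this commit message follows directly from the p^(n+1) formula. A short sketch reproduces it with exact fractions; the value p = 1/2 is inferred from the table's numbers and is not stated explicitly in the message, and `abort_probability` is a hypothetical helper name.

```python
from fractions import Fraction


def abort_probability(p: Fraction, tries: int) -> Fraction:
    """p^(tries + 1): one source plus `tries` targets, each with rq length 1."""
    return p ** (tries + 1)


# Assumption: p = 1/2, which matches the 1.5x overcommit column of the table.
p = Fraction(1, 2)
table = {n: abort_probability(p, n) for n in (1, 2, 3)}
```

Under that assumption, `table` holds 1/4, 1/8 and 1/16 for one, two and three tries, so each extra try halves the chance of aborting the PLE handler by mistake.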
    • x86, apicv: add virtual interrupt delivery support · c7c9c56c
      Yang Zhang committed
      Virtual interrupt delivery spares KVM from injecting vAPIC interrupts
      manually; the hardware takes care of it entirely. This requires some
      special handling in the existing interrupt injection path:
      
      - For a pending interrupt, instead of injecting it directly, we may
        need to update architecture-specific indicators before resuming the
        guest.
      
      - A pending interrupt that is masked by the ISR should also be
        considered in the update above, since the hardware decides when to
        inject it. The current has_interrupt and get_interrupt only return
        a valid vector from the injection point of view.
      Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
  22. 27 Jan, 2013 1 commit
  23. 17 Jan, 2013 3 commits