1. 14 7月, 2016 1 次提交
    • P
      x86/kvm: Audit and remove any unnecessary uses of module.h · 1767e931
      Paul Gortmaker 提交于
      Historically a lot of these existed because we did not have
      a distinction between what was modular code and what was providing
      support to modules via EXPORT_SYMBOL and friends.  That changed
      when we forked out support for the latter into the export.h file.
      
      This means we should be able to reduce the usage of module.h
      in code that is obj-y Makefile or bool Kconfig.  In the case of
      kvm where it is modular, we can extend that to also include files
      that are building basic support functionality but not related
      to loading or registering the final module; such files also have
      no need whatsoever for module.h
      
      The advantage in removing such instances is that module.h itself
      sources about 15 other headers; adding significantly to what we feed
      cpp, and it can obscure what headers we are effectively using.
      
      Since module.h was the source for init.h (for __init) and for
      export.h (for EXPORT_SYMBOL) we consider each instance for the
      presence of either and replace as needed.
      
      Several instances got replaced with moduleparam.h since that was
      really all that was required for those particular files.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160714001901.31603-8-paul.gortmaker@windriver.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1767e931
  2. 27 6月, 2016 3 次提交
    • Q
      KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode. · ff30ef40
      Quentin Casasnovas 提交于
      I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was
      getting a GP(0) on a rather unspecial vmread from Xen:
      
           (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=n  Not tainted ]----
           (XEN) CPU:    1
           (XEN) RIP:    e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
           (XEN) rax: ffff82d0801e6288   rbx: ffff83003ffbfb7c   rcx: fffffffffffab928
           (XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff83000bdd0000
           (XEN) rbp: ffff83000bdd0000   rsp: ffff83003ffbfab0   r8:  ffff830038813910
           (XEN) r9:  ffff83003faf3958   r10: 0000000a3b9f7640   r11: ffff83003f82d418
           (XEN) r12: 0000000000000000   r13: ffff83003ffbffff   r14: 0000000000004802
           (XEN) r15: 0000000000000008   cr0: 0000000080050033   cr4: 00000000001526e0
           (XEN) cr3: 000000003fc79000   cr2: 0000000000000000
           (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
           (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450):
           (XEN)  00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
           (XEN) Xen stack trace from rsp=ffff83003ffbfab0:
      
           ...
      
           (XEN) Xen call trace:
           (XEN)    [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN)    [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300
           (XEN)    [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60
           (XEN)    [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70
           (XEN)    [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0
           (XEN)    [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0
           (XEN)    [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0
           (XEN)    [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270
           (XEN)    [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0
           (XEN)    [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90
           (XEN)    [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140
           (XEN)    [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140
           (XEN)    [<ffff82d080164aeb>] context_switch+0x13b/0xe40
           (XEN)    [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570
           (XEN)    [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90
           (XEN)    [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50
           (XEN)
           (XEN)
           (XEN) ****************************************
           (XEN) Panic on CPU 1:
           (XEN) GENERAL PROTECTION FAULT
           (XEN) [error_code=0000]
           (XEN) ****************************************
      
      Tracing my host KVM showed it was the one injecting the GP(0) when
      emulating the VMREAD and checking the destination segment permissions in
      get_vmx_mem_address():
      
           3)               |    vmx_handle_exit() {
           3)               |      handle_vmread() {
           3)               |        nested_vmx_check_permission() {
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.065 us    |            vmx_read_guest_seg_selector();
           3)   0.066 us    |            vmx_read_guest_seg_ar();
           3)   1.636 us    |          }
           3)   0.058 us    |          vmx_get_rflags();
           3)   0.062 us    |          vmx_read_guest_seg_ar();
           3)   3.469 us    |        }
           3)               |        vmx_get_cs_db_l_bits() {
           3)   0.058 us    |          vmx_read_guest_seg_ar();
           3)   0.662 us    |        }
           3)               |        get_vmx_mem_address() {
           3)   0.068 us    |          vmx_cache_reg();
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.068 us    |            vmx_read_guest_seg_selector();
           3)   0.071 us    |            vmx_read_guest_seg_ar();
           3)   1.756 us    |          }
           3)               |          kvm_queue_exception_e() {
           3)   0.066 us    |            kvm_multiple_exception();
           3)   0.684 us    |          }
           3)   4.085 us    |        }
           3)   9.833 us    |      }
           3) + 10.366 us   |    }
      
      Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
      Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
      Control Structure", I found that we're enforcing that the destination
      operand is NOT located in a read-only data segment or any code segment when
      the L1 is in long mode - BUT that check should only happen when it is in
      protected mode.
      
      Shuffling the code a bit to make our emulation follow the specification
      allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
      without problems.
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Eugene Korenevsky <ekorenevsky@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ff30ef40
    • M
      KVM: LAPIC: cap __delay at lapic_timer_advance_ns · b606f189
      Marcelo Tosatti 提交于
      The host timer which emulates the guest LAPIC TSC deadline
      timer has its expiration diminished by lapic_timer_advance_ns
      nanoseconds. Therefore if, at wait_lapic_expire, a difference
      larger than lapic_timer_advance_ns is encountered, delay at most
      lapic_timer_advance_ns.
      
      This fixes a problem where the guest can cause the host
      to delay for large amounts of time.
      Reported-by: NAlan Jenkins <alan.christopher.jenkins@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b606f189
    • M
      KVM: x86: move nsec_to_cycles from x86.c to x86.h · 8d93c874
      Marcelo Tosatti 提交于
      Move the inline function nsec_to_cycles from x86.c to x86.h, as
      the next patch uses it from lapic.c.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8d93c874
  3. 16 6月, 2016 3 次提交
  4. 02 6月, 2016 6 次提交
    • P
      KVM: x86: fix OOPS after invalid KVM_SET_DEBUGREGS · d14bdb55
      Paolo Bonzini 提交于
      MOV to DR6 or DR7 causes a #GP if an attempt is made to write a 1 to
      any of bits 63:32.  However, this is not detected at KVM_SET_DEBUGREGS
      time, and the next KVM_RUN oopses:
      
         general protection fault: 0000 [#1] SMP
         CPU: 2 PID: 14987 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
         Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
         [...]
         Call Trace:
          [<ffffffffa072c93d>] kvm_arch_vcpu_ioctl_run+0x141d/0x14e0 [kvm]
          [<ffffffffa071405d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
          [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480
          [<ffffffff812418a9>] SyS_ioctl+0x79/0x90
          [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71
         Code: 55 83 ff 07 48 89 e5 77 27 89 ff ff 24 fd 90 87 80 81 0f 23 fe 5d c3 0f 23 c6 5d c3 0f 23 ce 5d c3 0f 23 d6 5d c3 0f 23 de 5d c3 <0f> 23 f6 5d c3 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
         RIP  [<ffffffff810639eb>] native_set_debugreg+0x2b/0x40
          RSP <ffff88005836bd50>
      
      Testcase (beautified/reduced from syzkaller output):
      
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <string.h>
          #include <stdint.h>
          #include <linux/kvm.h>
          #include <fcntl.h>
          #include <sys/ioctl.h>
      
          long r[8];
      
          int main()
          {
              struct kvm_debugregs dr = { 0 };
      
              r[2] = open("/dev/kvm", O_RDONLY);
              r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
              r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
      
              memcpy(&dr,
                     "\x5d\x6a\x6b\xe8\x57\x3b\x4b\x7e\xcf\x0d\xa1\x72"
                     "\xa3\x4a\x29\x0c\xfc\x6d\x44\x00\xa7\x52\xc7\xd8"
                     "\x00\xdb\x89\x9d\x78\xb5\x54\x6b\x6b\x13\x1c\xe9"
                     "\x5e\xd3\x0e\x40\x6f\xb4\x66\xf7\x5b\xe3\x36\xcb",
                     48);
              r[7] = ioctl(r[4], KVM_SET_DEBUGREGS, &dr);
              r[6] = ioctl(r[4], KVM_RUN, 0);
          }
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      d14bdb55
    • P
      KVM: fail KVM_SET_VCPU_EVENTS with invalid exception number · 78e546c8
      Paolo Bonzini 提交于
      This cannot be returned by KVM_GET_VCPU_EVENTS, so it is okay to return
      EINVAL.  It causes a WARN from exception_type:
      
          WARNING: CPU: 3 PID: 16732 at arch/x86/kvm/x86.c:345 exception_type+0x49/0x50 [kvm]()
          CPU: 3 PID: 16732 Comm: a.out Tainted: G        W       4.4.6-300.fc23.x86_64 #1
          Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
           0000000000000286 000000006308a48b ffff8800bec7fcf8 ffffffff813b542e
           0000000000000000 ffffffffa0966496 ffff8800bec7fd30 ffffffff810a40f2
           ffff8800552a8000 0000000000000000 00000000002c267c 0000000000000001
          Call Trace:
           [<ffffffff813b542e>] dump_stack+0x63/0x85
           [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
           [<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20
           [<ffffffffa0924809>] exception_type+0x49/0x50 [kvm]
           [<ffffffffa0934622>] kvm_arch_vcpu_ioctl_run+0x10a2/0x14e0 [kvm]
           [<ffffffffa091c04d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
           [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
           [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
           [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
          ---[ end trace b1a0391266848f50 ]---
      
      Testcase (beautified/reduced from syzkaller output):
      
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <string.h>
          #include <stdint.h>
          #include <fcntl.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>
      
          long r[31];
      
          int main()
          {
              memset(r, -1, sizeof(r));
              r[2] = open("/dev/kvm", O_RDONLY);
              r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
              r[7] = ioctl(r[3], KVM_CREATE_VCPU, 0);
      
              struct kvm_vcpu_events ve = {
                      .exception.injected = 1,
                      .exception.nr = 0xd4
              };
              r[27] = ioctl(r[7], KVM_SET_VCPU_EVENTS, &ve);
              r[30] = ioctl(r[7], KVM_RUN, 0);
              return 0;
          }
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      78e546c8
    • P
      KVM: x86: avoid vmalloc(0) in the KVM_SET_CPUID · 83676e92
      Paolo Bonzini 提交于
      This causes an ugly dmesg splat.  Beautified syzkaller testcase:
      
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <sys/ioctl.h>
          #include <fcntl.h>
          #include <linux/kvm.h>
      
          long r[8];
      
          int main()
          {
              struct kvm_cpuid2 c = { 0 };
              r[2] = open("/dev/kvm", O_RDWR);
              r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
              r[4] = ioctl(r[3], KVM_CREATE_VCPU, 0x8);
              r[7] = ioctl(r[4], KVM_SET_CPUID, &c);
              return 0;
          }
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      83676e92
    • P
      kvm: x86: avoid warning on repeated KVM_SET_TSS_ADDR · b21629da
      Paolo Bonzini 提交于
      Found by syzkaller:
      
          WARNING: CPU: 3 PID: 15175 at arch/x86/kvm/x86.c:7705 __x86_set_memory_region+0x1dc/0x1f0 [kvm]()
          CPU: 3 PID: 15175 Comm: a.out Tainted: G        W       4.4.6-300.fc23.x86_64 #1
          Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
           0000000000000286 00000000950899a7 ffff88011ab3fbf0 ffffffff813b542e
           0000000000000000 ffffffffa0966496 ffff88011ab3fc28 ffffffff810a40f2
           00000000000001fd 0000000000003000 ffff88014fc50000 0000000000000000
          Call Trace:
           [<ffffffff813b542e>] dump_stack+0x63/0x85
           [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
           [<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20
           [<ffffffffa09251cc>] __x86_set_memory_region+0x1dc/0x1f0 [kvm]
           [<ffffffffa092521b>] x86_set_memory_region+0x3b/0x60 [kvm]
           [<ffffffffa09bb61c>] vmx_set_tss_addr+0x3c/0x150 [kvm_intel]
           [<ffffffffa092f4d4>] kvm_arch_vm_ioctl+0x654/0xbc0 [kvm]
           [<ffffffffa091d31a>] kvm_vm_ioctl+0x9a/0x6f0 [kvm]
           [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
           [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
           [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      Testcase:
      
          #include <unistd.h>
          #include <sys/ioctl.h>
          #include <fcntl.h>
          #include <string.h>
          #include <linux/kvm.h>
      
          long r[8];
      
          int main()
          {
              memset(r, -1, sizeof(r));
      	r[2] = open("/dev/kvm", O_RDONLY|O_TRUNC);
              r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);
              r[5] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
              r[7] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
              return 0;
          }
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      b21629da
    • D
      KVM: Handle MSR_IA32_PERF_CTL · 0c2df2a1
      Dmitry Bilunov 提交于
      Intel CPUs having Turbo Boost feature implement an MSR to provide a
      control interface via rdmsr/wrmsr instructions. One could detect the
      presence of this feature by issuing one of these instructions and
      handling the #GP exception which is generated in case the referenced MSR
      is not implemented by the CPU.
      
      KVM's vCPU model behaves exactly as a real CPU in this case by injecting
      a fault when MSR_IA32_PERF_CTL is called (which KVM does not support).
      However, some operating systems use this register during an early boot
      stage in which their kernel is not capable of handling #GP correctly,
      causing #DP and finally a triple fault effectively resetting the vCPU.
      
      This patch implements a dummy handler for MSR_IA32_PERF_CTL to avoid the
      crashes.
      Signed-off-by: NDmitry Bilunov <kmeaw@yandex-team.ru>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      0c2df2a1
    • N
      KVM: x86: avoid write-tearing of TDP · b19ee2ff
      Nadav Amit 提交于
      In theory, nothing prevents the compiler from write-tearing PTEs, or
      split PTE writes. These partially-modified PTEs can be fetched by other
      cores and cause mayhem. I have not really encountered such case in
      real-life, but it does seem possible.
      
      For example, the compiler may try to do something creative for
      kvm_set_pte_rmapp() and perform multiple writes to the PTE.
      Signed-off-by: NNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      b19ee2ff
  5. 25 5月, 2016 1 次提交
    • R
      kvm:vmx: more complete state update on APICv on/off · 3ce424e4
      Roman Kagan 提交于
      The function to update APICv on/off state (in particular, to deactivate
      it when enabling Hyper-V SynIC) is incomplete: it doesn't adjust
      APICv-related fields among secondary processor-based VM-execution
      controls.  As a result, Windows 2012 guests get stuck when SynIC-based
      auto-EOI interrupt intersected with e.g. an IPI in the guest.
      
      In addition, the MSR intercept bitmap isn't updated every time "virtualize
      x2APIC mode" is toggled.  This path can only be triggered by a malicious
      guest, because Windows didn't use x2APIC but rather their own synthetic
      APIC access MSRs; however a guest running in a SynIC-enabled VM could
      switch to x2APIC and thus obtain direct access to host APIC MSRs
      (CVE-2016-4440).
      
      The patch fixes those omissions.
      Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
      Reported-by: NSteve Rutherford <srutherford@google.com>
      Reported-by: NYang Zhang <yang.zhang.wz@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3ce424e4
  6. 24 5月, 2016 1 次提交
  7. 19 5月, 2016 12 次提交
  8. 13 5月, 2016 1 次提交
    • C
      KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger 提交于
      Some wakeups should not be considered a sucessful poll. For example on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable - letting all vCPUs poll all the time for
      transactional like workload, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups if they
      should be considered a good/bad wakeups in regard to polls.
      
      For s390 the implementation will fence of halt polling for anything but
      known good, single vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken up CPU by marking the poll of this CPU as "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
      we prefer to not poll, unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like uperf 1byte
      transactional ping pong workload or wakeup heavy workload like OLTP
      while still providing a proper speedup.
      
      This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3491caf2
  9. 12 5月, 2016 1 次提交
  10. 11 5月, 2016 1 次提交
  11. 06 5月, 2016 1 次提交
    • A
      mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
      Andrea Arcangeli 提交于
      After the THP refcounting change, obtaining a compound pages from
      get_user_pages() no longer allows us to assume the entire compound page
      is immediately mappable from a secondary MMU.
      
      A secondary MMU doesn't want to call get_user_pages() more than once for
      each compound page, in order to know if it can map the whole compound
      page.  So a secondary MMU needs to know from a single get_user_pages()
      invocation when it can map immediately the entire compound page to avoid
      a flood of unnecessary secondary MMU faults and spurious
      atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
      users).
      
      Ideally instead of the page->_mapcount < 1 check, get_user_pages()
      should return the granularity of the "page" mapping in the "mm" passed
      to get_user_pages().  However it's non trivial change to pass the "pmd"
      status belonging to the "mm" walked by get_user_pages up the stack (up
      to the caller of get_user_pages).  So the fix just checks if there is
      not a single pte mapping on the page returned by get_user_pages, and in
      turn if the caller can assume that the whole compound page is mapped in
      the current "mm" (in a pmd_trans_huge()).  In such case the entire
      compound page is safe to map into the secondary MMU without additional
      get_user_pages() calls on the surrounding tail/head pages.  In addition
      of being faster, not having to run other get_user_pages() calls also
      reduces the memory footprint of the secondary MMU fault in case the pmd
      split happened as result of memory pressure.
      
      Without this fix after a MADV_DONTNEED (like invoked by QEMU during
      postcopy live migration or balloning) or after generic swapping (with a
      failure in split_huge_page() that would only result in pmd splitting and
      not a physical page split), KVM would map the whole compound page into
      the shadow pagetables, despite regular faults or userfaults (like
      UFFDIO_COPY) may map regular pages into the primary MMU as result of the
      pte faults, leading to the guest mode and userland mode going out of
      sync and not working on the same memory at all times.
      
      Any other secondary MMU notifier manager (KVM is just one of the many
      MMU notifier users) will need the same information if it doesn't want to
      run a flood of get_user_pages_fast and it can support multiple
      granularity in the secondary MMU mappings, so I think it is justified to
      be exposed not just to KVM.
      
      The other option would be to move transparent_hugepage_adjust to
      mm/huge_memory.c but that currently has all kind of KVM data structures
      in it, so it's definitely not a cut-and-paste work, so I couldn't do a
      fix as cleaner as this one for 4.6.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Li, Liang Z" <liang.z.li@intel.com>
      Cc: Amit Shah <amit.shah@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      127393fb
  12. 03 5月, 2016 1 次提交
  13. 29 4月, 2016 2 次提交
  14. 28 4月, 2016 1 次提交
  15. 20 4月, 2016 3 次提交
  16. 18 4月, 2016 1 次提交
  17. 13 4月, 2016 1 次提交