1. 20 9月, 2018 9 次提交
    • S
      KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Sean Christopherson 提交于
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d264ee0c
    • S
      KVM: VMX: modify preemption timer bit only when arming timer · f459a707
      Sean Christopherson 提交于
      Provide a singular location where the VMX preemption timer bit is
      set/cleared so that future usages of the preemption timer can ensure
      the VMCS bit is up-to-date without having to modify unrelated code
      paths.  For example, the preemption timer can be used to force an
      immediate VMExit.  Cache the status of the timer to avoid redundant
      VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
      VMEnters/VMExits.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f459a707
    • S
      KVM: VMX: immediately mark preemption timer expired only for zero value · 4c008127
      Sean Christopherson 提交于
      A VMX preemption timer value of '0' at the time of VMEnter is
      architecturally guaranteed to cause a VMExit prior to the CPU
      executing any instructions in the guest.  This architectural
      definition is in place to ensure that a previously expired timer
      is correctly recognized by the CPU as it is possible for the timer
      to reach zero and not trigger a VMexit due to a higher priority
      VMExit being signalled instead, e.g. a pending #DB that morphs into
      a VMExit.
      
      Whether by design or coincidence, commit f4124500 ("KVM: nVMX:
      Fully emulate preemption timer") special cased timer values of '0'
      and '1' to ensure prompt delivery of the VMExit.  Unlike '0', a
      timer value of '1' has no has no architectural guarantees regarding
      when it is delivered.
      
      Modify the timer emulation to trigger immediate VMExit if and only
      if the timer value is '0', and document precisely why '0' is special.
      Do this even if calibration of the virtual TSC failed, i.e. VMExit
      will occur immediately regardless of the frequency of the timer.
      Making only '0' a special case gives KVM leeway to be more aggressive
      in ensuring the VMExit is injected prior to executing instructions in
      the nested guest, and also eliminates any ambiguity as to why '1' is
      a special case, e.g. why wasn't the threshold for a "short timeout"
      set to 10, 100, 1000, etc...
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4c008127
    • A
      KVM: SVM: Switch to bitmap_zalloc() · a101c9d6
      Andy Shevchenko 提交于
      Switch to bitmap_zalloc() to show clearly what we are allocating.
      Besides that it returns pointer of bitmap type instead of opaque void *.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a101c9d6
    • T
      KVM/MMU: Fix comment in walk_shadow_page_lockless_end() · 9a984586
      Tianyu Lan 提交于
      kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page()
      This patch is to fix the commit.
      Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9a984586
    • L
      kvm: selftests: use -pthread instead of -lpthread · 6bd317d3
      Lei Yang 提交于
      I run into the following error
      
      testing/selftests/kvm/dirty_log_test.c:285: undefined reference to `pthread_create'
      testing/selftests/kvm/dirty_log_test.c:297: undefined reference to `pthread_join'
      collect2: error: ld returned 1 exit status
      
      my gcc version is gcc version 4.8.4
      "-pthread" would work everywhere
      Signed-off-by: NLei Yang <Lei.Yang@windriver.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6bd317d3
    • W
      KVM: x86: don't reset root in kvm_mmu_setup() · 83b20b28
      Wei Yang 提交于
      Here is the code path which shows kvm_mmu_setup() is invoked after
      kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code path,
      this means the root_hpa and prev_roots are guaranteed to be invalid. And
      it is not necessary to reset it again.
      
          kvm_vm_ioctl_create_vcpu()
              kvm_arch_vcpu_create()
                  vmx_create_vcpu()
                      kvm_vcpu_init()
                          kvm_arch_vcpu_init()
                              kvm_mmu_create()
              kvm_arch_vcpu_setup()
                  kvm_mmu_setup()
                      kvm_init_mmu()
      
      This patch set reset_roots to false in kmv_mmu_setup().
      
      Fixes: 50c28f21Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      83b20b28
    • J
      kvm: mmu: Don't read PDPTEs when paging is not enabled · d35b34a9
      Junaid Shahid 提交于
      kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
      CR4.PAE = 1.
      Signed-off-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d35b34a9
    • V
      x86/kvm/lapic: always disable MMIO interface in x2APIC mode · d1766202
      Vitaly Kuznetsov 提交于
      When VMX is used with flexpriority disabled (because of no support or
      if disabled with module parameter) MMIO interface to lAPIC is still
      available in x2APIC mode while it shouldn't be (kvm-unit-tests):
      
      PASS: apic_disable: Local apic enabled in x2APIC mode
      PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
      FAIL: apic_disable: *0xfee00030: 50014
      
      The issue appears because we basically do nothing while switching to
      x2APIC mode when APIC access page is not used. apic_mmio_{read,write}
      only check if lAPIC is disabled before proceeding to actual write.
      
      When APIC access is virtualized we correctly manipulate with VMX controls
      in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes
      in x2APIC mode so there's no issue.
      
      Disabling MMIO interface seems to be easy. The question is: what do we
      do with these reads and writes? If we add apic_x2apic_mode() check to
      apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
      go to userspace. When lAPIC is in kernel, Qemu uses this interface to
      inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This
      somehow works with disabled lAPIC but when we're in xAPIC mode we will
      get a real injected MSI from every write to lAPIC. Not good.
      
      The simplest solution seems to be to just ignore writes to the region
      and return ~0 for all reads when we're in x2APIC mode. This is what this
      patch does. However, this approach is inconsistent with what currently
      happens when flexpriority is enabled: we allocate APIC access page and
      create KVM memory region so in x2APIC modes all reads and writes go to
      this pre-allocated page which is, btw, the same for all vCPUs.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d1766202
  2. 18 9月, 2018 2 次提交
  3. 17 9月, 2018 2 次提交
  4. 16 9月, 2018 4 次提交
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 27c5a778
      Linus Torvalds 提交于
      Pull x86 fixes from Ingol Molnar:
       "Misc fixes:
      
         - EFI crash fix
      
         - Xen PV fixes
      
         - do not allow PTI on 2-level 32-bit kernels for now
      
         - documentation fix"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/APM: Fix build warning when PROC_FS is not enabled
        Revert "x86/mm/legacy: Populate the user page-table with user pgd's"
        x86/efi: Load fixmap GDT in efi_call_phys_epilog() before setting %cr3
        x86/xen: Disable CPU0 hotplug for Xen PV
        x86/EISA: Don't probe EISA bus for Xen PV guests
        x86/doc: Fix Documentation/x86/earlyprintk.txt
      27c5a778
    • L
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4314daa5
      Linus Torvalds 提交于
      Pull scheduler fixes from Ingo Molnar:
       "Misc fixes: various scheduler metrics corner case fixes, a
        sched_features deadlock fix, and a topology fix for certain NUMA
        systems"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/fair: Fix kernel-doc notation warning
        sched/fair: Fix load_balance redo for !imbalance
        sched/fair: Fix scale_rt_capacity() for SMT
        sched/fair: Fix vruntime_normalized() for remote non-migration wakeup
        sched/pelt: Fix update_blocked_averages() for RT and DL classes
        sched/topology: Set correct NUMA topology type
        sched/debug: Fix potential deadlock when writing to sched_features
      4314daa5
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c0be92b5
      Linus Torvalds 提交于
      Pull perf fixes from Ingo Molnar:
       "Mostly tooling fixes, but also breakpoint and x86 PMU driver fixes"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
        perf tools: Fix maps__find_symbol_by_name()
        tools headers uapi: Update tools's copy of linux/if_link.h
        tools headers uapi: Update tools's copy of linux/vhost.h
        tools headers uapi: Update tools's copies of kvm headers
        tools headers uapi: Update tools's copy of drm/drm.h
        tools headers uapi: Update tools's copy of asm-generic/unistd.h
        tools headers uapi: Update tools's copy of linux/perf_event.h
        perf/core: Force USER_DS when recording user stack data
        perf/UAPI: Clearly mark __PERF_SAMPLE_CALLCHAIN_EARLY as internal use
        perf/x86/intel: Add support/quirk for the MISPREDICT bit on Knights Landing CPUs
        perf annotate: Fix parsing aarch64 branch instructions after objdump update
        perf probe powerpc: Ignore SyS symbols irrespective of endianness
        perf event-parse: Use fixed size string for comms
        perf util: Fix bad memory access in trace info.
        perf tools: Streamline bpf examples and headers installation
        perf evsel: Fix potential null pointer dereference in perf_evsel__new_idx()
        perf arm64: Fix include path for asm-generic/unistd.h
        perf/hw_breakpoint: Simplify breakpoint enable in perf_event_modify_breakpoint
        perf/hw_breakpoint: Enable breakpoint in modify_user_hw_breakpoint
        perf/hw_breakpoint: Remove superfluous bp->attr.disabled = 0
        ...
      c0be92b5
    • L
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ca062f8d
      Linus Torvalds 提交于
      Pull locking fixes from Ingo Molnar:
       "Misc fixes: liblockdep fixes and ww_mutex fixes"
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic"
        locking/lockdep: Delete unnecessary #include
        tools/lib/lockdep: Add dummy task_struct state member
        tools/lib/lockdep: Add empty nmi.h
        tools/lib/lockdep: Update Sasha Levin email to MSFT
        jump_label: Fix typo in warning message
        locking/mutex: Fix mutex debug call and ww_mutex documentation
      ca062f8d
  5. 15 9月, 2018 14 次提交
  6. 14 9月, 2018 9 次提交
    • L
      Merge tag 'usb-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 1abc088a
      Linus Torvalds 提交于
      Pull USB fixes from Greg KH:
       "Here are a number of small USB driver fixes for -rc4.
      
        The usual suspects of gadget, xhci, and dwc2/3 are in here, along with
        some reverts of reported problem changes, and a number of build
        documentation warning fixes. Full details are in the shortlog.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (28 commits)
        Revert "cdc-acm: implement put_char() and flush_chars()"
        usb: Change usb_of_get_companion_dev() place to usb/common
        usb: xhci: fix interrupt transfer error happened on MTK platforms
        usb: cdc-wdm: Fix a sleep-in-atomic-context bug in service_outstanding_interrupt()
        usb: misc: uss720: Fix two sleep-in-atomic-context bugs
        usb: host: u132-hcd: Fix a sleep-in-atomic-context bug in u132_get_frame()
        usb: Avoid use-after-free by flushing endpoints early in usb_set_interface()
        linux/mod_devicetable.h: fix kernel-doc missing notation for typec_device_id
        usb/typec: fix kernel-doc notation warning for typec_match_altmode
        usb: Don't die twice if PCI xhci host is not responding in resume
        usb: mtu3: fix error of xhci port id when enable U3 dual role
        usb: uas: add support for more quirk flags
        USB: Add quirk to support DJI CineSSD
        usb: typec: fix kernel-doc parameter warning
        usb/dwc3/gadget: fix kernel-doc parameter warning
        USB: yurex: Check for truncation in yurex_read()
        USB: yurex: Fix buffer over-read in yurex_write()
        usb: host: xhci-plat: Iterate over parent nodes for finding quirks
        xhci: Fix use after free for URB cancellation on a reallocated endpoint
        USB: add quirk for WORLDE Controller KS49 or Prodipe MIDI 49C USB controller
        ...
      1abc088a
    • L
      Merge tag 'tty-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · c284cf06
      Linus Torvalds 提交于
      Pull tty fixes from Greg KH:
       "Here are three small HVC tty driver fixes to resolve a reported
        regression from 4.19-rc1.
      
        All of these have been in linux-next for a while with no reported
        issues"
      
      * tag 'tty-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: hvc: hvc_write() fix break condition
        tty: hvc: hvc_poll() fix read loop batching
        tty: hvc: hvc_poll() fix read loop hang
      c284cf06
    • L
      Merge tag 'staging-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 45d9ab8a
      Linus Torvalds 提交于
      Pull staging/IIO driver fixes from Greg KH:
       "Here are a few small staging and iio driver fixes for -rc4.
      
        Nothing major, just a few small bugfixes for some reported issues, and
        a MAINTAINERS file update for the fbtft drivers.
      
        We also re-enable the building of the erofs filesystem as the XArray
        patches that were causing it to break never got merged in the -rc1
        cycle, so there's no reason it can't be turned back on for now. The
        problem that was previously there is now being handled in the Xarray
        tree at the moment, so it will not hit us again in the future.
      
        All of these patches have been in linux-next with no reported issues"
      
      * tag 'staging-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: vboxvideo: Change address of scanout buffer on page-flip
        staging: vboxvideo: Fix IRQs no longer working
        staging: gasket: TODO: re-implement using UIO
        staging/fbtft: Update TODO and mailing lists
        staging: erofs: rename superblock flags (MS_xyz -> SB_xyz)
        iio: imu: st_lsm6dsx: take into account ts samples in wm configuration
        Revert "iio: temperature: maxim_thermocouple: add MAX31856 part"
        Revert "staging: erofs: disable compiling temporarile"
        MAINTAINERS: Switch a maintainer for drivers/staging/gasket
        staging: wilc1000: revert "fix TODO to compile spi and sdio components in single module"
      45d9ab8a
    • L
      Merge tag 'char-misc-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · 319cbacf
      Linus Torvalds 提交于
      Pull char/misc driver fixes from Greg KH:
       "Here are a small handful of char/misc driver fixes for 4.19-rc4.
      
        All of them are simple, resolving reported problems in a few drivers.
        Full details are in the shortlog.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'char-misc-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        firmware: Fix security issue with request_firmware_into_buf()
        vmbus: don't return values for uninitalized channels
        fpga: dfl: fme: fix return value check in in pr_mgmt_init()
        misc: hmc6352: fix potential Spectre v1
        Tools: hv: Fix a bug in the key delete code
        misc: ibmvsm: Fix wrong assignment of return code
        android: binder: fix the race mmap and alloc_new_buf_locked
        mei: bus: need to unlink client before freeing
        mei: bus: fix hw module get/put balance
        mei: fix use-after-free in mei_cl_write
        mei: ignore not found client in the enumeration
      319cbacf
    • J
      Revert "x86/mm/legacy: Populate the user page-table with user pgd's" · 61a6bd83
      Joerg Roedel 提交于
      This reverts commit 1f40a46c.
      
      It turned out that this patch is not sufficient to enable PTI on 32 bit
      systems with legacy 2-level page-tables. In this paging mode the huge-page
      PTEs are in the top-level page-table directory, where also the mirroring to
      the user-space page-table happens. So every huge PTE exits twice, in the
      kernel and in the user page-table.
      
      That means that accessed/dirty bits need to be fetched from two PTEs in
      this mode to be safe, but this is not trivial to implement because it needs
      changes to generic code just for the sake of enabling PTI with 32-bit
      legacy paging. As all systems that need PTI should support PAE anyway,
      remove support for PTI when 32-bit legacy paging is used.
      
      Fixes: 7757d607 ('x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: https://lkml.kernel.org/r/1536922754-31379-1-git-send-email-joro@8bytes.org
      61a6bd83
    • M
      xen/gntdev: fix up blockable calls to mn_invl_range_start · 58a57569
      Michal Hocko 提交于
      Patch series "mmu_notifiers follow ups".
      
      Tetsuo has noticed some fallouts from 93065ac7 ("mm, oom: distinguish
      blockable mode for mmu notifiers").  One of them has been fixed and picked
      up by AMD/DRM maintainer [1].  XEN issue is fixed by patch 1.  I have also
      clarified expectations about blockable semantic of invalidate_range_end.
      Finally the last patch removes MMU_INVALIDATE_DOES_NOT_BLOCK which is no
      longer used nor needed.
      
      [1] http://lkml.kernel.org/r/20180824135257.GU29735@dhcp22.suse.cz
      
      This patch (of 3):
      
      93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers") has
      introduced blockable parameter to all mmu_notifiers and the notifier has
      to back off when called in !blockable case and it could block down the
      road.
      
      The above commit implemented that for mn_invl_range_start but both
      in_range checks are done unconditionally regardless of the blockable mode
      and as such they would fail all the time for regular calls.  Fix this by
      checking blockable parameter as well.
      
      Once we are there we can remove the stale TODO.  The lock has to be
      sleepable because we wait for completion down in gnttab_unmap_refs_sync.
      
      Link: http://lkml.kernel.org/r/20180827112623.8992-2-mhocko@kernel.org
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      58a57569
    • J
      xen: fix GCC warning and remove duplicate EVTCHN_ROW/EVTCHN_COL usage · 4dca864b
      Josh Abraham 提交于
      This patch removes duplicate macro useage in events_base.c.
      
      It also fixes gcc warning:
      variable ‘col’ set but not used [-Wunused-but-set-variable]
      Signed-off-by: NJoshua Abraham <j.abraham1776@gmail.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      4dca864b
    • O
      xen: avoid crash in disable_hotplug_cpu · 3366cdb6
      Olaf Hering 提交于
      The command 'xl vcpu-set 0 0', issued in dom0, will crash dom0:
      
      BUG: unable to handle kernel NULL pointer dereference at 00000000000002d8
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 7 PID: 65 Comm: xenwatch Not tainted 4.19.0-rc2-1.ga9462db-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: Intel Corporation S5520UR/S5520UR, BIOS S5500.86B.01.00.0050.050620101605 05/06/2010
      RIP: e030:device_offline+0x9/0xb0
      Code: 77 24 00 e9 ce fe ff ff 48 8b 13 e9 68 ff ff ff 48 8b 13 e9 29 ff ff ff 48 8b 13 e9 ea fe ff ff 90 66 66 66 66 90 41 54 55 53 <f6> 87 d8 02 00 00 01 0f 85 88 00 00 00 48 c7 c2 20 09 60 81 31 f6
      RSP: e02b:ffffc90040f27e80 EFLAGS: 00010203
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff8801f3800000 RSI: ffffc90040f27e70 RDI: 0000000000000000
      RBP: 0000000000000000 R08: ffffffff820e47b3 R09: 0000000000000000
      R10: 0000000000007ff0 R11: 0000000000000000 R12: ffffffff822e6d30
      R13: dead000000000200 R14: dead000000000100 R15: ffffffff8158b4e0
      FS:  00007ffa595158c0(0000) GS:ffff8801f39c0000(0000) knlGS:0000000000000000
      CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000002d8 CR3: 00000001d9602000 CR4: 0000000000002660
      Call Trace:
       handle_vcpu_hotplug_event+0xb5/0xc0
       xenwatch_thread+0x80/0x140
       ? wait_woken+0x80/0x80
       kthread+0x112/0x130
       ? kthread_create_worker_on_cpu+0x40/0x40
       ret_from_fork+0x3a/0x50
      
      This happens because handle_vcpu_hotplug_event is called twice. In the
      first iteration cpu_present is still true, in the second iteration
      cpu_present is false which causes get_cpu_device to return NULL.
      In case of cpu#0, cpu_online is apparently always true.
      
      Fix this crash by checking if the cpu can be hotplugged, which is false
      for a cpu that was just removed.
      
      Also check if the cpu was actually offlined by device_remove, otherwise
      leave the cpu_present state as it is.
      
      Rearrange to code to do all work with device_hotplug_lock held.
      Signed-off-by: NOlaf Hering <olaf@aepfle.de>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      3366cdb6
    • M
      xen/balloon: add runtime control for scrubbing ballooned out pages · 197ecb38
      Marek Marczykowski-Górecki 提交于
      Scrubbing pages on initial balloon down can take some time, especially
      in nested virtualization case (nested EPT is slow). When HVM/PVH guest is
      started with memory= significantly lower than maxmem=, all the extra
      pages will be scrubbed before returning to Xen. But since most of them
      weren't used at all at that point, Xen needs to populate them first
      (from populate-on-demand pool). In nested virt case (Xen inside KVM)
      this slows down the guest boot by 15-30s with just 1.5GB needed to be
      returned to Xen.
      
      Add runtime parameter to enable/disable it, to allow initially disabling
      scrubbing, then enable it back during boot (for example in initramfs).
      Such usage relies on assumption that a) most pages ballooned out during
      initial boot weren't used at all, and b) even if they were, very few
      secrets are in the guest at that time (before any serious userspace
      kicks in).
      Convert CONFIG_XEN_SCRUB_PAGES to CONFIG_XEN_SCRUB_PAGES_DEFAULT (also
      enabled by default), controlling default value for the new runtime
      switch.
      Signed-off-by: NMarek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      197ecb38