1. 22 6月, 2013 1 次提交
    • A
      aout32 coredump compat fix · 945fb136
      Al Viro 提交于
      dump_seek() does SEEK_CUR, not SEEK_SET; native binfmt_aout
      handles it correctly (seeks by PAGE_SIZE - sizeof(struct user),
      getting the current position to PAGE_SIZE), compat one seeks
      by PAGE_SIZE and ends up at PAGE_SIZE + already written...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      945fb136
  2. 13 6月, 2013 2 次提交
  3. 11 6月, 2013 1 次提交
    • M
      Modify UEFI anti-bricking code · f8b84043
      Matthew Garrett 提交于
      This patch reworks the UEFI anti-bricking code, including an effective
      reversion of cc5a080c and 31ff2f20. It turns out that calling
      QueryVariableInfo() from boot services results in some firmware
      implementations jumping to physical addresses even after entering virtual
      mode, so until we have 1:1 mappings for UEFI runtime space this isn't
      going to work so well.
      
      Reverting these gets us back to the situation where we'd refuse to create
      variables on some systems because they classify deleted variables as "used"
      until the firmware triggers a garbage collection run, which they won't do
      until they reach a lower threshold. This results in it being impossible to
      install a bootloader, which is unhelpful.
      
      Feedback from Samsung indicates that the firmware doesn't need more than
      5KB of storage space for its own purposes, so that seems like a reasonable
      threshold. However, there's still no guarantee that a platform will attempt
      garbage collection merely because it drops below this threshold. It seems
      that this is often only triggered if an attempt to write generates a
      genuine EFI_OUT_OF_RESOURCES error. We can force that by attempting to
      create a variable larger than the remaining space. This should fail, but if
      it somehow succeeds we can then immediately delete it.
      
      I've tested this on the UEFI machines I have available, but I don't have
      a Samsung and so can't verify that it avoids the bricking problem.
      Signed-off-by: NMatthew Garrett <matthew.garrett@nebula.com>
      Signed-off-by: Lee, Chun-Y <jlee@suse.com> [ dummy variable cleanup ]
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      f8b84043
  4. 06 6月, 2013 1 次提交
    • M
      x86/PCI: Map PCI setup data with ioremap() so it can be in highmem · 65694c5a
      Matt Fleming 提交于
      f9a37be0 ("x86: Use PCI setup data") added support for using PCI ROM
      images from setup_data.  This used phys_to_virt(), which is not valid for
      highmem addresses, and can cause a crash when booting a 32-bit kernel via
      the EFI boot stub.
      
      pcibios_add_device() assumes that the physical addresses stored in
      setup_data are accessible via the direct kernel mapping, and that calling
      phys_to_virt() is valid.  This isn't guaranteed to be true on x86 where the
      direct mapping range is much smaller than on x86-64.
      
      Calling phys_to_virt() on a highmem address results in the following:
      
       BUG: unable to handle kernel paging request at 39a3c198
       IP: [<c262be0f>] pcibios_add_device+0x2f/0x90
       ...
       Call Trace:
        [<c2370c73>] pci_device_add+0xe3/0x130
        [<c274640b>] pci_scan_single_device+0x8b/0xb0
        [<c2370d08>] pci_scan_slot+0x48/0x100
        [<c2371904>] pci_scan_child_bus+0x24/0xc0
        [<c262a7b0>] pci_acpi_scan_root+0x2c0/0x490
        [<c23b7203>] acpi_pci_root_add+0x312/0x42f
        ...
      
      The solution is to use ioremap() instead of phys_to_virt() to map the
      setup data into the kernel address space.
      
      [bhelgaas: changelog]
      Tested-by: NJani Nikula <jani.nikula@intel.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Cc: stable@vger.kernel.org	# v3.8+
      65694c5a
  5. 04 6月, 2013 1 次提交
    • K
      xen/smp: Fixup NOHZ per cpu data when onlining an offline CPU. · 466318a8
      Konrad Rzeszutek Wilk 提交于
      The xen_play_dead is an undead function. When the vCPU is told to
      offline it ends up calling xen_play_dead wherin it calls the
      VCPUOP_down hypercall which offlines the vCPU. However, when the
      vCPU is onlined back, it resumes execution right after
      VCPUOP_down hypercall.
      
      That was OK (albeit the API for play_dead assumes that the CPU
      stays dead and never returns) but with commit 4b0c0f29
      (tick: Cleanup NOHZ per cpu data on cpu down) that is no longer safe
      as said commit resets the ts->inidle which at the start of the
      cpu_idle loop was set.
      
      The net effect is that we get this warn:
      
      Broke affinity for irq 16
      installing Xen timer for CPU 1
      cpu 1 spinlock event irq 48
      ------------[ cut here ]------------
      WARNING: at /home/konrad/linux-linus/kernel/time/tick-sched.c:935 tick_nohz_idle_exit+0x195/0x1b0()
      Modules linked in: dm_multipath dm_mod xen_evtchn iscsi_boot_sysfs
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-rc3upstream-00068-gdcdbe33a #1
      Hardware name: BIOSTAR Group N61PB-M2S/N61PB-M2S, BIOS 6.00 PG 09/03/2009
       ffffffff8193b448 ffff880039da5e60 ffffffff816707c8 ffff880039da5ea0
       ffffffff8108ce8b ffff880039da4010 ffff88003fa8e500 ffff880039da4010
       0000000000000001 ffff880039da4000 ffff880039da4010 ffff880039da5eb0
      Call Trace:
       [<ffffffff816707c8>] dump_stack+0x19/0x1b
       [<ffffffff8108ce8b>] warn_slowpath_common+0x6b/0xa0
       [<ffffffff8108ced5>] warn_slowpath_null+0x15/0x20
       [<ffffffff810e4745>] tick_nohz_idle_exit+0x195/0x1b0
       [<ffffffff810da755>] cpu_startup_entry+0x205/0x250
       [<ffffffff81661070>] cpu_bringup_and_idle+0x13/0x15
      ---[ end trace 915c8c486004dda1 ]---
      
      b/c ts_inidle is set to zero. Thomas suggested that we just add a workaround
      to call tick_nohz_idle_enter before returning from xen_play_dead() - and
      that is what this patch does and fixes the issue.
      
      We also add the stable part b/c git commit 4b0c0f29 is on the stable
      tree.
      
      CC: stable@vger.kernel.org
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      466318a8
  6. 03 6月, 2013 3 次提交
  7. 01 6月, 2013 1 次提交
    • Y
      x86: Fix adjust_range_size_mask calling position · 7de3d66b
      Yinghai Lu 提交于
      Commit
      
          8d57470d x86, mm: setup page table in top-down
      
      causes a kernel panic while setting mem=2G.
      
           [mem 0x00000000-0x000fffff] page 4k
           [mem 0x7fe00000-0x7fffffff] page 1G
           [mem 0x7c000000-0x7fdfffff] page 1G
           [mem 0x00100000-0x001fffff] page 4k
           [mem 0x00200000-0x7bffffff] page 2M
      
      for last entry is not what we want, we should have
           [mem 0x00200000-0x3fffffff] page 2M
           [mem 0x40000000-0x7bffffff] page 1G
      
      Actually we merge the continuous ranges with same page size too early.
      in this case, before merging we have
           [mem 0x00200000-0x3fffffff] page 2M
           [mem 0x40000000-0x7bffffff] page 2M
      after merging them, will get
           [mem 0x00200000-0x7bffffff] page 2M
      even we can use 1G page to map
           [mem 0x40000000-0x7bffffff]
      
      that will cause problem, because we already map
           [mem 0x7fe00000-0x7fffffff] page 1G
           [mem 0x7c000000-0x7fdfffff] page 1G
      with 1G page, aka [0x40000000-0x7fffffff] is mapped with 1G page already.
      During phys_pud_init() for [0x40000000-0x7bffffff], it will not
      reuse existing that pud page, and allocate new one then try to use
      2M page to map it instead, as page_size_mask does not include
      PG_LEVEL_1G. At end will have [7c000000-0x7fffffff] not mapped, loop
      in phys_pmd_init stop mapping at 0x7bffffff.
      
      That is right behavoir, it maps exact range with exact page size that
      we ask, and we should explicitly call it to map [7c000000-0x7fffffff]
      before or after mapping 0x40000000-0x7bffffff.
      Anyway we need to make sure ranges' page_size_mask correct and consistent
      after split_mem_range for each range.
      
      Fix that by calling adjust_range_size_mask before merging range
      with same page size.
      
      -v2: update change log.
      -v3: add more explanation why [7c000000-0x7fffffff] is not mapped, and
          it causes panic.
      Bisected-by: N"Xie, ChanglongX" <changlongx.xie@intel.com>
      Bisected-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Reported-and-tested-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1370015587-20835-1-git-send-email-yinghai@kernel.org
      Cc: <stable@vger.kernel.org> v3.9
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      7de3d66b
  8. 31 5月, 2013 2 次提交
    • P
      x86: Allow FPU to be used at interrupt time even with eagerfpu · 5187b28f
      Pekka Riikonen 提交于
      With the addition of eagerfpu the irq_fpu_usable() now returns false
      negatives especially in the case of ksoftirqd and interrupted idle task,
      two common cases for FPU use for example in networking/crypto.  With
      eagerfpu=off FPU use is possible in those contexts.  This is because of
      the eagerfpu check in interrupted_kernel_fpu_idle():
      
      ...
        * For now, with eagerfpu we will return interrupted kernel FPU
        * state as not-idle. TBD: Ideally we can change the return value
        * to something like __thread_has_fpu(current). But we need to
        * be careful of doing __thread_clear_has_fpu() before saving
        * the FPU etc for supporting nested uses etc. For now, take
        * the simple route!
      ...
       	if (use_eager_fpu())
       		return 0;
      
      As eagerfpu is automatically "on" on those CPUs that also have the
      features like AES-NI this patch changes the eagerfpu check to return 1 in
      case the kernel_fpu_begin() has not been said yet.  Once it has been the
      __thread_has_fpu() will start returning 0.
      
      Notice that with eagerfpu the __thread_has_fpu is always true initially.
      FPU use is thus always possible no matter what task is under us, unless
      the state has already been saved with kernel_fpu_begin().
      
      [ hpa: this is a performance regression, not a correctness regression,
        but since it can be quite serious on CPUs which need encryption at
        interrupt time I am marking this for urgent/stable. ]
      Signed-off-by: NPekka Riikonen <priikone@iki.fi>
      Link: http://lkml.kernel.org/r/alpine.GSO.2.00.1305131356320.18@git.silcnet.org
      Cc: <stable@vger.kernel.org> v3.7+
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      5187b28f
    • J
      x86, crc32-pclmul: Fix build with older binutils · 2baad612
      Jan Beulich 提交于
      binutils prior to 2.18 (e.g. the ones found on SLE10) don't support
      assembling PEXTRD, so a macro based approach like the one for PCLMULQDQ
      in the same file should be used.
      
      This requires making the helper macros capable of recognizing 32-bit
      general purpose register operands.
      
      [ hpa: tagging for stable as it is a low risk build fix ]
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Link: http://lkml.kernel.org/r/51A6142A02000078000D99D8@nat28.tlf.novell.com
      Cc: Alexander Boyko <alexander_boyko@xyratex.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org> v3.9
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      2baad612
  9. 29 5月, 2013 2 次提交
    • S
      xen: Clean up apic ipi interface · 1db01b49
      Stefan Bader 提交于
      Commit f447d56d introduced the
      implementation of the PV apic ipi interface. But there were some
      odd things (it seems none of which cause really any issue but
      maybe they should be cleaned up anyway):
       - xen_send_IPI_mask_allbutself (and by that xen_send_IPI_allbutself)
         ignore the passed in vector and only use the CALL_FUNCTION_SINGLE
         vector. While xen_send_IPI_all and xen_send_IPI_mask use the vector.
       - physflat_send_IPI_allbutself is declared unnecessarily. It is never
         used.
      
      This patch tries to clean up those things.
      Signed-off-by: NStefan Bader <stefan.bader@canonical.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      1db01b49
    • Z
      x86-64, init: Fix a possible wraparound bug in switchover in head_64.S · e9d0626e
      Zhang Yanfei 提交于
      In head_64.S, a switchover has been used to handle kernel crossing
      1G, 512G boundaries.
      
      And commit 8170e6be
          x86, 64bit: Use a #PF handler to materialize early mappings on demand
      said:
          During the switchover in head_64.S, before #PF handler is available,
          we use three pages to handle kernel crossing 1G, 512G boundaries with
          sharing page by playing games with page aliasing: the same page is
          mapped twice in the higher-level tables with appropriate wraparound.
      
      But from the switchover code, when we set up the PUD table:
      114         addq    $4096, %rdx
      115         movq    %rdi, %rax
      116         shrq    $PUD_SHIFT, %rax
      117         andl    $(PTRS_PER_PUD-1), %eax
      118         movq    %rdx, (4096+0)(%rbx,%rax,8)
      119         movq    %rdx, (4096+8)(%rbx,%rax,8)
      
      It seems line 119 has a potential bug there. For example,
      if the kernel is loaded at physical address 511G+1008M, that is
          000000000 111111111 111111000 000000000000000000000
      and the kernel _end is 512G+2M, that is
          000000001 000000000 000000001 000000000000000000000
      So in this example, when using the 2nd page to setup PUD (line 114~119),
      rax is 511.
      In line 118, we put rdx which is the address of the PMD page (the 3rd page)
      into entry 511 of the PUD table. But in line 119, the entry we calculate from
      (4096+8)(%rbx,%rax,8) has exceeded the PUD page. IMO, the entry in line
      119 should be wraparound into entry 0 of the PUD table.
      
      The patch fixes the bug.
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Link: http://lkml.kernel.org/r/5191DE5A.3020302@cn.fujitsu.comSigned-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: <stable@vger.kernel.org> v3.9
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      e9d0626e
  10. 28 5月, 2013 1 次提交
  11. 21 5月, 2013 2 次提交
  12. 15 5月, 2013 1 次提交
    • J
      time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizaitons · b4f711ee
      John Stultz 提交于
      Kay Sievers noted that the ALWAYS_USE_PERSISTENT_CLOCK config,
      which enables some minor compile time optimization to avoid
      uncessary code in mostly the suspend/resume path could cause
      problems for userland.
      
      In particular, the dependency for RTC_HCTOSYS on
      !ALWAYS_USE_PERSISTENT_CLOCK, which avoids setting the time
      twice and simplifies suspend/resume, has the side effect
      of causing the /sys/class/rtc/rtcN/hctosys flag to always be
      zero, and this flag is commonly used by udev to setup the
      /dev/rtc symlink to /dev/rtcN, which can cause pain for
      older applications.
      
      While the udev rules could use some work to be less fragile,
      breaking userland should strongly be avoided. Additionally
      the compile time optimizations are fairly minor, and the code
      being optimized is likely to be reworked in the future, so
      lets revert this change.
      Reported-by: NKay Sievers <kay@vrfy.org>
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: stable <stable@vger.kernel.org> #3.9
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Link: http://lkml.kernel.org/r/1366828376-18124-1-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      b4f711ee
  13. 14 5月, 2013 1 次提交
    • L
      x86, efi: initial the local variable of DataSize to zero · eccaf52f
      Lee, Chun-Yi 提交于
      That will be better initial the value of DataSize to zero for the input of
      GetVariable(), otherwise we will feed a random value. The debug log of input
      DataSize like this:
      
      ...
      [  195.915612] EFI Variables Facility v0.08 2004-May-17
      [  195.915819] efi: size: 18446744071581821342
      [  195.915969] efi:  size': 18446744071581821342
      [  195.916324] efi: size: 18446612150714306560
      [  195.916632] efi:  size': 18446612150714306560
      [  195.917159] efi: size: 18446612150714306560
      [  195.917453] efi:  size': 18446612150714306560
      ...
      
      The size' is value that was returned by BIOS.
      
      After applied this patch:
      [   82.442042] EFI Variables Facility v0.08 2004-May-17
      [   82.442202] efi: size: 0
      [   82.442360] efi:  size': 1039
      [   82.443828] efi: size: 0
      [   82.444127] efi:  size': 2616
      [   82.447057] efi: size: 0
      [   82.447356] efi:  size': 5832
      ...
      
      Found on Acer Aspire V3 BIOS, it will not return the size of data if we input a
      non-zero DataSize.
      
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLee, Chun-Yi <jlee@suse.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      eccaf52f
  14. 10 5月, 2013 4 次提交
  15. 09 5月, 2013 5 次提交
    • P
      KVM: emulator: emulate SALC · 326f578f
      Paolo Bonzini 提交于
      This is an almost-undocumented instruction available in 32-bit mode.
      I say "almost" undocumented because AMD documents it in their opcode
      maps just to say that it is unavailable in 64-bit mode (sections
      "A.2.1 One-Byte Opcodes" and "B.3 Invalid and Reassigned Instructions
      in 64-Bit Mode").
      
      It is roughly equivalent to "sbb %al, %al" except it does not
      set the flags.  Use fastop to emulate it, but do not use the opcode
      directly because it would fail if the host is 64-bit!
      Reported-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@vger.kernel.org # 3.9
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      326f578f
    • P
      KVM: emulator: emulate XLAT · 7fa57952
      Paolo Bonzini 提交于
      This is used by SGABIOS, KVM breaks with emulate_invalid_guest_state=1.
      It is just a MOV in disguise, with a funny source address.
      Reported-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@vger.kernel.org # 3.9
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      7fa57952
    • P
      KVM: emulator: emulate AAM · a035d5c6
      Paolo Bonzini 提交于
      This is used by SGABIOS, KVM breaks with emulate_invalid_guest_state=1.
      
      AAM needs the source operand to be unsigned; do the same in AAD as well
      for consistency, even though it does not affect the result.
      Reported-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: stable@vger.kernel.org # 3.9
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      a035d5c6
    • K
      x86/microcode: Add local mutex to fix physical CPU hot-add deadlock · 074d72ff
      Konrad Rzeszutek Wilk 提交于
      This can easily be triggered if a new CPU is added (via
      ACPI hotplug mechanism) and from user-space you do:
      
         echo 1 > /sys/devices/system/cpu/cpu3/online
      
      (or wait for UDEV to do it) on a newly appeared physical CPU.
      
      The deadlock is that the "store_online" in drivers/base/cpu.c
      takes the cpu_hotplug_driver_lock() lock, then calls "cpu_up".
      "cpu_up" eventually ends up calling "save_mc_for_early"
      which also takes the cpu_hotplug_driver_lock() lock.
      
      And here is that lockdep thinks of it:
      
       smpboot: Stack at about ffff880075c39f44
       smpboot: CPU3: has booted.
       microcode: CPU3 sig=0x206a7, pf=0x2, revision=0x25
      
       =============================================
       [ INFO: possible recursive locking detected ]
       3.9.0upstream-10129-g167af0e #1 Not tainted
       ---------------------------------------------
       sh/2487 is trying to acquire lock:
        (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [<ffffffff81075512>] cpu_hotplug_driver_lock+0x12/0x20
      
       but task is already holding lock:
        (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [<ffffffff81075512>] cpu_hotplug_driver_lock+0x12/0x20
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(x86_cpu_hotplug_driver_mutex);
         lock(x86_cpu_hotplug_driver_mutex);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       6 locks held by sh/2487:
        #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff811ca48d>] vfs_write+0x17d/0x190
        #1:  (&buffer->mutex){+.+.+.}, at: [<ffffffff812464ef>] sysfs_write_file+0x3f/0x160
        #2:  (s_active#20){.+.+.+}, at: [<ffffffff81246578>] sysfs_write_file+0xc8/0x160
        #3:  (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [<ffffffff81075512>] cpu_hotplug_driver_lock+0x12/0x20
        #4:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810961c2>] cpu_maps_update_begin+0x12/0x20
        #5:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff810962a7>] cpu_hotplug_begin+0x27/0x60
      Suggested-and-Acked-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: fenghua.yu@intel.com
      Cc: xen-devel@lists.xensource.com
      Cc: stable@vger.kernel.org # for v3.9
      Link: http://lkml.kernel.org/r/1368029583-23337-1-git-send-email-konrad.wilk@oracle.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      074d72ff
    • G
      KVM: VMX: fix halt emulation while emulating invalid guest sate · 8d76c49e
      Gleb Natapov 提交于
      The invalid guest state emulation loop does not check halt_request
      which causes 100% cpu loop while guest is in halt and in invalid
      state, but more serious issue is that this leaves halt_request set, so
      random instruction emulated by vm86 #GP exit can be interpreted
      as halt which causes guest hang. Fix both problems by handling
      halt_request in emulation loop.
      Reported-by: NTomas Papan <tomas.papan@gmail.com>
      Tested-by: NTomas Papan <tomas.papan@gmail.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      CC: stable@vger.kernel.org
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      8d76c49e
  16. 08 5月, 2013 4 次提交
  17. 07 5月, 2013 3 次提交
  18. 06 5月, 2013 1 次提交
    • K
      xen/vcpu/pvhvm: Fix vcpu hotplugging hanging. · 7f1fc268
      Konrad Rzeszutek Wilk 提交于
      If a user did:
      
      	echo 0 > /sys/devices/system/cpu/cpu1/online
      	echo 1 > /sys/devices/system/cpu/cpu1/online
      
      we would (this a build with DEBUG enabled) get to:
      smpboot: ++++++++++++++++++++=_---CPU UP  1
      .. snip..
      smpboot: Stack at about ffff880074c0ff44
      smpboot: CPU1: has booted.
      
      and hang. The RCU mechanism would kick in an try to IPI the CPU1
      but the IPIs (and all other interrupts) would never arrive at the
      CPU1. At first glance at least. A bit digging in the hypervisor
      trace shows that (using xenanalyze):
      
      [vla] d4v1 vec 243 injecting
         0.043163027 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
      ]  0.043163639 --|x d4v1 vmentry cycles 1468
      ]  0.043164913 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
         0.043164913 --|x d4v1 inj_virq vec 243  real
        [vla] d4v1 vec 243 injecting
         0.043164913 --|x d4v1 intr_window vec 243 src 5(vector) intr f3
      ]  0.043165526 --|x d4v1 vmentry cycles 1472
      ]  0.043166800 --|x d4v1 vmexit exit_reason PENDING_INTERRUPT eip ffffffff81673254
         0.043166800 --|x d4v1 inj_virq vec 243  real
        [vla] d4v1 vec 243 injecting
      
      there is a pending event (subsequent debugging shows it is the IPI
      from the VCPU0 when smpboot.c on VCPU1 has done
      "set_cpu_online(smp_processor_id(), true)") and the guest VCPU1 is
      interrupted with the callback IPI (0xf3 aka 243) which ends up calling
      __xen_evtchn_do_upcall.
      
      The __xen_evtchn_do_upcall seems to do *something* but not acknowledge
      the pending events. And the moment the guest does a 'cli' (that is the
      ffffffff81673254 in the log above) the hypervisor is invoked again to
      inject the IPI (0xf3) to tell the guest it has pending interrupts.
      This repeats itself forever.
      
      The culprit was the per_cpu(xen_vcpu, cpu) pointer. At the bootup
      we set each per_cpu(xen_vcpu, cpu) to point to the
      shared_info->vcpu_info[vcpu] but later on use the VCPUOP_register_vcpu_info
      to register per-CPU  structures (xen_vcpu_setup).
      This is used to allow events for more than 32 VCPUs and for performance
      optimizations reasons.
      
      When the user performs the VCPU hotplug we end up calling the
      the xen_vcpu_setup once more. We make the hypercall which returns
      -EINVAL as it does not allow multiple registration calls (and
      already has re-assigned where the events are being set). We pick
      the fallback case and set per_cpu(xen_vcpu, cpu) to point to the
      shared_info->vcpu_info[vcpu] (which is a good fallback during bootup).
      However the hypervisor is still setting events in the register
      per-cpu structure (per_cpu(xen_vcpu_info, cpu)).
      
      As such when the events are set by the hypervisor (such as timer one),
      and when we iterate in __xen_evtchn_do_upcall we end up reading stale
      events from the shared_info->vcpu_info[vcpu] instead of the
      per_cpu(xen_vcpu_info, cpu) structures. Hence we never acknowledge the
      events that the hypervisor has set and the hypervisor keeps on reminding
      us to ack the events which we never do.
      
      The fix is simple. Don't on the second time when xen_vcpu_setup is
      called over-write the per_cpu(xen_vcpu, cpu) if it points to
      per_cpu(xen_vcpu_info).
      Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      CC: stable@vger.kernel.org
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      7f1fc268
  19. 05 5月, 2013 1 次提交
  20. 04 5月, 2013 2 次提交
  21. 03 5月, 2013 1 次提交