1. 23 Jul, 2017 1 commit
  2. 13 Jun, 2017 1 commit
  3. 02 May, 2017 8 commits
  4. 16 Mar, 2017 1 commit
    • T
      x86: Remap GDT tables in the fixmap section · 69218e47
      Committed by Thomas Garnier
      Each processor holds a GDT in its per-cpu structure. The sgdt
      instruction gives the base address of the current GDT. This address can
      be used to bypass KASLR memory randomization. With another bug, an
      attacker could target other per-cpu structures or deduce the base of
      the main memory section (PAGE_OFFSET).
      
      This patch relocates the GDT table for each processor inside the
      fixmap section. The space is reserved based on number of supported
      processors.
      
      For consistency, the remapping is done by default on 32 and 64-bit.
      
      Each processor switches to its remapped GDT at the end of
      initialization. For hibernation, the main processor returns with the
      original GDT and switches back to the remapping at completion.
      
      This patch was tested on both architectures. Hibernation and KVM were
      both tested specially for their usage of the GDT.
      
      Thanks to Boris Ostrovsky <boris.ostrovsky@oracle.com> for testing and
      recommending changes for Xen support.
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Luis R . Rodriguez <mcgrof@kernel.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: kasan-dev@googlegroups.com
      Cc: kernel-hardening@lists.openwall.com
      Cc: kvm@vger.kernel.org
      Cc: lguest@lists.ozlabs.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-pm@vger.kernel.org
      Cc: xen-devel@lists.xenproject.org
      Cc: zijun_hu <zijun_hu@htc.com>
      Link: http://lkml.kernel.org/r/20170314170508.100882-2-thgarnie@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 02 Mar, 2017 1 commit
  6. 07 Feb, 2017 1 commit
  7. 13 Dec, 2016 1 commit
    • T
      x86/smpboot: Make logical package management more robust · 9d85eb91
      Committed by Thomas Gleixner
      The logical package management has several issues:
      
       - The APIC ids provided by ACPI are not required to be the same as the
         initial APIC id which can be retrieved by CPUID. The APIC ids provided
         by ACPI are those which are written by the BIOS into the APIC. The
         initial id is set by hardware and can not be changed. The hardware
         provided ids contain the real hardware package information.
      
         Especially AMD sets the effective APIC id different from the hardware id
         as they need to reserve space for the IOAPIC ids starting at id 0.
      
         As a consequence those machines trigger the currently active firmware
         bug printouts in dmesg. These are obviously wrong.
      
       - Virtual machines have their own interesting ways of enumerating APICs and
         packages which are not reliably covered by the current implementation.
      
      The sizing of the mapping array has been tweaked to be generously large to
      handle systems which provide a wrong core count when HT is disabled so the
      whole magic which checks for space in the physical hotplug case is not
      needed anymore.
      
      Simplify the whole machinery and do the mapping when the CPU starts and the
      CPUID derived physical package information is available. This solves the
      observed problems on AMD machines and works for the virtualization issues
      as well.
      
      Remove the extra call from XEN cpu bringup code as it is no longer
      required.
      
      Fixes: d49597fd ("x86/cpu: Deal with broken firmware (VMWare/XEN)")
      Reported-and-tested-by: Borislav Petkov <bp@suse.de>
      Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: M. Vefa Bicakci <m.v.b@runbox.com>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Cc: Charles (Chas) Williams <ciwillia@brocade.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1612121102260.3429@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  8. 06 Oct, 2016 1 commit
    • B
      xen/x86: Update topology map for PV VCPUs · a6a198bc
      Committed by Boris Ostrovsky
      Early during boot topology_update_package_map() computes
      logical_pkg_ids for all present processors.
      
      Later, when processors are brought up, identify_cpu() updates
      these values based on phys_pkg_id which is a function of
      initial_apicid. On PV guests the latter may point to a
      non-existing node, causing logical_pkg_ids to be set to -1.
      
      Intel's RAPL uses logical_pkg_id (as topology_logical_package_id())
      to index its arrays and therefore in this case will point to index
      65535 (since logical_pkg_id is a u16). This could lead either to a
      crash or to an access to a random memory location.
      
      As a workaround, we recompute topology during CPU bringup to reset
      logical_pkg_id to a valid value.
      
      (The reason for initial_apicid being bogus is because it is
      initial_apicid of the processor from which the guest is launched.
      This value is CPUID(1).EBX[31:24])
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
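The wraparound described in a6a198bc can be reproduced in a few lines of user-space C. This is only a minimal sketch: `store_pkg_id`, `pkg_id_is_valid` and `MAX_PACKAGES` are hypothetical names invented for illustration and are not kernel symbols. It shows that a "not found" value of -1 stored in a 16-bit unsigned field becomes 65535, which is far outside any small per-package array.

```c
#include <stdint.h>

/* Hypothetical array size, chosen only for this illustration. */
#define MAX_PACKAGES 4

/* Mimics storing a signed "not found" result (-1) in a u16 field. */
static uint16_t store_pkg_id(int id)
{
    return (uint16_t)id; /* -1 wraps to 65535 */
}

/* Returns 1 if the stored id can safely index an array of MAX_PACKAGES. */
static int pkg_id_is_valid(uint16_t id)
{
    return id < MAX_PACKAGES;
}
```

A consumer that indexes an array with the stored id needs a bounds check such as `pkg_id_is_valid()`, or the id has to be recomputed to a valid value first, which is the workaround the commit message describes.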
  9. 30 Sep, 2016 1 commit
    • K
      xen: Remove event channel notification through Xen PCI platform device · 72a9b186
      Committed by KarimAllah Ahmed
      Ever since commit 254d1a3f ("xen/pv-on-hvm kexec: shutdown watches
      from old kernel") using the INTx interrupt from Xen PCI platform
      device for event channel notification would just lock up the guest
      during bootup.  postcore_initcall now calls xs_reset_watches which
      will eventually try to read a value from XenStore and will get stuck
      on read_reply at XenBus forever since the platform driver is not
      probed yet and its INTx interrupt handler is not registered yet. That
      means that the guest can not be notified at this moment of any pending
      event channels and none of the per-event handlers will ever be invoked
      (including the XenStore one) and the reply will never be picked up by
      the kernel.
      
      The exact stack where things get stuck during xenbus_init:
      
      -xenbus_init
       -xs_init
        -xs_reset_watches
         -xenbus_scanf
          -xenbus_read
           -xs_single
            -xs_single
             -xs_talkv
      
      Vector callbacks have always been the favourite event notification
      mechanism since their introduction in commit 38e20b07 ("x86/xen:
      event channels delivery on HVM.") and the vector callback feature has
      been advertised by Xen for quite some time. That is why INTx has been
      broken for several years without impacting anyone.
      
      Luckily this also means that event channel notification through INTx
      is basically dead code which can be safely removed without impacting
      anybody since it has been effectively disabled for more than 4 years
      with nobody complaining about it (at least as far as I'm aware).
      
      This commit removes event channel notification through Xen PCI
      platform device.
      
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Julien Grall <julien.grall@citrix.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Ross Lagerwall <ross.lagerwall@citrix.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-pci@vger.kernel.org
      Cc: Anthony Liguori <aliguori@amazon.com>
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
  10. 25 Aug, 2016 1 commit
  11. 25 Jul, 2016 2 commits
    • V
      xen/pvhvm: run xen_vcpu_setup() for the boot CPU · ee42d665
      Committed by Vitaly Kuznetsov
      Historically we didn't call VCPUOP_register_vcpu_info for CPU0 for
      PVHVM guests (while we had it for PV and ARM guests). This is usually
      fine as we can use vcpu info in the shared_info page but when we try
      booting on a vCPU with Xen's vCPU id > 31 (e.g. when we try to kdump
      after crashing on this CPU) we're not able to boot.
      
      Switch to always doing VCPUOP_register_vcpu_info for the boot CPU.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    • V
      x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op · ad5475f9
      Committed by Vitaly Kuznetsov
      HYPERVISOR_vcpu_op() passes Linux's idea of vCPU id as a parameter
      while Xen's idea is expected. In some cases these ideas diverge so we
      need to do remapping.
      
      Convert all callers of HYPERVISOR_vcpu_op() to use xen_vcpu_nr().
      
      Leave xen_fill_possible_map() and xen_filter_cpu_maps() intact as
      they're only being called by PV guests before percpu areas are
      initialized. While the issue could be solved by switching to
      early_percpu for xen_vcpu_id I think it's not worth it: PV guests will
      probably never get to the point where their idea of vCPU id diverges
      from Xen's.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
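The remapping idea in ad5475f9 can be sketched in user-space C as follows. Everything here is a stand-in under stated assumptions: `sketch_xen_vcpu_nr()` plays the role of `xen_vcpu_nr()`, `sketch_vcpu_op()` plays the role of `HYPERVISOR_vcpu_op()`, and the mapping table values are made up. The point is only that the caller translates the Linux CPU number before handing it to the hypervisor.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 4 /* hypothetical CPU count */

/* Hypothetical table: index = Linux CPU number, value = Xen vCPU id. */
static const uint32_t sketch_vcpu_id_map[SKETCH_NR_CPUS] = { 0, 2, 4, 6 };

/* Stand-in for xen_vcpu_nr(): translate Linux's id into Xen's id. */
static uint32_t sketch_xen_vcpu_nr(int cpu)
{
    return sketch_vcpu_id_map[cpu];
}

/* Records the id the fake hypercall was invoked with. */
static uint32_t sketch_last_vcpu_seen;

/* Stand-in for HYPERVISOR_vcpu_op(): expects Xen's idea of the vCPU id. */
static int sketch_vcpu_op(int cmd, uint32_t xen_vcpu_id)
{
    (void)cmd;
    sketch_last_vcpu_seen = xen_vcpu_id;
    return 0;
}

/* Converted caller: remaps the id instead of passing the Linux number. */
static int sketch_bring_up(int cpu)
{
    return sketch_vcpu_op(0, sketch_xen_vcpu_nr(cpu));
}
```

With the table above, bringing up Linux CPU 1 reaches the stand-in hypercall with id 2 rather than 1.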
  12. 29 Mar, 2016 1 commit
  13. 02 Mar, 2016 1 commit
    • T
      arch/hotplug: Call into idle with a proper state · fc6d73d6
      Committed by Thomas Gleixner
      Let the non boot cpus call into idle with the corresponding hotplug state, so
      the hotplug core can handle the further bringup. That's a first step to
      convert the boot side of the hotplugged cpus to do all the synchronization
      with the other side through the state machine. For now it'll only start the
      hotplug thread and kick the full bringup of the cpu.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/20160226182341.614102639@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 09 Sep, 2015 1 commit
  15. 20 Aug, 2015 1 commit
    • B
      xen/PMU: Initialization code for Xen PMU · 65d0cf0b
      Committed by Boris Ostrovsky
      Map shared data structure that will hold CPU registers, VPMU context,
      V/PCPU IDs of the CPU interrupted by PMU interrupt. Hypervisor fills
      this information in its handler and passes it to the guest for further
      processing.
      
      Set up PMU VIRQ.
      
      Now that perf infrastructure will assume that PMU is available on a PV
      guest we need to be careful and make sure that accesses via RDPMC
      instruction don't cause fatal traps by the hypervisor. Provide a nop
      RDPMC handler.
      
      For the same reason avoid issuing a warning on a write to APIC's LVTPC.
      
      Both of these will be made functional in later patches.
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
  16. 02 Apr, 2015 1 commit
  17. 25 Mar, 2015 1 commit
    • D
      x86/asm/entry: Get rid of KERNEL_STACK_OFFSET · ef593260
      Committed by Denys Vlasenko
      PER_CPU_VAR(kernel_stack) was set up in a way where it points
      five stack slots below the top of stack.
      
      Presumably, it was done to avoid one "sub $5*8,%rsp"
      in syscall/sysenter code paths, where iret frame needs to be
      created by hand.
      
      Ironically, none of them benefits from this optimization,
      since all of them need to allocate additional data on stack
      (struct pt_regs), so they still have to perform subtraction.
      
      This patch eliminates KERNEL_STACK_OFFSET.
      
      PER_CPU_VAR(kernel_stack) now points directly to top of stack.
      pt_regs allocations are adjusted to allocate iret frame as well.
      Hopefully we can merge it later with 32-bit specific
      PER_CPU_VAR(cpu_current_top_of_stack) variable...
      
      Net result in generated code is that constants in several insns
      are changed.
      
      This change is necessary for changing struct pt_regs creation
      in SYSCALL64 code path from MOV to PUSH instructions.
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1426785469-15125-2-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
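The arithmetic behind ef593260 can be illustrated with a small user-space sketch. The function names and the `extra` parameter (the bytes of register state saved below the iret frame) are hypothetical and not taken from the kernel. The only facts assumed are the ones in the commit message: the old pointer sat five 8-byte slots below the top of stack, and every entry path still had to subtract more to make room for the rest of struct pt_regs.

```c
#include <stdint.h>

#define SLOT_BYTES 8        /* one 64-bit stack slot */
#define IRET_FRAME_SLOTS 5  /* ss, rsp, rflags, cs, rip */
#define OLD_KERNEL_STACK_OFFSET (IRET_FRAME_SLOTS * SLOT_BYTES)

/* Old scheme: kernel_stack points five slots below the top of stack. */
static uintptr_t old_kernel_stack(uintptr_t top)
{
    return top - OLD_KERNEL_STACK_OFFSET;
}

/* New scheme: kernel_stack points directly at the top of stack. */
static uintptr_t new_kernel_stack(uintptr_t top)
{
    return top;
}

/* Old entry path: still subtracts `extra` bytes for the rest of pt_regs. */
static uintptr_t old_regs_base(uintptr_t top, uintptr_t extra)
{
    return old_kernel_stack(top) - extra;
}

/* New entry path: one subtraction that also covers the iret frame. */
static uintptr_t new_regs_base(uintptr_t top, uintptr_t extra)
{
    return new_kernel_stack(top) - (extra + OLD_KERNEL_STACK_OFFSET);
}
```

Both paths end up at the same address and both perform exactly one subtraction, so the pre-applied offset saved nothing; only the constant in the instruction differs, which matches the "constants in several insns are changed" remark.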
  18. 12 Mar, 2015 1 commit
    • P
      x86: Use common outgoing-CPU-notification code · 2a442c9c
      Committed by Paul E. McKenney
      This commit replaces the open-coded CPU-offline notification with new
      common code.  Among other things, this change avoids calling scheduler
      code using RCU from an offline CPU that RCU is ignoring.  It also allows
      Xen to notice at online time that the CPU did not go offline correctly.
      Note that Xen has the surviving CPU carry out some cleanup operations,
      so if the surviving CPU times out, these cleanup operations might have
      been carried out while the outgoing CPU was still running.  It might
      therefore be unwise to bring this CPU back online, and this commit
      avoids doing so.
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: <x86@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: <xen-devel@lists.xenproject.org>
  19. 26 Jan, 2015 1 commit
  20. 10 Nov, 2014 1 commit
  21. 06 Oct, 2014 1 commit
  22. 16 Sep, 2014 1 commit
  23. 15 Apr, 2014 1 commit
    • B
      x86/xen: Fix 32-bit PV guests's usage of kernel_stack · 4461bbc0
      Committed by Boris Ostrovsky
      Commit 198d208d ("x86: Keep
      thread_info on thread stack in x86_32") made 32-bit kernels use
      kernel_stack to point to thread_info. That change missed a couple of
      updates needed by Xen's 32-bit PV guests:
      
      1. kernel_stack needs to be initialized for secondary CPUs
      
      2. GET_THREAD_INFO() now uses %fs register which may not be the
         kernel's version when executing xen_iret().
      
      With respect to the second issue, we don't need GET_THREAD_INFO()
      anymore: we used it as an intermediate step to get to per_cpu xen_vcpu
      and avoid referencing %fs. Now that we are going to use %fs anyway we
      may as well go directly to xen_vcpu.
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
  24. 22 Jan, 2014 1 commit
    • R
      xen/pvh: Set X86_CR0_WP and others in CR0 (v2) · c9f6e997
      Committed by Roger Pau Monne
      Otherwise some user-space applications that use 'clone' with
      CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID end up hitting an
      assert in glibc, manifested by:
      
      general protection ip:7f80720d364c sp:7fff98fd8a80 error:0 in
      libc-2.13.so[7f807209e000+180000]
      
      This is due to the nature of said operations which sets and clears
      the PID.  "In the successful one I can see that the page table of
      the parent process has been updated successfully to use a
      different physical page, so the write of the tid on
      that page only affects the child...
      
      On the other hand, in the failed case, the write seems to happen before
      the copy of the original page is done, so both the parent and the child
      end up with the same value (because the parent copies the page after
      the write of the child tid has already happened)."
      (Roger's analysis). This is caused by Xen commit
      51e2cac257ec8b4080d89f0855c498cbbd76a5e5
      "x86/pvh: set only minimal cr0 and cr4 flags in order to use paging",
      which removed CR0_WP, so the COW features of the Linux kernel were not
      operating properly.
      
      While doing that also update the rest of the CR0 flags to be in line
      with what a baremetal Linux kernel would set them to.
      
      'secondary_startup_64' (baremetal Linux) sets:
      
      X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP |
      X86_CR0_AM | X86_CR0_PG
      
      The hypervisor for HVM type guests (which PVH partly is) sets:
      X86_CR0_PE | X86_CR0_ET | X86_CR0_TS
      For PVH it specifically sets:
      X86_CR0_PG
      
      Which means we need to set the rest: X86_CR0_MP | X86_CR0_NE  |
      X86_CR0_WP | X86_CR0_AM to have full parity.
      Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
      Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [v1: Took out the cr4 writes to be a separate patch]
      [v2: 0-DAY kernel found xen_setup_gdt to be missing a static]
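The flag bookkeeping in c9f6e997 can be checked mechanically. The sketch below uses the architectural CR0 bit positions (PE=0, MP=1, TS=3, ET=4, NE=5, WP=16, AM=18, PG=31), which come from the x86 architecture definition rather than from the commit message; `cr0_missing` and the three `*_CR0` constants are names made up for this example.

```c
#include <stdint.h>

/* Architectural CR0 bit positions. */
#define X86_CR0_PE (UINT32_C(1) << 0)  /* Protection Enable */
#define X86_CR0_MP (UINT32_C(1) << 1)  /* Monitor Coprocessor */
#define X86_CR0_TS (UINT32_C(1) << 3)  /* Task Switched */
#define X86_CR0_ET (UINT32_C(1) << 4)  /* Extension Type */
#define X86_CR0_NE (UINT32_C(1) << 5)  /* Numeric Error */
#define X86_CR0_WP (UINT32_C(1) << 16) /* Write Protect */
#define X86_CR0_AM (UINT32_C(1) << 18) /* Alignment Mask */
#define X86_CR0_PG (UINT32_C(1) << 31) /* Paging */

/* What baremetal Linux sets in secondary_startup_64, per the message. */
#define BAREMETAL_CR0 (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | X86_CR0_NE | \
                       X86_CR0_WP | X86_CR0_AM | X86_CR0_PG)
/* What the hypervisor sets for HVM type guests, per the message. */
#define XEN_HVM_CR0 (X86_CR0_PE | X86_CR0_ET | X86_CR0_TS)
/* What the hypervisor additionally sets for PVH, per the message. */
#define XEN_PVH_CR0 (X86_CR0_PG)

/* Bits that are wanted but not yet present. */
static uint32_t cr0_missing(uint32_t want, uint32_t have)
{
    return want & ~have;
}
```

Evaluating `cr0_missing(BAREMETAL_CR0, XEN_HVM_CR0 | XEN_PVH_CR0)` yields exactly `X86_CR0_MP | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM`, the set the commit says the guest has to add. The sketch only computes bits to add; it does not model X86_CR0_TS, which the hypervisor sets but the baremetal list omits.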
  25. 06 Jan, 2014 1 commit
    • M
      xen/pvh: Secondary VCPU bringup (non-bootup CPUs) · 5840c84b
      Committed by Mukesh Rathor
      The VCPU bringup protocol follows the PV with certain twists.
      From xen/include/public/arch-x86/xen.h:
      
      Also note that when calling DOMCTL_setvcpucontext and VCPU_initialise
      for HVM and PVH guests, not all information in this structure is updated:
      
       - For HVM guests, the structures read include: fpu_ctxt (if
       VGCT_I387_VALID is set), flags, user_regs, debugreg[*]
      
       - PVH guests are the same as HVM guests, but additionally use ctrlreg[3] to
       set cr3. All other fields not used should be set to 0.
      
      This is what we do. We piggyback on the 'xen_setup_gdt' - but modify
      a bit - we need to call 'load_percpu_segment' so that 'switch_to_new_gdt'
      can load per-cpu data-structures. It has no effect on the VCPU0.
      
      We also piggyback on the %rdi register to pass in the CPU number - so
      that when we bootup a new CPU, the cpu_bringup_and_idle will have
      passed as the first parameter the CPU number (via %rdi for 64-bit).
      Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  26. 07 Nov, 2013 1 commit
  27. 10 Oct, 2013 1 commit
    • F
      xen: Fix possible user space selector corruption · 7cde9b27
      Committed by Frediano Ziglio
      Due to the way the kernel is initialized under Xen, it is possible that
      the ring1 selector used by the kernel for the boot cpu ends up being
      copied to userspace, leading to a segmentation fault in userspace.
      
      Xen code in the kernel initializes non-boot cpus with correct selectors (ds
      and es set to __USER_DS) but the boot one keeps the ring1 selector (passed
      by Xen). On task context switch (switch_to) we assume that ds, es and cs
      already point to __USER_DS and __KERNEL_CS, so these selectors are not
      changed.
      
      If the processor is an Intel one that supports the sysenter instruction,
      sysenter/sysexit is used, so ds and es are not restored when switching back
      from kernel to userspace. If the selectors point to ring1 instead of
      __USER_DS, the userspace code will crash on the first memory access attempt
      (to be precise, Xen, on the emulated iret used to do sysexit, will detect
      this and set ds and es to zero, which leads to a GPF anyway).
      
      Now if a userspace process calls the kernel using sysenter and gets
      rescheduled (for me it happens on a specific init calling wait4), it could
      happen that the ring1 selector is set in ds and es.
      
      This is quite hard to detect because after a while these selectors are
      fixed (__USER_DS seems sticky).
      
      Bisecting the code, commit 7076aada appears to be the first one that has
      this issue.
      Signed-off-by: Frediano Ziglio <frediano.ziglio@citrix.com>
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
  28. 10 Sep, 2013 2 commits
    • K
      xen/smp: Update pv_lock_ops functions before alternative code starts under PVHVM · 26a79995
      Committed by Konrad Rzeszutek Wilk
      Before this patch we would patch all of the pv_lock_ops sites
      using alternative assembler. Then later in the bootup cycle we would
      change unlock_kick and lock_spinning to the Xen specific ones,
      without re-patching.
      
      That meant that for the core of the kernel we would be running
      with the baremetal version of unlock_kick and lock_spinning while
      for modules we would have the proper Xen specific slowpaths.
      
      As most modules use some API from the core kernel, that ended
      up with slowpath lockers waiting forever to be kicked (b/c they
      would be using the Xen specific slowpath logic). And the
      kick never came b/c the unlock path that was taken was the
      baremetal one.
      
      On PV we do not have the problem as we initialise before the
      alternative code kicks in.
      
      The fix is to make the updating of the pv_lock_ops function
      be done before the alternative code starts patching.
      
      Note that this patch fixes issues discovered by commit f10cd522
      ("xen: disable PV spinlocks on HVM"), which mentioned:
      
         PV spinlocks cannot possibly work with the current code because they are
         enabled after pvops patching has already been done, and because PV
         spinlocks use a different data structure than native spinlocks so we
         cannot switch between them dynamically.
      
      The first problem is solved by this patch.
      
      The second problem has been solved by commit
      816434ec
      (Merge branch 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip)
      
      P.S.
      There is still the commit 70dd4998
      (xen/spinlock: Disable IRQ spinlock (PV) allocation on PVHVM) to
      revert but that can be done later after all other bugs have been
      fixed.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
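The ordering problem in 26a79995 can be modelled with plain function pointers. This is a loose user-space analogy, not how the kernel's alternatives patching is implemented: `sketch_patch()` stands for "patching a call site", which here simply snapshots whatever handler is installed in the ops structure at that moment. All type and function names are hypothetical.

```c
typedef const char *(*kick_fn)(void);

static const char *native_kick(void) { return "native"; }
static const char *xen_kick(void)    { return "xen"; }

/* Stand-in for the ops table that call sites are patched from. */
struct sketch_lock_ops {
    kick_fn unlock_kick;
};

/* A "patched call site": remembers the handler seen at patch time. */
struct sketch_call_site {
    kick_fn target;
};

/* Patching copies the current handler into the call site, once. */
static void sketch_patch(struct sketch_call_site *site,
                         const struct sketch_lock_ops *ops)
{
    site->target = ops->unlock_kick;
}
```

If `sketch_patch()` runs while `unlock_kick` is still `native_kick` and the table is switched to `xen_kick` afterwards, the call site keeps calling the native handler. Updating the table first and patching second, which is the order the commit establishes, makes the call site pick up the Xen handler.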
    • K
      xen/spinlock: Fix locking path engaging too soon under PVHVM. · 1fb3a8b2
      Committed by Konrad Rzeszutek Wilk
      The xen_lock_spinning has a check for the kicker interrupts
      and if it is not initialized it will spin normally (not enter
      the slowpath).
      
      But for the PVHVM case we would initialize the kicker interrupt
      before the CPU came online. This meant that if the booting
      CPU used a spinlock and went into the slowpath, it would
      block forever. It blocks forever because, during bootup,
      the spinlock would be taken _before_ the CPU sets itself
      to be online (more on this further), and we would then
      poll on the event channel forever.
      
      The bootup CPU (see commit fc78d343
      "xen/smp: initialize IPI vectors before marking CPU online"
      for details) and the CPU that started the bootup consult
      the cpu_online_mask to determine whether the booting CPU should
      get an IPI. The booting CPU has to set itself in this mask via:
      
        set_cpu_online(smp_processor_id(), true);
      
      However, if the spinlock is taken before this (and it is) and
      it polls on an event channel - it will never be woken up as
      the kernel will never send an IPI to an offline CPU.
      
      Note that the PVHVM logic in sending IPIs is using the HVM
      path which has numerous checks using the cpu_online_mask
      and cpu_active_mask. See the above-mentioned git commit for details.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
  29. 20 Aug, 2013 1 commit
    • C
      xen/smp: initialize IPI vectors before marking CPU online · fc78d343
      Committed by Chuck Anderson
      An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      RCU has detected that a CPU has not entered a quiescent state within the
      grace period.  It needs to send the CPU a reschedule IPI if it is not
      offline.  rcu_implicit_offline_qs() does this check:
      
      	/*
      	 * If the CPU is offline, it is in a quiescent state.  We can
      	 * trust its state not to change because interrupts are disabled.
      	 */
      	if (cpu_is_offline(rdp->cpu)) {
      		rdp->offline_fqs++;
      		return 1;
      	}
      
      	Else the CPU is online.  Send it a reschedule IPI.
      
      The CPU is in the middle of being hot-plugged and has been marked online
      (!cpu_is_offline()).  See start_secondary():
      
      	set_cpu_online(smp_processor_id(), true);
      	...
      	per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
      
      start_secondary() then waits for the CPU bringing up the hot-plugged CPU to
      mark it as active:
      
      	/*
      	 * Wait until the cpu which brought this one up marked it
      	 * online before enabling interrupts. If we don't do that then
      	 * we can end up waking up the softirq thread before this cpu
      	 * reached the active state, which makes the scheduler unhappy
      	 * and schedule the softirq thread on the wrong cpu. This is
      	 * only observable with forced threaded interrupts, but in
      	 * theory it could also happen w/o them. It's just way harder
      	 * to achieve.
      	 */
      	while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
      		cpu_relax();
      
      	/* enable local interrupts */
      	local_irq_enable();
      
      The CPU being hot-plugged will be marked active after it has been fully
      initialized by the CPU managing the hot-plug.  In the Xen PVHVM case
      xen_smp_intr_init() is called to set up the hot-plugged vCPU's
      XEN_RESCHEDULE_VECTOR.
      
      The hot-plugging CPU is marked online, not marked active and does not have
      its IPI vectors set up.  rcu_implicit_offline_qs() sees the hot-plugging
      cpu is !cpu_is_offline() and tries to send it a reschedule IPI.
      This will lead to:
      
      	kernel BUG at drivers/xen/events.c:1328!
      
      	xen_send_IPI_one()
      	xen_smp_send_reschedule()
      	rcu_implicit_offline_qs()
      	rcu_implicit_dynticks_qs()
      	force_qs_rnp()
      	force_quiescent_state()
      	__rcu_process_callbacks()
      	rcu_process_callbacks()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      
      because xen_send_IPI_one() will attempt to use an uninitialized IRQ for
      the XEN_RESCHEDULE_VECTOR.
      
      There is at least one other place that has caused the same crash:
      
      	xen_smp_send_reschedule()
      	wake_up_idle_cpu()
      	add_timer_on()
      	clocksource_watchdog()
      	call_timer_fn()
      	run_timer_softirq()
      	__do_softirq()
      	call_softirq()
      	do_softirq()
      	irq_exit()
      	xen_evtchn_do_upcall()
      	xen_hvm_callback_vector()
      
      clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle
      a watchdog timer:
      
      	/*
      	 * Cycle through CPUs to check if the CPUs stay synchronized
      	 * to each other.
      	 */
      	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
      	if (next_cpu >= nr_cpu_ids)
      		next_cpu = cpumask_first(cpu_online_mask);
      	watchdog_timer.expires += WATCHDOG_INTERVAL;
      	add_timer_on(&watchdog_timer, next_cpu);
      
      This resulted in an attempt to send an IPI to a hot-plugging CPU that
      had not initialized its reschedule vector. One option would be to make
      the RCU code check not for CPU offline but for CPU active, as becoming
      active is done after a CPU is online (in older kernels).
      
      But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been
      completely reworked - in the online path, cpu_active is set *before* cpu_online,
      and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING
      notification instead of CPU_DOWN_PREPARE." Drilling into the bring-up
      path: "[brought up CPU].. send out a CPU_STARTING notification, and in response
      to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask
      is better left to the scheduler alone, since it has the intelligence to use it
      judiciously."
      
      The conclusion was that:
      "
      1. At the IPI sender side:
      
         It is incorrect to send an IPI to an offline CPU (cpu not present in
         the cpu_online_mask). There are numerous places where we check this
         and warn/complain.
      
      2. At the IPI receiver side:
      
         It is incorrect to let the world know of our presence (by setting
         ourselves in global bitmasks) until our initialization steps are complete
         to such an extent that we can handle the consequences (such as
         receiving interrupts without crashing the sender etc.)
      " (from Srivatsa)
      
      As the native code enables the interrupts at some point we need to be
      able to service them. In other words a CPU must have valid IPI vectors
      if it has been marked online.
      
      It doesn't need to handle the IPI (interrupts may be disabled) but needs
      to have valid IPI vectors because another CPU may find it in cpu_online_mask
      and attempt to send it an IPI.
      
      This patch will change the order of the Xen vCPU bring-up functions so that
      Xen vectors have been set up before start_secondary() is called.
      It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails
      to initialize it.
      
      Orabug 13823853
      Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
      Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
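The invariant stated in fc78d343 ("a CPU must have valid IPI vectors if it has been marked online") can be modelled in a few lines. This is a simplified user-space sketch with hypothetical names; the real code paths are xen_smp_intr_init(), start_secondary() and xen_send_IPI_one(). Each bring-up function returns what another CPU would observe if it tried to send an IPI right after the first step.

```c
#include <stdbool.h>

enum ipi_result { IPI_SKIPPED, IPI_SENT, IPI_WOULD_CRASH };

struct sketch_cpu {
    bool online;            /* visible in the online mask */
    bool ipi_vectors_ready; /* reschedule vector has been set up */
};

/* Sender side: offline CPUs are skipped, online CPUs get an IPI. */
static enum ipi_result sketch_send_ipi(const struct sketch_cpu *c)
{
    if (!c->online)
        return IPI_SKIPPED;
    /* Sending through an uninitialized vector models the kernel BUG. */
    return c->ipi_vectors_ready ? IPI_SENT : IPI_WOULD_CRASH;
}

/* Old order: online first, vectors later. */
static enum ipi_result bringup_old_order(struct sketch_cpu *c)
{
    enum ipi_result seen;

    c->online = true;
    seen = sketch_send_ipi(c);   /* another CPU sends in this window */
    c->ipi_vectors_ready = true; /* too late for that sender */
    return seen;
}

/* New order: vectors first, online later. */
static enum ipi_result bringup_new_order(struct sketch_cpu *c)
{
    enum ipi_result seen;

    c->ipi_vectors_ready = true;
    seen = sketch_send_ipi(c);   /* still offline, so the sender skips it */
    c->online = true;
    return seen;
}
```

In the old order a sender in the window hits the crash case; in the new order the same sender either skips the CPU or, once it is online, finds a valid vector.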
  30. 09 Aug, 2013 2 commits