1. 23 8月, 2012 2 次提交
    • K
      xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything. · 3699aad0
      Konrad Rzeszutek Wilk 提交于
      We don't need to return the new PGD - as we do not use it.
      Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3699aad0
    • K
      Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and... · 51faaf2b
      Konrad Rzeszutek Wilk 提交于
      Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
      
      This reverts commit 806c312e and
      commit 59b29440.
      
      And also documents setup.c and why we want to do it that way, which
      is that we tried to make the the memblock_reserve more selective so
      that it would be clear what region is reserved. Sadly we ran
      in the problem wherein on a 64-bit hypervisor with a 32-bit
      initial domain, the pt_base has the cr3 value which is not
      neccessarily where the pagetable starts! As Jan put it: "
      Actually, the adjustment turns out to be correct: The page
      tables for a 32-on-64 dom0 get allocated in the order "first L1",
      "first L2", "first L3", so the offset to the page table base is
      indeed 2. When reading xen/include/public/xen.h's comment
      very strictly, this is not a violation (since there nothing is said
      that the first thing in the page table space is pointed to by
      pt_base; I admit that this seems to be implied though, namely
      do I think that it is implied that the page table space is the
      range [pt_base, pt_base + nt_pt_frames), whereas that
      range here indeed is [pt_base - 2, pt_base - 2 + nt_pt_frames),
      which - without a priori knowledge - the kernel would have
      difficulty to figure out)." - so lets just fall back to the
      easy way and reserve the whole region.
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      51faaf2b
  2. 22 8月, 2012 3 次提交
    • K
      xen/apic/xenbus/swiotlb/pcifront/grant/tmem: Make functions or variables static. · b8b0f559
      Konrad Rzeszutek Wilk 提交于
      There is no need for those functions/variables to be visible. Make them
      static and also fix the compile warnings of this sort:
      
      drivers/xen/<some file>.c: warning: symbol '<blah>' was not declared. Should it be static?
      
      Some of them just require including the header file that
      declares the functions.
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      b8b0f559
    • K
      xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain. · 806c312e
      Konrad Rzeszutek Wilk 提交于
      If a 64-bit hypervisor is booted with a 32-bit initial domain,
      the hypervisor deals with the initial domain as "compat" and
      does some extra adjustments (like pagetables are 4 bytes instead
      of 8). It also adjusts the xen_start_info->pt_base incorrectly.
      
      When booted with a 32-bit hypervisor (32-bit initial domain):
      ..
      (XEN)  Start info:    cf831000->cf83147c
      (XEN)  Page tables:   cf832000->cf8b5000
      ..
      [    0.000000] PT: cf832000 (f832000)
      [    0.000000] Reserving PT: f832000->f8b5000
      
      And with a 64-bit hypervisor:
      (XEN)  Start info:    00000000cf831000->00000000cf8314b4
      (XEN)  Page tables:   00000000cf832000->00000000cf8b6000
      
      [    0.000000] PT: cf834000 (f834000)
      [    0.000000] Reserving PT: f834000->f8b8000
      
      To deal with this, we keep keep track of the highest physical
      address we have reserved via memblock_reserve. If that address
      does not overlap with pt_base, we have a gap which we reserve.
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      806c312e
    • K
      xen/x86: Use memblock_reserve for sensitive areas. · 59b29440
      Konrad Rzeszutek Wilk 提交于
      instead of a big memblock_reserve. This way we can be more
      selective in freeing regions (and it also makes it easier
      to understand where is what).
      
      [v1: Move the auto_translate_physmap to proper line]
      [v2: Per Stefano suggestion add more comments]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      59b29440
  3. 17 8月, 2012 1 次提交
  4. 20 7月, 2012 7 次提交
  5. 08 6月, 2012 1 次提交
  6. 01 6月, 2012 1 次提交
    • A
      xen/setup: filter APERFMPERF cpuid feature out · 5e626254
      Andre Przywara 提交于
      Xen PV kernels allow access to the APERF/MPERF registers to read the
      effective frequency. Access to the MSRs is however redirected to the
      currently scheduled physical CPU, making consecutive read and
      compares unreliable. In addition each rdmsr traps into the hypervisor.
      So to avoid bogus readouts and expensive traps, disable the kernel
      internal feature flag for APERF/MPERF if running under Xen.
      This will
      a) remove the aperfmperf flag from /proc/cpuinfo
      b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to
         use the feature to improve scheduling (by default disabled)
      c) not mislead the cpufreq driver to use the MSRs
      
      This does not cover userland programs which access the MSRs via the
      device file interface, but this will be addressed separately.
      Signed-off-by: NAndre Przywara <andre.przywara@amd.com>
      Cc: stable@vger.kernel.org # v3.0+
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      5e626254
  7. 31 5月, 2012 1 次提交
  8. 08 5月, 2012 4 次提交
  9. 07 5月, 2012 1 次提交
    • K
      xen/apic: Return the APIC ID (and version) for CPU 0. · 558daa28
      Konrad Rzeszutek Wilk 提交于
      On x86_64 on AMD machines where the first APIC_ID is not zero, we get:
      
      ACPI: LAPIC (acpi_id[0x01] lapic_id[0x10] enabled)
      BIOS bug: APIC version is 0 for CPU 1/0x10, fixing up to 0x10
      BIOS bug: APIC version mismatch, boot CPU: 0, CPU 1: version 10
      
      which means that when the ACPI processor driver loads and
      tries to parse the _Pxx states it fails to do as, as it
      ends up calling acpi_get_cpuid which does this:
      
      for_each_possible_cpu(i) {
              if (cpu_physical_id(i) == apic_id)
                      return i;
      }
      
      And the bootup CPU, has not been found so it fails and returns -1
      for the first CPU - which then subsequently in the loop that
      "acpi_processor_get_info" does results in returning an error, which
      means that "acpi_processor_add" failing and per_cpu(processor)
      is never set (and is NULL).
      
      That means that when xen-acpi-processor tries to load (much much
      later on) and parse the P-states it gets -ENODEV from
      acpi_processor_register_performance() (which tries to read
      the per_cpu(processor)) and fails to parse the data.
      Reported-by-and-Tested-by: NStefan Bader <stefan.bader@canonical.com>
      Suggested-by: NBoris Ostrovsky <boris.ostrovsky@amd.com>
      [v2: Bit-shift APIC ID by 24 bits]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      558daa28
  10. 02 5月, 2012 1 次提交
  11. 27 4月, 2012 1 次提交
    • K
      xen/enlighten: Disable MWAIT_LEAF so that acpi-pad won't be loaded. · df88b2d9
      Konrad Rzeszutek Wilk 提交于
      There are exactly four users of __monitor and __mwait:
      
       - cstate.c (which allows acpi_processor_ffh_cstate_enter to be called
         when the cpuidle API drivers are used. However patch
         "cpuidle: replace xen access to x86 pm_idle and default_idle"
         provides a mechanism to disable the cpuidle and use safe_halt.
       - smpboot (which allows mwait_play_dead to be called). However
         safe_halt is always used so we skip that.
       - intel_idle (same deal as above).
       - acpi_pad.c. This the one that we do not want to run as we
         will hit the below crash.
      
      Why do we want to expose MWAIT_LEAF in the first place?
      We want it for the xen-acpi-processor driver - which uploads
      C-states to the hypervisor. If MWAIT_LEAF is set, the cstate.c
      sets the proper address in the C-states so that the hypervisor
      can benefit from using the MWAIT functionality. And that is
      the sole reason for using it.
      
      Without this patch, if a module performs mwait or monitor we
      get this:
      
      invalid opcode: 0000 [#1] SMP
      CPU 2
      .. snip..
      Pid: 5036, comm: insmod Tainted: G           O 3.4.0-rc2upstream-dirty #2 Intel Corporation S2600CP/S2600CP
      RIP: e030:[<ffffffffa000a017>]  [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
      RSP: e02b:ffff8801c298bf18  EFLAGS: 00010282
      RAX: ffff8801c298a010 RBX: ffffffffa03b2000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff8801c29800d8 RDI: ffff8801ff097200
      RBP: ffff8801c298bf18 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
      R13: ffffffffa000a000 R14: 0000005148db7294 R15: 0000000000000003
      FS:  00007fbb364f2700(0000) GS:ffff8801ff08c000(0000) knlGS:0000000000000000
      CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 000000000179f038 CR3: 00000001c9469000 CR4: 0000000000002660
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process insmod (pid: 5036, threadinfo ffff8801c298a000, task ffff8801c29cd7e0)
      Stack:
       ffff8801c298bf48 ffffffff81002124 ffffffffa03b2000 00000000000081fd
       000000000178f010 000000000178f030 ffff8801c298bf78 ffffffff810c41e6
       00007fff3fb30db9 00007fff3fb30db9 00000000000081fd 0000000000010000
      Call Trace:
       [<ffffffff81002124>] do_one_initcall+0x124/0x170
       [<ffffffff810c41e6>] sys_init_module+0xc6/0x220
       [<ffffffff815b15b9>] system_call_fastpath+0x16/0x1b
      Code: <0f> 01 c8 31 c0 0f 01 c9 c9 c3 00 00 00 00 00 00 00 00 00 00 00 00
      RIP  [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
       RSP <ffff8801c298bf18>
      ---[ end trace 16582fc8a3d1e29a ]---
      Kernel panic - not syncing: Fatal exception
      
      With this module (which is what acpi_pad.c would hit):
      
      MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
      MODULE_DESCRIPTION("mwait_check_and_back");
      MODULE_LICENSE("GPL");
      MODULE_VERSION();
      
      static int __init mwait_check_init(void)
      {
      	__monitor((void *)&current_thread_info()->flags, 0, 0);
      	__mwait(0, 0);
      	return 0;
      }
      static void __exit mwait_check_exit(void)
      {
      }
      module_init(mwait_check_init);
      module_exit(mwait_check_exit);
      Reported-by: NLiu, Jinsong <jinsong.liu@intel.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      df88b2d9
  12. 29 3月, 2012 1 次提交
  13. 11 3月, 2012 1 次提交
    • K
      xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it. · 73c154c6
      Konrad Rzeszutek Wilk 提交于
      For the hypervisor to take advantage of the MWAIT support it needs
      to extract from the ACPI _CST the register address. But the
      hypervisor does not have the support to parse DSDT so it relies on
      the initial domain (dom0) to parse the ACPI Power Management information
      and push it up to the hypervisor. The pushing of the data is done
      by the processor_harveset_xen module which parses the information that
      the ACPI parser has graciously exposed in 'struct acpi_processor'.
      
      For the ACPI parser to also expose the Cx states for MWAIT, we need
      to expose the MWAIT capability (leaf 1). Furthermore we also need to
      expose the MWAIT_LEAF capability (leaf 5) for cstate.c to properly
      function.
      
      The hypervisor could expose these flags when it traps the XEN_EMULATE_PREFIX
      operations, but it can't do it since it needs to be backwards compatible.
      Instead we choose to use the native CPUID to figure out if the MWAIT
      capability exists and use the XEN_SET_PDC query hypercall to figure out
      if the hypervisor wants us to expose the MWAIT_LEAF capability or not.
      
      Note: The XEN_SET_PDC query was implemented in c/s 23783:
      "ACPI: add _PDC input override mechanism".
      
      With this in place, instead of
       C3 ACPI IOPORT 415
      we get now
       C3:ACPI FFH INTEL MWAIT 0x20
      
      Note: The cpu_idle which would be calling the mwait variants for idling
      never gets set b/c we set the default pm_idle to be the hypercall variant.
      Acked-by: NJan Beulich <JBeulich@suse.com>
      [v2: Fix missing header file include and #ifdef]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      73c154c6
  14. 20 2月, 2012 2 次提交
    • K
      xen/pat: Disable PAT support for now. · 8eaffa67
      Konrad Rzeszutek Wilk 提交于
      [Pls also look at https://lkml.org/lkml/2012/2/10/228]
      
      Using of PAT to change pages from WB to WC works quite nicely.
      Changing it back to WB - not so much. The crux of the matter is
      that the code that does this (__page_change_att_set_clr) has only
      limited information so when it tries to the change it gets
      the "raw" unfiltered information instead of the properly filtered one -
      and the "raw" one tell it that PSE bit is on (while infact it
      is not).  As a result when the PTE is set to be WB from WC, we get
      tons of:
      
      :WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0()
      :Hardware name: HP xw4400 Workstation
      .. snip..
      :Pid: 27, comm: kswapd0 Tainted: G        W    3.2.2-1.fc16.x86_64 #1
      :Call Trace:
      : [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0
      : [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20
      : [<ffffffff81005a17>] xen_make_pte+0x67/0xa0
      : [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e
      : [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00
      : [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0
      : [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190
      : [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0
      : [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0
      : [<ffffffff8100a9b2>] ? check_events+0x12/0x20
      : [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm]
      : [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm]
      : [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm]
      : [<ffffffff8112ba04>] shrink_slab+0x154/0x310
      : [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0
      : [<ffffffff8112f4b8>] kswapd+0x178/0x3d0
      : [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0
      : [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50
      : [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0
      : [<ffffffff8108fb6c>] kthread+0x8c/0xa0
      
      for every page. The proper fix for this is has been posted
      and is https://lkml.org/lkml/2012/2/10/228
      "x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations."
      along with a detailed description of the problem and solution.
      
      But since that posting has gone nowhere I am proposing
      this band-aid solution so that at least users don't get
      the page corruption (the pages that are WC don't get changed to WB
      and end up being recycled for filesystem or other things causing
      mysterious crashes).
      
      The negative impact of this patch is that users of WC flag
      (which are InfiniBand, radeon, nouveau drivers) won't be able
      to set that flag - so they are going to see performance degradation.
      But stability is more important here.
      
      Fixes RH BZ# 742032, 787403, and 745574
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      8eaffa67
    • K
      xen/setup: Remove redundant filtering of PTE masks. · 416d7214
      Konrad Rzeszutek Wilk 提交于
      commit 7347b408 "xen: Allow
      unprivileged Xen domains to create iomap pages" added a redundant
      line in the early bootup code to filter out the PTE. That
      filtering is already done a bit earlier so this extra processing
      is not required.
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      416d7214
  15. 25 1月, 2012 1 次提交
  16. 09 12月, 2011 1 次提交
    • T
      memblock: Kill memblock_init() · fe091c20
      Tejun Heo 提交于
      memblock_init() initializes arrays for regions and memblock itself;
      however, all these can be done with struct initializers and
      memblock_init() can be removed.  This patch kills memblock_init() and
      initializes memblock with struct initializer.
      
      The only difference is that the first dummy entries don't have .nid
      set to MAX_NUMNODES initially.  This doesn't cause any behavior
      difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      fe091c20
  17. 17 11月, 2011 1 次提交
  18. 20 10月, 2011 1 次提交
  19. 17 8月, 2011 1 次提交
    • J
      xen/x86: replace order-based range checking of M2P table by linear one · ccbcdf7c
      Jan Beulich 提交于
      The order-based approach is not only less efficient (requiring a shift
      and a compare, typical generated code looking like this
      
      	mov	eax, [machine_to_phys_order]
      	mov	ecx, eax
      	shr	ebx, cl
      	test	ebx, ebx
      	jnz	...
      
      whereas a direct check requires just a compare, like in
      
      	cmp	ebx, [machine_to_phys_nr]
      	jae	...
      
      ), but also slightly dangerous in the 32-on-64 case - the element
      address calculation can wrap if the next power of two boundary is
      sufficiently far away from the actual upper limit of the table, and
      hence can result in user space addresses being accessed (with it being
      unknown what may actually be mapped there).
      
      Additionally, the elimination of the mistaken use of fls() here (should
      have been __fls()) fixes a latent issue on x86-64 that would trigger
      if the code was run on a system with memory extending beyond the 44-bit
      boundary.
      
      CC: stable@kernel.org
      Signed-off-by: NJan Beulich <jbeulich@novell.com>
      [v1: Based on Jeremy's feedback]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ccbcdf7c
  20. 05 8月, 2011 1 次提交
  21. 19 7月, 2011 1 次提交
  22. 16 6月, 2011 1 次提交
  23. 06 6月, 2011 1 次提交
  24. 13 5月, 2011 1 次提交
  25. 06 4月, 2011 2 次提交
  26. 26 2月, 2011 1 次提交