1. 14 June 2012, 1 commit
    • x86: Add read_mostly declaration/definition to variables from smp.h · 0816b0f0
      Authored by Vlad Zolotarov
      Add "read-mostly" qualifier to the following variables in
      smp.h:
      
       - cpu_sibling_map
       - cpu_core_map
       - cpu_llc_shared_map
       - cpu_llc_id
       - cpu_number
       - x86_cpu_to_apicid
       - x86_bios_cpu_apicid
       - x86_cpu_to_logical_apicid
      
      Since all of the variables above are only written during
      initialization, this change prevents false sharing. More
      specifically, on the vSMP Foundation platform x86_cpu_to_apicid
      shared the same internode cache line with the frequently written
      lapic_events.
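
      A minimal sketch of what the qualifier amounts to, assuming the
      standard per-CPU helpers (DECLARE/DEFINE_PER_CPU_READ_MOSTLY);
      cpu_llc_id is used here purely as an illustrative example:

        /* header (e.g. smp.h): keep this per-CPU variable in the
         * read-mostly part of the per-CPU area so it doesn't share a
         * cache line with frequently written per-CPU data */
        DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);

        /* matching definition in the corresponding .c file */
        DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);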
      
      Analysis of the first 33 of the 219 per_cpu variables (more
      precisely, of the memory they describe) shows that 8 are
      read-mostly in nature (tlb_vector_offset, cpu_loops_per_jiffy,
      xen_debug_irq, etc.) and 25 are frequently written
      (irq_stack_union, gdt_page, exception_stacks, idt_desc, etc.).
      
      Assuming the rest of the per_cpu variables are distributed
      similarly, identifying the read-mostly memory makes more sense for
      long-term code maintenance than identifying the frequently written
      memory.
      Signed-off-by: Vlad Zolotarov <vlad@scalemp.com>
      Acked-by: Shai Fultheim <shai@scalemp.com>
      Cc: Shai Fultheim (Shai@ScaleMP.com) <Shai@scalemp.com>
      Cc: ido@wizery.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1719258.EYKzE4Zbq5@vlad
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0816b0f0
  2. 09 May 2012, 1 commit
    • percpu, x86: don't use PMD_SIZE as embedded atom_size on 32bit · d5e28005
      Authored by Tejun Heo
      With the embed percpu first chunk allocator, x86 uses either PAGE_SIZE
      or PMD_SIZE for atom_size.  PMD_SIZE is used when CPU supports PSE so
      that percpu areas are aligned to PMD mappings and possibly allow using
      PMD mappings in vmalloc areas in the future.  Using larger atom_size
      doesn't waste actual memory; however, it does require larger vmalloc
      space allocation later on for !first chunks.
      
      With reasonably sized vmalloc area, PMD_SIZE shouldn't be a problem
      but x86_32 at this point is anything but reasonable in terms of
      address space and using larger atom_size reportedly leads to frequent
      percpu allocation failures on certain setups.
      
      Since there is no reason not to use PMD_SIZE on x86_64, where
      vmalloc space is plentiful and most configurations support PSE, fix
      the issue by always using PMD_SIZE on x86_64 and PAGE_SIZE on
      x86_32.
      
      v2: drop cpu_has_pse test and make x86_64 always use PMD_SIZE and
          x86_32 PAGE_SIZE as suggested by hpa.
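
      A minimal sketch of the resulting selection in x86's
      setup_per_cpu_areas(), assuming the value feeds the embed
      allocator's atom_size as described above:

        #ifdef CONFIG_X86_64
                const size_t atom_size = PMD_SIZE;   /* 2MB units; PSE assumed present */
        #else
                const size_t atom_size = PAGE_SIZE;  /* keep vmalloc pressure low on 32bit */
        #endif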
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Yanmin Zhang <yanmin.zhang@intel.com>
      Reported-by: ShuoX Liu <shuox.liu@intel.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      LKML-Reference: <4F97BA98.6010001@intel.com>
      Cc: stable@vger.kernel.org
      d5e28005
  3. 28 January 2011, 2 commits
    • x86: Unify CPU -> NUMA node mapping between 32 and 64bit · 645a7919
      Authored by Tejun Heo
      Unlike 64bit, 32bit has been using its own cpu_to_node_map[] for
      the CPU -> NUMA node mapping.  Replace it with the early percpu
      variable x86_cpu_to_node_map and share the mapping code with 64bit
      (see the sketch after the list below).
      
      * USE_PERCPU_NUMA_NODE_ID is now enabled for 32bit too.
      
      * x86_cpu_to_node_map and numa_set/clear_node() are moved from
        numa_64 to numa.  For now, on 32bit, x86_cpu_to_node_map is initialized
        with 0 instead of NUMA_NO_NODE.  This is to avoid introducing unexpected
        behavior change and will be updated once init path is unified.
      
      * srat_detect_node() is now enabled for x86_32 too.  It calls
        numa_set_node() and initializes the mapping, making explicit
        cpu_to_node_map[] updates from map/unmap_cpu_to_node() unnecessary.
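
      A simplified sketch of how the shared early percpu variable is
      accessed, condensed from the mainline helpers (early_per_cpu_ptr,
      per_cpu); debug checks are omitted:

        void numa_set_node(int cpu, int node)
        {
                int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);

                if (cpu_to_node_map) {          /* per-CPU areas not set up yet */
                        cpu_to_node_map[cpu] = node;
                        return;
                }
                per_cpu(x86_cpu_to_node_map, cpu) = node;
        }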
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: penberg@kernel.org
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-15-git-send-email-tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: David Rientjes <rientjes@google.com>
      645a7919
    • x86: Replace cpu_2_logical_apicid[] with early percpu variable · 4c321ff8
      Authored by Tejun Heo
      Unlike x86_64, on x86_32 the mapping from cpu to logical apicid
      may vary depending on the APIC in use.  The cpu_2_logical_apicid[]
      array is used for this mapping.  Replace it with the early percpu
      variable x86_cpu_to_logical_apicid to align it better with the
      other mappings.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: eric.dumazet@gmail.com
      Cc: yinghai@kernel.org
      Cc: brgerst@gmail.com
      Cc: gorcunov@gmail.com
      Cc: penberg@kernel.org
      Cc: shaohui.zheng@intel.com
      Cc: rientjes@google.com
      LKML-Reference: <1295789862-25482-5-git-send-email-tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4c321ff8
  4. 28 August 2010, 1 commit
    • x86: Use memblock to replace early_res · 72d7c3b3
      Authored by Yinghai Lu
      1. replace find_e820_area with memblock_find_in_range
      2. replace reserve_early with memblock_x86_reserve_range
      3. replace free_early with memblock_x86_free_range
         (see the sketch after this list)
      4. NO_BOOTMEM will switch to use memblock too.
      5. keep the _e820/_early wrappers in this patch; a following patch
         will replace them all.
      6. because memblock_x86_free_range supports partial free, some
         special-case handling can be removed.
      7. make sure that memblock_find_in_range() is called after
         memblock_x86_fill(), so move some callers later in
         setup.c::setup_arch() -- corruption_check and mptable_update
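
      An illustrative before/after for items 1-3; the argument order
      follows the old early_res helpers and may not match every call
      site exactly:

        /* before: early_res */
        addr = find_e820_area(start, end, size, align);
        reserve_early(addr, addr + size, "MP-table");
        free_early(addr, addr + size);

        /* after: memblock */
        addr = memblock_find_in_range(start, end, size, align);
        memblock_x86_reserve_range(addr, addr + size, "MP-table");
        memblock_x86_free_range(addr, addr + size);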
      
      -v2: Move reserve_brk() earlier, before fill_memblock_area(), to
          avoid an overlap between brk and memblock_find_in_range().  That
          can happen when there are more than 128 RAM entries in the E820
          table and memblock_x86_fill() uses memblock_find_in_range() to
          find a new place for the memblock.memory.region array; with the
          move, extend_brk() is no longer needed after fill_memblock_area().
      -v3: Move find_smp_config earlier, so that memblock_find_in_range()
          cannot pick a wrong place if the BIOS doesn't put the mptable in
          the right place.
      -v4: Treat RESERVED_KERN as RAM in memblock.memory (those ranges are
          already in memblock.reserved), and use __NOT_KEEP_MEMBLOCK so
          memblock-related code can be freed later.
      -v5: The generic __memblock_find_in_range() searches from high to
          low, and on 32bit the active_region does include high pages, so
          replace the limit with memblock.default_alloc_limit, aka
          get_max_mapped().
      -v6: Use current_limit instead.
      -v7: Check with MEMBLOCK_ERROR instead of -1ULL or -1L.
      -v8: Set memblock_can_resize early to handle EFI with more RAM entries.
      -v9: Update after the kmemleak changes in mainline.
      Suggested-by: David S. Miller <davem@davemloft.net>
      Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      72d7c3b3
  5. 13 August 2010, 1 commit
  6. 22 July 2010, 1 commit
  7. 21 July 2010, 1 commit
    • x86, numa: fix boot without RAM on node0 again · 9aebbdb6
      Authored by Yinghai Lu
      Commit e534c7c5 ("numa: x86_64: use generic percpu var
      numa_node_id() implementation") broke NUMA systems that don't have
      RAM on node0 when MEMORY_HOTPLUG is enabled, because cpu_up() will
      call cpu_to_node() before per_cpu(numa_node) is set up for the APs.
      
      When node0 doesn't have RAM, x86 already rounds each cpu to the
      nearest node with RAM in x86_cpu_to_node_map, but per_cpu(numa_node)
      is not set up until c_init for the APs.
      
      So when cpu_up() later calls cpu_to_node() it gets 0 again and
      brings the cpu online even though there is no RAM on node0; as a
      result none of the APs can be booted and the system eventually
      panics.
      
      [    1.611101] On node 0 totalpages: 0
      .........
      [    2.608558] On node 0 totalpages: 0
      [    2.612065] Brought up 1 CPUs
      [    2.615199] Total of 1 processors activated (3990.31 BogoMIPS).
      ...
         93.225341] calling  loop_init+0x0/0x1a4 @ 1
      [   93.229314] PERCPU: allocation failed, size=80 align=8, failed to populate
      [   93.246539] Pid: 1, comm: swapper Tainted: G        W   2.6.35-rc4-tip-yh-04371-gd64e6c4-dirty #354
      [   93.264621] Call Trace:
      [   93.266533]  [<ffffffff81125e43>] pcpu_alloc+0x83a/0x8e7
      [   93.270710]  [<ffffffff81125f15>] __alloc_percpu+0x10/0x12
      [   93.285849]  [<ffffffff8140786c>] alloc_disk_node+0x94/0x16d
      [   93.291811]  [<ffffffff81407956>] alloc_disk+0x11/0x13
      [   93.306157]  [<ffffffff81503e51>] loop_alloc+0xa7/0x180
      [   93.310538]  [<ffffffff8277ef48>] loop_init+0x9b/0x1a4
      [   93.324909]  [<ffffffff8277eead>] ? loop_init+0x0/0x1a4
      [   93.329650]  [<ffffffff810001f2>] do_one_initcall+0x57/0x136
      [   93.345197]  [<ffffffff827486d0>] kernel_init+0x184/0x20e
      [   93.348146]  [<ffffffff81034954>] kernel_thread_helper+0x4/0x10
      [   93.365194]  [<ffffffff81c7cc3c>] ? restore_args+0x0/0x30
      [   93.369305]  [<ffffffff8274854c>] ? kernel_init+0x0/0x20e
      [   93.386011]  [<ffffffff81034950>] ? kernel_thread_helper+0x0/0x10
      [   93.392047] loop: out of memory
      ...
      
      Try to assign per_cpu(numa_node) early
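
      A hedged sketch of that assignment, assuming it sits in the
      per-cpu setup loop of setup_per_cpu_areas() after the early node
      map has been populated:

        #ifdef CONFIG_NUMA
                /*
                 * Seed per_cpu(numa_node) from the early map so that
                 * cpu_to_node() returns the real (nearest) node before
                 * c_init runs on the APs.
                 */
                set_cpu_numa_node(cpu, early_cpu_to_node(cpu));
        #endif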
      
      [akpm@linux-foundation.org: tidy up code comment]
      Signed-off-by: Yinghai <yinghai@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9aebbdb6
  8. 31 May 2010, 1 commit
  9. 28 May 2010, 1 commit
  10. 03 March 2010, 1 commit
  11. 26 February 2010, 1 commit
  12. 10 December 2009, 1 commit
  13. 14 August 2009, 9 commits
    • x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA · 4518e6a0
      Authored by Tejun Heo
      Embedding percpu first chunk allocator can now handle very sparse unit
      mapping.  Use embedding allocator instead of lpage for 64bit NUMA.
      This removes extra TLB pressure and the need to do complex and fragile
      dancing when changing page attributes.
      
      For 32bit, using very sparse unit mapping isn't a good idea because
      the vmalloc space is very constrained.  32bit NUMA machines aren't
      exactly the focus of optimization and it isn't very clear whether
      lpage performs better than page.  Use page first chunk allocator for
      32bit NUMAs.
      
      As this leaves setup_pcpu_*() functions pretty much empty, fold them
      into setup_per_cpu_areas().
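
      A rough sketch of the folded-in logic in setup_per_cpu_areas();
      the callback names (pcpu_cpu_distance, pcpu_fc_alloc, pcpu_fc_free,
      pcpu_populate_pte) are illustrative:

        rc = -EINVAL;
        if (pcpu_chosen_fc != PCPU_FC_PAGE)     /* embed unless "page" was forced */
                rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
                                            dyn_size, atom_size,
                                            pcpu_cpu_distance,
                                            pcpu_fc_alloc, pcpu_fc_free);
        if (rc < 0)                             /* 32bit NUMA and the fallback path */
                rc = pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
                                           pcpu_fc_alloc, pcpu_fc_free,
                                           pcpu_populate_pte);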
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      4518e6a0
    • percpu: update embedding first chunk allocator to handle sparse units · c8826dd5
      Authored by Tejun Heo
      Now that percpu core can handle very sparse units, given that vmalloc
      space is large enough, embedding first chunk allocator can use any
      memory to build the first chunk.  This patch teaches
      pcpu_embed_first_chunk() about distances between cpus and to use
      alloc/free callbacks to allocate node specific areas for each group
      and use them for the first chunk.
      
      This brings the benefits of the embedding allocator to NUMA
      configurations: no extra TLB pressure, the flexibility of the
      unified dynamic allocator, and no need to restructure arch code to
      build a memory layout suitable for percpu.  With units put into
      atom_size-aligned groups according to cpu distances, using large
      pages for dynamic chunks is also easily possible, with a fallback
      to regular pages if a large allocation fails.
      
      Embedding allocator users are converted to specify NULL
      cpu_distance_fn, so this patch doesn't cause any visible behavior
      difference.  Following patches will convert them.
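
      The resulting interface, roughly (parameter types are from memory
      and may differ slightly in the actual tree):

        int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                                          size_t atom_size,
                                          pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
                                          pcpu_fc_alloc_fn_t alloc_fn,
                                          pcpu_fc_free_fn_t free_fn);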
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c8826dd5
    • percpu: add pcpu_unit_offsets[] · fb435d52
      Authored by Tejun Heo
      Currently units are mapped sequentially into address space.  This
      patch adds pcpu_unit_offsets[], which allows units to be mapped at
      arbitrary offsets from the chunk base address.  This is necessary
      to allow sparse embedding, which might need to allocate address
      ranges and memory areas that aren't aligned to the unit size but to
      the allocation atom size (page or large page size).  This also
      simplifies things a bit by removing the need to calculate an offset
      from the unit number.
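
      Conceptually, the unit address calculation changes along these
      lines (a sketch, not the literal code):

        /* before: units packed back to back, offset implied by unit number */
        unit_addr = base_addr + unit * pcpu_unit_size;

        /* after: each unit carries its own offset, so units can be placed sparsely */
        unit_addr = base_addr + pcpu_unit_offsets[unit];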
      
      With this change, there's no need for the arch code to know
      pcpu_unit_size.  Update pcpu_setup_first_chunk() and first chunk
      allocators to return regular 0 or -errno return code instead of unit
      size or -errno.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      fb435d52
    • percpu: introduce pcpu_alloc_info and pcpu_group_info · fd1e8a1f
      Authored by Tejun Heo
      Until now, the non-linear cpu->unit map was expressed using an
      integer array which maps each cpu to a unit and was used only by the
      lpage allocator.  Although how many units have been placed in a
      single contiguous area (group) is known while building unit_map, the
      information is lost when the result is recorded into the unit_map
      array.  For the lpage allocator, as all allocations are done by
      lpages and whether two adjacent lpages are in the same group or not
      is irrelevant, this didn't cause any problem.  Non-linear cpu->unit
      mapping will be used for sparse embedding and this grouping
      information is necessary for that.
      
      This patch introduces pcpu_alloc_info which contains all the
      information necessary for initializing percpu allocator.
      pcpu_alloc_info contains array of pcpu_group_info which describes how
      units are grouped and mapped to cpus.  pcpu_group_info also has
      base_offset field to specify its offset from the chunk's base address.
      pcpu_build_alloc_info() initializes this field as if all groups are
      allocated back-to-back as is currently done but this will be used to
      sparsely place groups.
      
      pcpu_alloc_info is a rather complex data structure which contains a
      flexible array which in turn points to nested cpu_map arrays.
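
      An abridged sketch of the two structures, with the field list
      condensed from memory; the authoritative layout lives in
      include/linux/percpu.h:

        struct pcpu_group_info {
                int             nr_units;       /* number of units in this group */
                unsigned long   base_offset;    /* group's offset from the chunk base */
                unsigned int    *cpu_map;       /* unit -> cpu map for the group */
        };

        struct pcpu_alloc_info {
                size_t          static_size, reserved_size, dyn_size;
                size_t          unit_size, atom_size, alloc_size;
                int             nr_groups;
                struct pcpu_group_info groups[];  /* flexible array of groups */
        };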
      
      * pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
        help dealing with pcpu_alloc_info.
      
      * pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
        generalized and renamed to pcpu_build_alloc_info().
        @cpu_distance_fn may be NULL indicating that all cpus are of
        LOCAL_DISTANCE.
      
      * pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
        generalized and renamed to pcpu_dump_alloc_info().  It now also
        prints which group each alloc unit belongs to.
      
      * pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
        separate parameters.  All first chunk allocators are updated to use
        pcpu_build_alloc_info() to build alloc_info and call
        pcpu_setup_first_chunk() with it.  This has the side effect of
        packing units for sparse possible cpus.  ie. if cpus 0, 2 and 4 are
        possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
      
      * x86 setup_pcpu_lpage() is updated to deal with alloc_info.
      
      * sparc64 setup_per_cpu_areas() is updated to build alloc_info.
      
      Although the changes made by this patch are pretty pervasive, it
      doesn't cause any behavior difference other than packing of sparse
      cpus.  It mostly changes how information is passed among
      initialization functions and makes room for more flexibility.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      fd1e8a1f
    • percpu: add @align to pcpu_fc_alloc_fn_t · 3cbc8565
      Authored by Tejun Heo
      pcpu_fc_alloc_fn_t is about to see more interesting usage; add an
      @align parameter.
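
      The callback type after this change, roughly:

        typedef void * (*pcpu_fc_alloc_fn_t)(unsigned int cpu, size_t size,
                                             size_t align);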
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3cbc8565
    • percpu: drop @static_size from first chunk allocators · 9a773769
      Authored by Tejun Heo
      First chunk allocators assume percpu areas have been linked using one
      of PERCPU_*() macros and depend on __per_cpu_load symbol defined by
      those macros, so there isn't much point in passing in static area size
      explicitly when it can be easily calculated from __per_cpu_start and
      __per_cpu_end.  Drop @static_size from all percpu first chunk
      allocators and helpers.
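
      The size is instead derived from the linker-provided section
      bounds, along these lines:

        const size_t static_size = __per_cpu_end - __per_cpu_start;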
      Signed-off-by: Tejun Heo <tj@kernel.org>
      9a773769
    • percpu: generalize first chunk allocator selection · f58dc01b
      Authored by Tejun Heo
      Now that all first chunk allocators are in mm/percpu.c, it makes
      sense to generalize the percpu_alloc kernel parameter.  Define
      PCPU_FC_* and set pcpu_chosen_fc using early_param() in mm/percpu.c.
      Arch code can use the set value to determine which first chunk
      allocator to use.
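
      A sketch of the selector this adds to mm/percpu.c (the lpage value
      only exists when the arch provides that allocator):

        enum pcpu_fc {
                PCPU_FC_AUTO,
                PCPU_FC_EMBED,
                PCPU_FC_PAGE,
                PCPU_FC_LPAGE,          /* arch-dependent */
                PCPU_FC_NR,
        };
        extern enum pcpu_fc pcpu_chosen_fc;  /* set via early_param("percpu_alloc", ...) */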
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f58dc01b
    • percpu: rename 4k first chunk allocator to page · 00ae4064
      Authored by Tejun Heo
      Page size isn't always 4k; it depends on the arch and
      configuration.  Rename the 4k first chunk allocator to page.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: David Howells <dhowells@redhat.com>
      00ae4064
    • percpu, sparc64: fix sparse possible cpu map handling · 74d46d6b
      Authored by Tejun Heo
      percpu code has been assuming num_possible_cpus() == nr_cpu_ids which
      is incorrect if cpu_possible_map contains holes.  This causes percpu
      code to access beyond allocated memories and vmalloc areas.  On a
      sparc64 machine with cpus 0 and 2 (u60), this triggers the following
      warning or fails boot.
      
       WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240()
       Modules linked in:
       Call Trace:
        [00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240
        [00000000004b1840] map_vm_area+0x20/0x60
        [00000000004b1950] __vmalloc_area_node+0xd0/0x160
        [0000000000593434] deflate_init+0x14/0xe0
        [0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0
        [00000000005844f0] crypto_alloc_base+0x50/0xa0
        [000000000058b898] alg_test_comp+0x18/0x80
        [000000000058dad4] alg_test+0x54/0x180
        [000000000058af00] cryptomgr_test+0x40/0x60
        [0000000000473098] kthread+0x58/0x80
        [000000000042b590] kernel_thread+0x30/0x60
        [0000000000472fd0] kthreadd+0xf0/0x160
       ---[ end trace 429b268a213317ba ]---
      
      This patch fixes generic percpu functions and sparc64
      setup_per_cpu_areas() so that they handle sparse cpu_possible_map
      properly.
      
      Please note that on x86, cpu_possible_map() doesn't contain holes and
      thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause
      any behavior difference.
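
      A small illustration of the distinction the fix relies on; on the
      u60 above with cpus 0 and 2 possible, num_possible_cpus() == 2 but
      nr_cpu_ids == 3, so cpu-indexed arrays must be sized by nr_cpu_ids
      and walked via the possible map (unit_map is an illustrative name):

        for_each_possible_cpu(cpu)          /* visits 0 and 2, skips the hole at 1 */
                unit_map[cpu] = unit++;     /* cpu number is not the unit number */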
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      74d46d6b
  14. 04 July 2009, 4 commits
    • percpu: teach large page allocator about NUMA · a530b795
      Authored by Tejun Heo
      The large page first chunk allocator is primarily used for NUMA
      machines; however, its NUMA handling is extremely simplistic.
      Regardless of their proximity, each cpu is put into a separate large
      page, only to return most of the allocated space back, wasting a
      large amount of vmalloc space and increasing the cache footprint.
      
      This patch teaches NUMA details to the large page allocator.  Given
      processor proximity information, pcpu_lpage_build_unit_map() will
      find a fitting cpu -> unit mapping in which cpus within
      LOCAL_DISTANCE share the same large page and not too much virtual
      address space is wasted.
      
      This greatly reduces the unit and thus chunk size and wastes much less
      address space for the first chunk.  For example, on 4/4 NUMA machine,
      the original code occupied 16MB of virtual space for the first chunk
      while the new code only uses 4MB - one 2MB page for each node.
      
      [ Impact: much better space efficiency on NUMA machines ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Miller <davem@davemloft.net>
      a530b795
    • x86,percpu: generalize lpage first chunk allocator · 8c4bfc6e
      Authored by Tejun Heo
      Generalize and move x86 setup_pcpu_lpage() into
      pcpu_lpage_first_chunk().  setup_pcpu_lpage() now is a simple wrapper
      around the generalized version.  Other than taking size parameters and
      using arch supplied callbacks to allocate/free/map memory,
      pcpu_lpage_first_chunk() is identical to the original implementation.
      
      This simplifies arch code and will help converting more archs to
      dynamic percpu allocator.
      
      While at it, factor out pcpu_calc_fc_sizes() which is common to
      pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().
      
      [ Impact: code reorganization and generalization ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      8c4bfc6e
    • x86,percpu: generalize 4k first chunk allocator · d4b95f80
      Authored by Tejun Heo
      Generalize and move x86 setup_pcpu_4k() into pcpu_4k_first_chunk().
      setup_pcpu_4k() now is a simple wrapper around the generalized
      version.  Other than taking size parameters and using arch supplied
      callbacks to allocate/free memory, pcpu_4k_first_chunk() is identical
      to the original implementation.
      
      This simplifies arch code and will help converting more archs to
      dynamic percpu allocator.
      
      While at it, s/pcpu_populate_pte_fn_t/pcpu_fc_populate_pte_fn_t/ for
      consistency.
      
      [ Impact: code reorganization and generalization ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      d4b95f80
    • percpu: drop @unit_size from embed first chunk allocator · 788e5abc
      Authored by Tejun Heo
      The only extra feature @unit_size provides is making dead space at
      the end of the first chunk, which doesn't have any valid use case.
      Drop the parameter.  This will increase consistency with the
      generalized 4k allocator.
      
      James Bottomley spotted a missing conversion for the default
      setup_per_cpu_areas(), which caused build breakage on all archs
      that use it.
      
      [ Impact: drop unused code path ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      788e5abc
  15. 22 June 2009, 6 commits
    • x86: ensure percpu lpage doesn't consume too much vmalloc space · 0017c869
      Authored by Tejun Heo
      On extreme configurations (e.g. a 32bit 32-way NUMA machine), the
      lpage percpu first chunk allocator can consume too much vmalloc
      space.  Make it fall back to the 4k allocator if the consumption
      goes over 20%.
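
      A hedged sketch of the added sanity check; the variable names here
      are illustrative rather than the exact ones in the patch:

        size_t vm_size = VMALLOC_END - VMALLOC_START;

        /* don't let the lpage first chunk eat more than 20% of vmalloc space */
        if (pcpul_size * num_possible_cpus() > vm_size / 5) {
                pr_warning("PERCPU: too large lpage size, falling back to 4k\n");
                return -EINVAL;
        }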
      
      [ Impact: add sanity check for lpage percpu first chunk allocator ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      0017c869
    • x86: implement percpu_alloc kernel parameter · fa8a7094
      Authored by Tejun Heo
      According to Andi, it isn't clear whether lpage allocator is worth the
      trouble as there are many processors where PMD TLB is far scarcer than
      PTE TLB.  The advantage or disadvantage probably depends on the actual
      size of percpu area and specific processor.  As performance
      degradation due to TLB pressure tends to be highly workload specific
      and subtle, it is difficult to decide which way to go without more
      data.
      
      This patch implements percpu_alloc kernel parameter to allow selecting
      which first chunk allocator to use to ease debugging and testing.
      
      While at it, make sure all the failure paths report why something
      failed, to help determine why a certain allocator isn't working.
      Also, kill the "Great future plan" comment, which had already been
      realized quite some time ago.
      
      [ Impact: allow explicit percpu first chunk allocator selection ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      fa8a7094
    • x86: fix pageattr handling for lpage percpu allocator and re-enable it · e59a1bb2
      Authored by Tejun Heo
      The lpage allocator aliases a PMD page for each cpu and returns
      whatever is unused to the page allocator.  When the page attributes
      of the recycled pages are changed, the two aliases point to
      overlapping regions with different attributes, which isn't allowed
      and is known to cause subtle data corruption in certain cases.
      
      This can be handled in a similar manner to the x86_64 highmap
      alias: pageattr code should detect if the target pages have a PMD
      alias, then split the PMD alias and synchronize the attributes.
      
      pcpur allocator is updated to keep the allocated PMD pages map sorted
      in ascending address order and provide pcpu_lpage_remapped() function
      which binary searches the array to determine whether the given address
      is aliased and if so to which address.  pageattr is updated to use
      pcpu_lpage_remapped() to detect the PMD alias and split it up as
      necessary from cpa_process_alias().
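
      The query this adds for pageattr, roughly:

        /*
         * If @kaddr falls inside a PMD page recycled by the lpage
         * allocator, return the address it is aliased to; NULL otherwise.
         * Backed by a binary search over the sorted map described above.
         */
        void *pcpu_lpage_remapped(void *kaddr);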
      
      Jan Beulich spotted the original problem and incorrect usage of vaddr
      instead of laddr for lookup.
      
      With this, lpage percpu allocator should work correctly.  Re-enable
      it.
      
      [ Impact: fix subtle lpage pageattr bug and re-enable lpage ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      e59a1bb2
    • x86: prepare setup_pcpu_lpage() for pageattr fix · 0ff2587f
      Authored by Tejun Heo
      Make the following changes in preparation of coming pageattr updates.
      
      * Define and use array of struct pcpul_ent instead of array of
        pointers.  The only difference is ->cpu field which is set but
        unused yet.
      
      * Rename variables according to the above change.
      
      * Rename local variable vm to pcpul_vm and move it out of the
        function.
      
      [ Impact: no functional difference ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Beulich <JBeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      0ff2587f
    • x86: rename remap percpu first chunk allocator to lpage · 97c9bf06
      Authored by Tejun Heo
      The "remap" allocator remaps large pages to build the first chunk;
      however, the name isn't very good because 4k allocator remaps too and
      the whole point of the remap allocator is using large page mapping.
      The allocator will be generalized and exported outside of x86, rename
      it to lpage before that happens.
      
      percpu_alloc kernel parameter is updated to accept both "remap" and
      "lpage" for lpage allocator.
      
      [ Impact: code cleanup, kernel parameter argument updated ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      97c9bf06
    • x86: fix duplicate free in setup_pcpu_remap() failure path · c5806df9
      Authored by Tejun Heo
      In the failure path, setup_pcpu_remap() tries to free the area which
      has already been freed to make holes in the large page.  Fix it.
      
      [ Impact: fix duplicate free in failure path ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      c5806df9
  16. 25 May 2009, 1 commit
  17. 18 May 2009, 1 commit
    • x86: fix system without memory on node0 · 35d5a9a6
      Authored by Yinghai Lu
      Jack found a boot crash on a system which doesn't have memory on node0.
      
      It turns out that with the recent per_cpu changes, node_number for
      the BSP will always be 0, which is inconsistent with cpu_to_node(),
      which might already map it to a different (nearer) node.
      
      That is, numa_set_node() for node0 is called early, before the
      per_cpu area is set up.  Two places touch per_cpu(node_number):
      
      1. cpu/common.c::cpu_init(), where the update is skipped for the BP (cpu 0):
      | #ifdef CONFIG_NUMA
      |        if (cpu != 0 && percpu_read(node_number) == 0 &&
      |            cpu_to_node(cpu) != NUMA_NO_NODE)
      |                percpu_write(node_number, cpu_to_node(cpu));
      | #endif
      for BP: traps_init ==> cpu_init
      for AP: start_secondary ==> cpu_init
      
      2. cpu/intel.c or amd.c::srat_detect_node(), via numa_set_node():
      for BP: check_bugs ==> identify_boot_cpu ==> identify_cpu()
      	 which happens rather late, after numa_node_id() has already been
      	 used for the BP...
      for AP: start_secondary => smp_callin => smp_store_cpu_info() =>
      	=> identify_secondary_cpu => identify_cpu()
      
      So try to set it for the BP earlier, in setup_per_cpu_areas(), and
      don't bother to set it for the APs there (it will be updated, and
      used, later)
      
      (and don't mess with the 0 before the BP per_cpu data is copied to
      the APs)
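
      A condensed sketch of that early assignment in
      setup_per_cpu_areas(); the guards in the actual patch are a bit
      stricter:

        #if defined(CONFIG_X86_64) && defined(CONFIG_NUMA)
                /* make sure the boot cpu's node_number is right when
                 * the boot cpu sits on a node without memory */
                if (cpu == boot_cpu_id && early_cpu_to_node(cpu) != NUMA_NO_NODE)
                        per_cpu(node_number, cpu) = early_cpu_to_node(cpu);
        #endif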
      
      [ Impact: fix boot crash on memoryless node-0 ]
      Reported-and-tested-by: Jack Steiner <steiner@sgi.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4A0C4A02.7050401@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      35d5a9a6
  18. 02 April 2009, 2 commits
  19. 10 March 2009, 2 commits
    • percpu: generalize embedding first chunk setup helper · 66c3a757
      Authored by Tejun Heo
      Impact: code reorganization
      
      Separate out embedding first chunk setup helper from x86 embedding
      first chunk allocator and put it in mm/percpu.c.  This will be used by
      the default percpu first chunk allocator and possibly by other archs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      66c3a757
    • percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk() · 6074d5b0
      Authored by Tejun Heo
      Impact: cleanup, more flexibility for first chunk init
      
      Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto.
      This restriction stemmed from implementation detail and made things a
      bit less intuitive.  This patch allows @dyn_size to be specified
      regardless of @unit_size and swaps the positions of @dyn_size and
      @unit_size so that the parameter order makes more sense (static,
      reserved and dyn sizes followed by enclosing unit_size).
      
      While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      6074d5b0
  20. 06 March 2009, 2 commits
    • x86, percpu: setup reserved percpu area for x86_64 · 6b19b0c2
      Authored by Tejun Heo
      Impact: fix relocation overflow during module load
      
      x86_64 uses 32bit relocations for symbol access, and static percpu
      symbols, whether in the core kernel or in modules, must be within
      2GB of the percpu segment base, which the dynamic percpu allocator
      doesn't guarantee.  This patch makes x86_64 reserve
      PERCPU_MODULE_RESERVE bytes in the first chunk so that module percpu
      areas are always allocated from the first chunk, which is always
      inside the relocatable range.
      
      This problem exists for any percpu allocator but is easily triggered
      when using the embedding allocator because the second chunk is located
      beyond 2GB on it.
      
      This patch also changes the meaning of PERCPU_DYNAMIC_RESERVE so
      that it only indicates the size of the area to reserve for dynamic
      allocation, as the static and dynamic areas can be separate.
      PERCPU_DYNAMIC_RESERVE is increased by 4k for both 32 and 64bit, as
      the reserved area separation eats away some allocatable space, and
      having slightly more headroom (currently between 4 and 8k after a
      minimal boot sans the module area) makes sense for common-case
      performance.
      
      x86_32 can address anything from anywhere and doesn't need the
      reservation.
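
      The reservation boils down to something along these lines in
      arch/x86/kernel/setup_percpu.c:

        #ifdef CONFIG_X86_64
        #define PERCPU_FIRST_CHUNK_RESERVE      PERCPU_MODULE_RESERVE
        #else
        #define PERCPU_FIRST_CHUNK_RESERVE      0
        #endif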
      
      Mike Galbraith first reported the problem and bisected it to the
      embedding percpu allocator commit.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
      6b19b0c2
    • percpu, module: implement reserved allocation and use it for module percpu variables · edcb4639
      Authored by Tejun Heo
      Impact: add reserved allocation functionality and use it for module
      	percpu variables
      
      This patch implements reserved allocation from the first chunk.
      When setting up the first chunk, the arch can ask to set aside a
      certain number of bytes right after the core static area, which is
      available only through a separate reserved allocator.  This will be
      used primarily for module static percpu variables on architectures
      with a limited relocation range, to ensure that module percpu
      symbols are inside the relocatable range.
      
      If reserved area is requested, the first chunk becomes reserved and
      isn't available for regular allocation.  If the first chunk also
      includes piggy-back dynamic allocation area, a separate chunk mapping
      the same region is created to serve dynamic allocation.  The first one
      is called static first chunk and the second dynamic first chunk.
      Although they share the page map, their different area map
      initializations guarantee they serve disjoint areas according to their
      purposes.
      
      If arch doesn't setup reserved area, reserved allocation is handled
      like any other allocation.
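
      A sketch of the resulting split, assuming the new entry point is
      named as in mainline:

        /* regular dynamic allocation, served from dynamic chunks */
        void *__alloc_percpu(size_t size, size_t align);

        /* allocation from the reserved part of the first chunk; used by
         * kernel/module.c for module static percpu variables */
        void *__alloc_reserved_percpu(size_t size, size_t align);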
      Signed-off-by: Tejun Heo <tj@kernel.org>
      edcb4639