1. 23 Sep, 2010 1 commit
  2. 13 Aug, 2010 1 commit
  3. 10 Aug, 2010 1 commit
    • kmap_atomic: make kunmap_atomic() harder to misuse · 597781f3
      Committed by Cesar Eduardo Barros
      kunmap_atomic() is currently at level -4 on Rusty's "Hard To Misuse"
      list[1] ("Follow common convention and you'll get it wrong"), except in
      some architectures when CONFIG_DEBUG_HIGHMEM is set[2][3].
      
      kunmap() takes a pointer to a struct page; kunmap_atomic(), however,
      takes a pointer to within the page itself.  This occasionally trips
      people up (the convention they are following is the one from
      kunmap()).
      
      Make it much harder to misuse, by moving it to level 9 on Rusty's list[4]
      ("The compiler/linker won't let you get it wrong").  This is done by
      refusing to build if the type of its first argument is a pointer to a
      struct page.
      
      The real kunmap_atomic() is renamed to kunmap_atomic_notypecheck()
      (which is what you would call in case for some strange reason calling it
      with a pointer to a struct page is not incorrect in your code).
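      
      A minimal userspace sketch of this kind of compile-time type check
      follows. It is not the kernel's macro: unmap_checked() and
      unmap_notypecheck() are hypothetical stand-ins, struct page is a dummy
      type, and the check relies on the GCC/Clang extensions
      __builtin_types_compatible_p and __typeof__ together with C11
      _Static_assert.

```c
#include <assert.h>

/* Stand-in for the kernel's struct page; only its type identity matters. */
struct page { int dummy; };

/* Counts calls so the sketch has an observable effect. */
static int unmap_calls;

/* Hypothetical stand-in for kunmap_atomic_notypecheck(). */
static void unmap_notypecheck(void *addr)
{
    (void)addr;
    unmap_calls++;
}

/*
 * Checked wrapper: the build fails if the argument has type struct page *.
 * __builtin_types_compatible_p and __typeof__ are GCC/Clang extensions.
 */
#define unmap_checked(addr) do {                                        \
    _Static_assert(!__builtin_types_compatible_p(__typeof__(addr),      \
                                                 struct page *),        \
                   "pass the mapped address, not the struct page");     \
    unmap_notypecheck(addr);                                            \
} while (0)
```

      Passing a void * or char * compiles as before, while passing a
      struct page * stops the build at the static assertion.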
      
      The previous version of this patch was compile tested on x86-64.
      
      [1] http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
      [2] In these cases, it is at level 5, "Do it right or it will always
          break at runtime."
      [3] At least mips and powerpc look very similar, and sparc also seems to
          share a common ancestor with both; there seems to be quite some
          degree of copy-and-paste coding here. The include/asm/highmem.h files
          for these three archs mention x86 CPUs at their top.
      [4] http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html
      [5] As an aside, could someone tell me why mn10300 uses unsigned long as
          the first parameter of kunmap_atomic() instead of void *?
      Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net>
      Cc: Russell King <linux@arm.linux.org.uk> (arch/arm)
      Cc: Ralf Baechle <ralf@linux-mips.org> (arch/mips)
      Cc: David Howells <dhowells@redhat.com> (arch/frv, arch/mn10300)
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> (arch/mn10300)
      Cc: Kyle McMartin <kyle@mcmartin.ca> (arch/parisc)
      Cc: Helge Deller <deller@gmx.de> (arch/parisc)
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> (arch/parisc)
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> (arch/powerpc)
      Cc: Paul Mackerras <paulus@samba.org> (arch/powerpc)
      Cc: "David S. Miller" <davem@davemloft.net> (arch/sparc)
      Cc: Thomas Gleixner <tglx@linutronix.de> (arch/x86)
      Cc: Ingo Molnar <mingo@redhat.com> (arch/x86)
      Cc: "H. Peter Anvin" <hpa@zytor.com> (arch/x86)
      Cc: Arnd Bergmann <arnd@arndb.de> (include/asm-generic)
      Cc: Rusty Russell <rusty@rustcorp.com.au> ("Hard To Misuse" list)
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 02 Aug, 2010 1 commit
  5. 29 Jul, 2010 1 commit
  6. 22 Jul, 2010 1 commit
  7. 21 Jul, 2010 2 commits
  8. 19 Jul, 2010 1 commit
  9. 10 Jul, 2010 2 commits
  10. 05 Jul, 2010 1 commit
    • rbtree: Undo augmented trees performance damage and regression · b945d6b2
      Committed by Peter Zijlstra
      Reimplement augmented RB-trees without sprinkling extra branches
      all over the RB-tree code (which lives in the scheduler hot path).
      
      This approach is 'borrowed' from Fabio's BFQ implementation and
      relies on traversing the rebalance path after the RB-tree-op to
      correct the heap property for insertion/removal and make up for
      the damage done by the tree rotations.
      
      For insertion the rebalance path is trivially that from the new
      node upwards to the root; for removal it starts at the deepest
      node on the to-be-removed node's path that will still be around
      after the removal.
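      
      The idea of repairing the augmented value by walking the path to the
      root after the tree operation can be sketched as below. This is an
      illustration only, not the kernel's rbtree code: struct node,
      compute_max_end() and fix_path_to_root() are hypothetical names, and
      the node carries plain parent/child pointers with no coloring or
      rotations.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified tree node; the real struct rb_node packs parent and color. */
struct node {
    struct node *parent, *left, *right;
    unsigned long end;             /* end of this node's own range */
    unsigned long subtree_max_end; /* largest 'end' in this subtree */
};

/* Recompute one node's augmented value from itself and its children. */
static unsigned long compute_max_end(const struct node *n)
{
    unsigned long max = n->end;

    if (n->left && n->left->subtree_max_end > max)
        max = n->left->subtree_max_end;
    if (n->right && n->right->subtree_max_end > max)
        max = n->right->subtree_max_end;
    return max;
}

/* Walk from 'n' up to the root, repairing the augmented value on the way. */
static void fix_path_to_root(struct node *n)
{
    for (; n != NULL; n = n->parent)
        n->subtree_max_end = compute_max_end(n);
}
```

      After an insertion the walk would start at the new node; after a
      removal it would start at the deepest surviving node on the affected
      path.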
      
      [ This patch also fixes a video driver regression reported by
        Ali Gholami Rudi - the memtype->subtree_max_end was updated
        incorrectly. ]
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Ali Gholami Rudi <ali@rudi.ir>
      Cc: Fabio Checconi <fabio@gandalf.sssup.it>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <1275414172.27810.27961.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 18 Jun, 2010 1 commit
  12. 12 Jun, 2010 1 commit
  13. 31 May, 2010 2 commits
  14. 28 May, 2010 2 commits
  15. 27 May, 2010 1 commit
    • x86, pat: Fix memory leak in free_memtype · 20413f27
      Committed by Xiaotian Feng
      reserve_memtype() allocates memory for a new memtype, but
      free_memtype() does not free that memory after the memtype is
      erased from the rbtree.
      
      Changes since V1:
      	make rbt_memtype_erase return erased memtype so that
      	it can be freed in free_memtype.
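      
      The ownership pattern described here (the erase helper returns the
      unlinked entry so the caller can free it) can be sketched in userspace
      as follows. The names reserve_range(), erase_range() and free_range()
      are hypothetical, and a singly linked list stands in for the rbtree.

```c
#include <assert.h>
#include <stdlib.h>

struct entry {
    unsigned long start, end;
    struct entry *next;
};

static struct entry *entries; /* list head; the kernel uses an rbtree */

static int reserve_range(unsigned long start, unsigned long end)
{
    struct entry *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->start = start;
    e->end = end;
    e->next = entries;
    entries = e;
    return 0;
}

/* Unlink the matching entry and hand it back so the caller can free it. */
static struct entry *erase_range(unsigned long start, unsigned long end)
{
    struct entry **link;

    for (link = &entries; *link; link = &(*link)->next) {
        struct entry *e = *link;

        if (e->start == start && e->end == end) {
            *link = e->next;
            return e;
        }
    }
    return NULL;
}

static int free_range(unsigned long start, unsigned long end)
{
    struct entry *e = erase_range(start, end);

    if (!e)
        return -1;
    free(e); /* without this, every reserve/free cycle leaks one entry */
    return 0;
}
```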
      
      [ hpa: not for -stable: 2.6.34 and earlier not affected ]
      Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
      LKML-Reference: <1274838670-8731-1-git-send-email-dfeng@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  16. 25 May, 2010 1 commit
  17. 23 May, 2010 1 commit
  18. 06 May, 2010 1 commit
    • x86: Fix fake apicid to node mapping for numa emulation · b0c4d952
      Committed by David Rientjes
      With NUMA emulation, it's possible for a single cpu to be bound
      to multiple nodes since more than one may have affinity if
      allocated on a physical node that is local to the cpu.
      
      APIC ids must therefore be mapped to the lowest node ids to
      maintain generic kernel use of functions such as cpu_to_node()
      that determine device affinity.  For example, if a device has
      proximity to physical node 1 and a cpu happens to be mapped to
      a higher emulated node id 8, the proximity may not be correctly
      determined by comparison in generic code even though the cpu
      may be truly local and allocated on physical node 1.
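      
      A small sketch of "map each APIC id to the lowest node id it has
      affinity to" is shown below. The table size and the helper names
      apicid_map_init() and apicid_map_add() are assumptions made for
      illustration; the kernel's own emulation code is structured
      differently.

```c
#include <assert.h>

#define MAX_APICID 8   /* illustrative table size */
#define NO_NODE   (-1)

static int apicid_to_node[MAX_APICID];

static void apicid_map_init(void)
{
    for (int i = 0; i < MAX_APICID; i++)
        apicid_to_node[i] = NO_NODE;
}

/* Record affinity of 'apicid' to emulated node 'node'; keep the lowest id. */
static void apicid_map_add(int apicid, int node)
{
    if (apicid_to_node[apicid] == NO_NODE || node < apicid_to_node[apicid])
        apicid_to_node[apicid] = node;
}
```

      With this rule, a cpu seen on emulated nodes 8, 1 and 5 ends up
      mapped to node 1 regardless of the order in which the nodes are
      visited.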
      
      When this happens, the true topology of the machine isn't
      accurately represented in the emulated environment; although
      this isn't critical to the system's uptime, any generic code
      that is NUMA aware benefits from the physical topology being
      accurately represented.
      
      This can affect any system that maps multiple APIC ids to a
      single node and is booted with numa=fake=N where N is greater
      than the number of physical nodes.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <alpine.DEB.2.00.1005060224140.19473@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  19. 03 May, 2010 1 commit
    • x86: Fix parse_reservetop() build failure on certain configs · 56f0e74c
      Committed by Ingo Molnar
      Commit e67a807f ("x86: Fix 'reservetop=' functionality") added a
      fixup_early_ioremap() call to parse_reservetop() and declared it
      in io.h.
      
      But asm/io.h was only included indirectly - and on some configs
      not at all, causing a build failure on those configs.
      
      Cc: Liang Li <liang.li@windriver.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Wang Chen <wangchen@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1272621711-8683-1-git-send-email-liang.li@windriver.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  20. 30 Apr, 2010 1 commit
    • x86: Fix 'reservetop=' functionality · e67a807f
      Committed by Liang Li
      When specifying the 'reservetop=0xbadc0de' kernel parameter,
      the kernel will stop booting due to an early_ioremap bug that
      relates to commit 8827247f.
      
      The root cause of the boot failure is that the value of
      'slot_virt[i]' is initialized in setup_arch->early_ioremap_init(),
      but later in setup_arch the function 'parse_early_param' modifies
      'FIXADDR_TOP' when 'reservetop=0xbadc0de' is specified.
      
      The simplest fix might be to use __fix_to_virt(idx0) to get the
      updated value of 'FIXADDR_TOP' in '__early_ioremap' instead of
      referencing the old value from slot_virt[slot] directly.
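      
      The stale-cache problem can be modeled in a few lines. This is a
      sketch under assumptions: the addresses, NR_SLOTS and the helpers
      set_fixaddr_top() and slots_init() are made up for illustration, and
      fix_to_virt() only mirrors the general shape "top minus index times
      page size".

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NR_SLOTS   4 /* illustrative */

static uint64_t fixaddr_top = 0xfffff000ULL; /* illustrative value */
static uint64_t slot_virt[NR_SLOTS];         /* cached slot addresses */

/* Fixmap addresses grow downwards from the current top. */
static uint64_t fix_to_virt(unsigned int idx)
{
    return fixaddr_top - ((uint64_t)idx << PAGE_SHIFT);
}

/* Cache the slot addresses, as an early init step would. */
static void slots_init(void)
{
    for (unsigned int i = 0; i < NR_SLOTS; i++)
        slot_virt[i] = fix_to_virt(i);
}

/* Models a boot parameter moving the top after the cache was filled. */
static void set_fixaddr_top(uint64_t top)
{
    fixaddr_top = top;
}
```

      Once the top moves, slot_virt[] no longer agrees with fix_to_virt()
      until slots_init() is run again, which is the inconsistency the
      message describes.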
      
      Changelog since v0:
      
      -v1: When reservetop is handled, FIXADDR_TOP gets adjusted.
           Hence check prev_map, then re-initialize slot_virt and
           the PMD based on the new FIXADDR_TOP.
      
      -v2: place fixup_early_ioremap, and hence the call to
           early_ioremap_init, in reserve_top_address to re-initialize
           slot_virt and the corresponding PMD when parse_reservetop
           runs
      
      -v3: move fixup_early_ioremap out of reserve_top_address to make
           sure other clients of reserve_top_address like xen/lguest
           won't be broken
      Signed-off-by: Liang Li <liang.li@windriver.com>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Cc: Wang Chen <wangchen@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1272621711-8683-1-git-send-email-liang.li@windriver.com>
      [ fixed three small cleanliness details in fixup_early_ioremap() ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  21. 29 Apr, 2010 1 commit
    • x86-64: Combine SRAT regions when possible · 2e618786
      Committed by Jan Beulich
      ... i.e. when the hole between two regions isn't occupied by memory on
      another node. This reduces the memory->node table size, thus reducing
      cache footprint of lookups, which got increased significantly some
      time ago, and things go back to how they were before that change on
      the systems I looked at.
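      
      One way to picture the merge rule is the sketch below. It assumes the
      regions are sorted by start address and do not overlap, so two regions
      of the same node can be combined exactly when they are neighbours in
      the array (no other node's memory sits in the hole). struct region and
      combine_regions() are hypothetical names, not the kernel's SRAT code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region {
    uint64_t start, end; /* [start, end) */
    int node;
};

/*
 * Merge neighbouring regions that belong to the same node, swallowing the
 * hole between them. Input must be sorted by start and non-overlapping.
 * Returns the new number of regions.
 */
static size_t combine_regions(struct region *r, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].node == r[out].node)
            r[out].end = r[i].end; /* same node on both sides of the hole */
        else
            r[++out] = r[i];       /* another node intervenes: keep split */
    }
    return out + 1;
}
```

      Fewer table entries mean fewer distinct boundaries to consult when
      translating an address to a node.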
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      LKML-Reference: <4BCF3230020000780003B3CA@vpn.id2.novell.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  22. 24 Apr, 2010 1 commit
    • x86, pat: Update the page flags for memtype atomically instead of using memtype_lock · 1f9cc3cb
      Committed by Robin Holt
      While testing an application using the xpmem (out of kernel) driver, we
      noticed a significant page fault rate reduction of x86_64 with respect
      to ia64.  For one test running with 32 cpus, one thread per cpu, it
      took 01:08 for each of the threads to vm_insert_pfn 2GB worth of pages.
      For the same test running on 256 cpus, one thread per cpu, it took 14:48
      to vm_insert_pfn 2 GB worth of pages.
      
      The slowdown was tracked to lookup_memtype which acquires the
      spinlock memtype_lock.  This heavily contended lock was slowing down
      vm_insert_pfn().
      
      With the cmpxchg on page->flags method, both the 32 cpu and 256 cpu
      cases take approx 00:01.3 seconds to complete.
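      
      A lock-free update of a few bits inside a flags word can be sketched
      with C11 atomics as below. The mask value and the helper names
      set_memtype() and get_memtype() are assumptions for illustration; the
      kernel uses its own cmpxchg() primitive and its own bit layout in
      page->flags.

```c
#include <assert.h>
#include <stdatomic.h>

#define MEMTYPE_MASK 0x3UL /* hypothetical: two low bits hold the memtype */

/* Replace only the memtype bits, retrying if another thread raced us. */
static void set_memtype(_Atomic unsigned long *flags, unsigned long memtype)
{
    unsigned long old = atomic_load(flags);
    unsigned long val;

    do {
        val = (old & ~MEMTYPE_MASK) | (memtype & MEMTYPE_MASK);
        /* on failure 'old' is refreshed with the current value */
    } while (!atomic_compare_exchange_weak(flags, &old, val));
}

/* Readers need no lock at all: a single atomic load is enough. */
static unsigned long get_memtype(_Atomic unsigned long *flags)
{
    return atomic_load(flags) & MEMTYPE_MASK;
}
```

      Because the lookup side is a plain load, it no longer serializes
      on a shared spinlock as the cpu count grows.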
      Signed-off-by: Robin Holt <holt@sgi.com>
      LKML-Reference: <20100423153627.751194346@gulag1.americas.sgi.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@gmail.com>
      Cc: Rafael Wysocki <rjw@novell.com>
      Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  23. 06 Apr, 2010 1 commit
  24. 30 Mar, 2010 2 commits
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Committed by Tejun Heo
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the
      following script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
    • x86: Make sure free_init_pages() frees pages on page boundary · c967da6a
      Committed by Yinghai Lu
      When CONFIG_NO_BOOTMEM=y, it could use memory more efficiently, or
      in a more compact fashion.
      
      Example:
      
       Allocated new RAMDISK: 00ec2000 - 0248ce57
       Move RAMDISK from 000000002ea04000 - 000000002ffcee56 to 00ec2000 - 0248ce56
      
      The new RAMDISK's end is not page aligned.
      Last page could be shared with other users.
      
      When free_init_pages() is called for initrd or .init, the page
      could be freed and we could corrupt other data.
      
      code segment in free_init_pages():
      
       |        for (; addr < end; addr += PAGE_SIZE) {
       |                ClearPageReserved(virt_to_page(addr));
       |                init_page_count(virt_to_page(addr));
       |                memset((void *)(addr & ~(PAGE_SIZE-1)),
       |                        POISON_FREE_INITMEM, PAGE_SIZE);
       |                free_page(addr);
       |                totalram_pages++;
       |        }
      
      last half page could be used as one whole free page.
      
      So page align the boundaries.
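      
      The alignment arithmetic can be checked against the RAMDISK range
      from the example above. This is a conservative sketch, not the
      patch itself: it assumes 4 KiB pages, borrows the PAGE_ALIGN_UP and
      PAGE_ALIGN_DOWN names floated later in this message, and uses a
      hypothetical helper whole_pages() that counts only pages lying
      entirely inside the range.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL /* assumed 4 KiB pages */
#define PAGE_ALIGN_UP(x)   (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define PAGE_ALIGN_DOWN(x) ((x) & ~(PAGE_SIZE - 1))

/* Number of whole pages lying entirely inside [begin, end). */
static uint64_t whole_pages(uint64_t begin, uint64_t end)
{
    uint64_t first = PAGE_ALIGN_UP(begin);
    uint64_t last = PAGE_ALIGN_DOWN(end);

    return last > first ? (last - first) / PAGE_SIZE : 0;
}
```

      For the range 00ec2000 - 0248ce57, the end rounds down to 0248c000
      and up to 0248d000; the page in between is the partially used one
      that must not be handed back blindly.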
      
      -v2: make the original initramdisk to be aligned, according to
           Johannes, otherwise we have the chance to lose one page.
           we still need to keep initrd_end not aligned, otherwise it could
           confuse decompressor.
      -v3: change to WARN_ON instead, suggested by Johannes.
      -v4: use PAGE_ALIGN, suggested by Johannes.
           We may fix that macro name later to PAGE_ALIGN_UP, and PAGE_ALIGN_DOWN
           Add comments about assuming ramdisk start is aligned
           in relocate_initrd(), change to re get ramdisk_image instead of save it
           to make diff smaller. Add warning for wrong range, suggested by Johannes.
      -v6: remove one WARN()
           We need to align beginning in free_init_pages()
           do not copy more than ramdisk_size, noticed by Johannes
      Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Tested-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <1269830604-26214-3-git-send-email-yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  25. 02 Mar, 2010 1 commit
  26. 26 Feb, 2010 1 commit
  27. 25 Feb, 2010 1 commit
    • x86, mm: Allow highmem user page tables to be disabled at boot time · 14315592
      Committed by Ian Campbell
      Distros generally (I looked at Debian, RHEL5 and SLES11) seem to
      enable CONFIG_HIGHPTE for any x86 configuration which has highmem
      enabled. This means that the overhead applies even to machines which
      have a fairly modest amount of high memory and which therefore do not
      really benefit from allocating PTEs in high memory but still pay the
      price of the additional mapping operations.
      
      Running kernbench on a 4G box I found that with CONFIG_HIGHPTE=y but
      no actual highptes being allocated there was a reduction in system
      time used from 59.737s to 55.9s.
      
      With CONFIG_HIGHPTE=y and highmem PTEs being allocated:
        Average Optimal load -j 4 Run (std deviation):
        Elapsed Time 175.396 (0.238914)
        User Time 515.983 (5.85019)
        System Time 59.737 (1.26727)
        Percent CPU 263.8 (71.6796)
        Context Switches 39989.7 (4672.64)
        Sleeps 42617.7 (246.307)
      
      With CONFIG_HIGHPTE=y but with no highmem PTEs being allocated:
        Average Optimal load -j 4 Run (std deviation):
        Elapsed Time 174.278 (0.831968)
        User Time 515.659 (6.07012)
        System Time 55.9 (1.07799)
        Percent CPU 263.8 (71.266)
        Context Switches 39929.6 (4485.13)
        Sleeps 42583.7 (373.039)
      
      This patch allows the user to control the allocation of PTEs in
      highmem from the command line ("userpte=nohigh") but retains the
      status-quo as the default.
      
      It is possible that some simple heuristic could be developed which
      allows auto-tuning of this option however I don't have a sufficiently
      large machine available to me to perform any particularly meaningful
      experiments. We could probably handwave up an argument for a threshold
      at 16G of total RAM.
      
      Assuming 768M of lowmem we have 196608 potential lowmem PTE
      pages. Each page can map 2M of RAM in a PAE-enabled configuration,
      meaning a maximum of 384G of RAM could potentially be mapped using
      lowmem PTEs.
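      
      The arithmetic in this estimate can be reproduced directly. The
      sketch assumes 4 KiB pages and 512 eight-byte entries per PTE page
      (the PAE layout); the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE     4096ULL /* 4 KiB pages */
#define PTES_PER_PAGE 512ULL  /* PAE: 8-byte entries, 4096 / 8 */

/* How many PTE pages fit into the given amount of lowmem. */
static uint64_t lowmem_pte_pages(uint64_t lowmem_bytes)
{
    return lowmem_bytes / PAGE_SIZE;
}

/* How much RAM that many PTE pages can map (one PTE page maps 2M). */
static uint64_t mappable_bytes(uint64_t pte_pages)
{
    return pte_pages * PTES_PER_PAGE * PAGE_SIZE;
}
```

      768M / 4K gives 196608 pages, and 196608 x 2M gives 384G, matching
      the figures quoted above.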
      
      Even allowing a generous factor of 10 to account for other required
      lowmem allocations, generous slop to account for page sharing (which
      reduces the total amount of RAM mappable by a given number of PT
      pages) and other inaccuracies in the estimations it would seem that
      even a 32G machine would not have a particularly pressing need for
      highmem PTEs. I think 32G could be considered to be at the upper bound
      of what might be sensible on a 32 bit machine (although I think in
      practice 64G is still supported).
      
      It seems questionable if HIGHPTE is even a win for any amount of RAM
      you would sensibly run a 32 bit kernel on rather than going 64 bit.
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      LKML-Reference: <1266403090-20162-1-git-send-email-ian.campbell@citrix.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  28. 23 Feb, 2010 1 commit
    • x86_64, cpa: Don't work hard in preserving kernel 2M mappings when using 4K already · 281ff33b
      Committed by Suresh Siddha
      We currently enforce the !RW mapping for the kernel mapping that maps
      holes between different text, rodata and data sections. However, kernel
      identity mappings will have different RWX permissions to the pages mapping to
      text and to the pages padding (which are freed) the text, rodata sections.
      Hence kernel identity mappings will be broken to smaller pages. For 64-bit,
      kernel text and kernel identity mappings are different, so we can enable
      protection checks that come with CONFIG_DEBUG_RODATA, as well as retain 2MB
      large page mappings for kernel text.
      
      Konrad reported a boot failure with the Linux Xen paravirt guest because of
      this. In this paravirt guest case, the kernel text mapping and the kernel
      identity mapping share the same page-table pages. Thus forcing the !RW mapping
      for some of the kernel mappings also causes the kernel identity mappings to be
      read-only, resulting in the boot failure. The Linux Xen paravirt guest also
      uses 4k mappings and doesn't use 2M mappings.
      
      Fix this issue and retain large page performance advantage for native kernels
      by not working hard and not enforcing !RW for the kernel text mapping,
      if the current mapping is already using small page mapping.
      Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <1266522700.2909.34.camel@sbs-t61.sc.intel.com>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: stable@kernel.org	[2.6.32, 2.6.33]
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  29. 19 Feb, 2010 2 commits
  30. 18 Feb, 2010 2 commits
  31. 16 Feb, 2010 3 commits
    • x86, numa: Remove configurable node size support for numa emulation · ca2107c9
      Committed by David Rientjes
      Now that numa=fake=<size>[MG] is implemented, it is possible to remove
      configurable node size support.  The command-line parsing was already
      broken (numa=fake=*128, for example, would not work) and since fake nodes
      are now interleaved over physical nodes, this support is no longer
      required.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151343080.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86, numa: Add fixed node size option for numa emulation · 8df5bb34
      Committed by David Rientjes
      numa=fake=N specifies the number of fake nodes, N, to partition the
      system into and then allocates them by interleaving over physical nodes.
      This requires knowledge of the system capacity when attempting to
      allocate nodes of a certain size: either very large nodes to benchmark
      scalability of code that operates on individual nodes, or very small
      nodes to find bugs in the VM.
      
      This patch introduces numa=fake=<size>[MG] so it is possible to specify
      the size of each node to allocate.  When used, nodes of the size
      specified will be allocated and interleaved over the set of physical
      nodes.
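      
      The two forms of the option (a plain node count, or a size with an M
      or G suffix) could be told apart roughly as in the sketch below. It
      is a userspace approximation built on strtoull(); struct fake_spec
      and parse_fake() are hypothetical, and the kernel's own parsing
      helpers accept more forms than this.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct fake_spec {
    int by_size;    /* 0: value is a node count, 1: value is bytes per node */
    uint64_t value;
};

/* Parse "N", "<size>M" or "<size>G". Returns 0 on success, -1 on error. */
static int parse_fake(const char *arg, struct fake_spec *out)
{
    char *end;
    unsigned long long v = strtoull(arg, &end, 10);

    if (end == arg)
        return -1; /* no digits at all, e.g. "*128" */

    switch (*end) {
    case 'M': case 'm':
        out->by_size = 1;
        out->value = (uint64_t)v << 20;
        end++;
        break;
    case 'G': case 'g':
        out->by_size = 1;
        out->value = (uint64_t)v << 30;
        end++;
        break;
    default:
        out->by_size = 0;
        out->value = v;
        break;
    }
    return *end == '\0' ? 0 : -1; /* reject trailing junk */
}
```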
      
      FAKE_NODE_MIN_SIZE was also moved to the more-appropriate
      include/asm/numa_64.h.
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151342510.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • x86, numa: Fix numa emulation calculation of big nodes · 68fd111e
      Committed by David Rientjes
      numa=fake=N uses split_nodes_interleave() to partition the system into N
      fake nodes.  Each node size must be a multiple of
      FAKE_NODE_MIN_SIZE, otherwise it is possible to get strange alignments.
      Because of this, the remaining memory from each node when rounded to
      FAKE_NODE_MIN_SIZE is consolidated into a number of "big nodes" that are
      bigger than the rest.
      
      The calculation of the number of big nodes is incorrect since it is using
      a logical AND operator when it should be multiplying the rounded-off
      portion of each node with N.
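      
      A worked version of the calculation, under stated assumptions: the
      64 MiB value for FAKE_NODE_MIN_SIZE is chosen for illustration, the
      function names are hypothetical, and the buggy variant uses a
      bitwise & as one reading of the AND this message describes.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_NODE_MIN_SIZE (64ULL << 20) /* assumed value for illustration */

/* Bytes rounded off each node when its size is cut to a multiple of MIN. */
static uint64_t rounded_off(uint64_t total, unsigned int nr_nodes)
{
    return (total / nr_nodes) % FAKE_NODE_MIN_SIZE;
}

/* Corrected: pool the leftovers of all nodes, count the MIN-sized chunks. */
static uint64_t big_nodes(uint64_t total, unsigned int nr_nodes)
{
    return rounded_off(total, nr_nodes) * nr_nodes / FAKE_NODE_MIN_SIZE;
}

/* Buggy reading: AND with the node count instead of multiplying by it. */
static uint64_t big_nodes_buggy(uint64_t total, unsigned int nr_nodes)
{
    return (rounded_off(total, nr_nodes) & nr_nodes) / FAKE_NODE_MIN_SIZE;
}
```

      With 1120 MiB split into 5 nodes, each node is 224 MiB, 32 MiB is
      rounded off per node, and the pooled 160 MiB yields 2 big nodes; the
      AND variant yields 0.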
      Signed-off-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1002151342230.26927@chino.kir.corp.google.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>