1. 04 7月, 2013 2 次提交
    • C
      mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high · c8e251fa
      Cody P Schafer 提交于
      Because we are going to rely upon a careful transision between old and new
      ->high and ->batch values using memory barriers and will remove
      stop_machine(), we need to prevent multiple updaters from interweaving
      their memory writes.
      
      Add a simple mutex to protect both update loops.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8e251fa
    • C
      mm/page_alloc: factor out setting of pcp->high and pcp->batch · 4008bab7
      Cody P Schafer 提交于
      "Problems" with the current code:
      
      1: there is a lack of synchronization in setting ->high and ->batch in
         percpu_pagelist_fraction_sysctl_handler()
      
      2: stop_machine() in zone_pcp_update() is unnecissary.
      
      3: zone_pcp_update() does not consider the case where
         percpu_pagelist_fraction is non-zero
      
      To fix:
      
      1: add memory barriers, a safe ->batch value, an update side mutex when
         updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
         that expect a stable value.
      
      2: avoid draining pages in zone_pcp_update(), rely upon the memory
         barriers added to fix #1
      
      3: factor out quite a few functions, and then call the appropriate one.
      
      Note that it results in a change to the behavior of zone_pcp_update(),
      which is used by memory_hotplug.  I'm rather certain that I've diserned
      (and preserved) the essential behavior (changing ->high and ->batch), and
      only eliminated unneeded actions (draining the per cpu pages), but this
      may not be the case.
      
      Further note that the draining of pages that previously took place in
      zone_pcp_update() occured after repeated draining when attempting to
      offline a page, and after the offline has "succeeded".  It appears that
      the draining was added to zone_pcp_update() to avoid refactoring
      setup_pageset() into 2 funtions.
      
      This patch:
      
      Creates pageset_set_batch() for use in setup_pageset().
      pageset_set_batch() imitates the functionality of
      setup_pagelist_highmark(), but uses the boot time
      (percpu_pagelist_fraction == 0) calculations for determining ->high based
      on ->batch.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4008bab7
  2. 13 6月, 2013 1 次提交
    • T
      mm/page_alloc.c: fix watermark check in __zone_watermark_ok() · 026b0814
      Tomasz Stanislawski 提交于
      The watermark check consists of two sub-checks.  The first one is:
      
      	if (free_pages <= min + lowmem_reserve)
      		return false;
      
      The check assures that there is minimal amount of RAM in the zone.  If
      CMA is used then the free_pages is reduced by the number of free pages
      in CMA prior to the over-mentioned check.
      
      	if (!(alloc_flags & ALLOC_CMA))
      		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
      
      This prevents the zone from being drained from pages available for
      non-movable allocations.
      
      The second check prevents the zone from getting too fragmented.
      
      	for (o = 0; o < order; o++) {
      		free_pages -= z->free_area[o].nr_free << o;
      		min >>= 1;
      		if (free_pages <= min)
      			return false;
      	}
      
      The field z->free_area[o].nr_free is equal to the number of free pages
      including free CMA pages.  Therefore the CMA pages are subtracted twice.
      This may cause a false positive fail of __zone_watermark_ok() if the CMA
      area gets strongly fragmented.  In such a case there are many 0-order
      free pages located in CMA.  Those pages are subtracted twice therefore
      they will quickly drain free_pages during the check against
      fragmentation.  The test fails even though there are many free non-cma
      pages in the zone.
      
      This patch fixes this issue by subtracting CMA pages only for a purpose of
      (free_pages <= min + lowmem_reserve) check.
      
      Laura said:
      
        We were observing allocation failures of higher order pages (order 5 =
        128K typically) under tight memory conditions resulting in driver
        failure.  The output from the page allocation failure showed plenty of
        free pages of the appropriate order/type/zone and mostly CMA pages in
        the lower orders.
      
        For full disclosure, we still observed some page allocation failures
        even after applying the patch but the number was drastically reduced and
        those failures were attributed to fragmentation/other system issues.
      Signed-off-by: NTomasz Stanislawski <t.stanislaws@samsung.com>
      Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
      Tested-by: NLaura Abbott <lauraa@codeaurora.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      026b0814
  3. 22 5月, 2013 1 次提交
    • R
      mm: Fix virt_to_page() warning · bb3ec6b0
      Ralf Baechle 提交于
      virt_to_page() is typically implemented as a macro containing a cast so
      that it will accept both pointers and unsigned long without causing a
      warning.
      
      But MIPS virt_to_page() uses virt_to_phys which is a function so passing
      an unsigned long will cause a warning:
      
          CC      mm/page_alloc.o
        mm/page_alloc.c: In function ‘free_reserved_area’:
        mm/page_alloc.c:5161:3: warning: passing argument 1 of ‘virt_to_phys’ makes pointer from integer without a cast [enabled by default]
        arch/mips/include/asm/io.h:119:100: note: expected ‘const volatile void *’ but argument is of type ‘long unsigned int’
      
      All others users of virt_to_page() in mm/ are passing a void *.
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      Reported-by: NEunbong Song <eunb.song@samsung.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb3ec6b0
  4. 30 4月, 2013 7 次提交
    • C
      page_alloc: make setup_nr_node_ids() usable for arch init code · f9872caf
      Cody P Schafer 提交于
      powerpc and x86 were opencoding copies of setup_nr_node_ids(), which
      page_alloc provides but makes static.  Make it avaliable to the archs in
      linux/mm.h.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9872caf
    • R
      mm: speedup in __early_pfn_to_nid · 7c243c71
      Russ Anderson 提交于
      When booting on a large memory system, the kernel spends considerable
      time in memmap_init_zone() setting up memory zones.  Analysis shows
      significant time spent in __early_pfn_to_nid().
      
      The routine memmap_init_zone() checks each PFN to verify the nid is
      valid.  __early_pfn_to_nid() sequentially scans the list of pfn ranges
      to find the right range and returns the nid.  This does not scale well.
      On a 4 TB (single rack) system there are 308 memory ranges to scan.  The
      higher the PFN the more time spent sequentially spinning through memory
      ranges.
      
      Since memmap_init_zone() increments pfn, it will almost always be
      looking for the same range as the previous pfn, so check that range
      first.  If it is in the same range, return that nid.  If not, scan the
      list as before.
      
      A 4 TB (single rack) UV1 system takes 512 seconds to get through the
      zone code.  This performance optimization reduces the time by 189
      seconds, a 36% improvement.
      
      A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
      a 112.9 second (53%) reduction.
      
      [akpm@linux-foundation.org: make the statics __meminitdata]
      [akpm@linux-foundation.org: fix comment formatting]
      [akpm@linux-foundation.org: fix ia64, per yinghai]
      [akpm@linux-foundation.org: add missing semicolon, per Tony]
      Signed-off-by: NRuss Anderson <rja@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Tested-by: N"Luck, Tony" <tony.luck@intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c243c71
    • M
      mm: page_alloc: avoid marking zones full prematurely after zone_reclaim() · fed2719e
      Mel Gorman 提交于
      The following problem was reported against a distribution kernel when
      zone_reclaim was enabled but the same problem applies to the mainline
      kernel.  The reproduction case was as follows
      
      1. Run numactl -m +0 dd if=largefile of=/dev/null
         This allocates a large number of clean pages in node 0
      
      2. numactl -N +0 memhog 0.5*Mg
         This start a memory-using application in node 0.
      
      The expected behaviour is that the clean pages get reclaimed and the
      application uses node 0 for its memory.  The observed behaviour was that
      the memory for the memhog application was allocated off-node since
      commits cd38b115 ("mm: page allocator: initialise ZLC for first zone
      eligible for zone_reclaim") and commit 76d3fbf8 ("mm: page
      allocator: reconsider zones for allocation after direct reclaim").
      
      The assumption of those patches was that it was always preferable to
      allocate quickly than stall for long periods of time and they were meant
      to take care that the zone was only marked full when necessary but an
      important case was missed.
      
      In the allocator fast path, only the low watermarks are checked.  If the
      zones free pages are between the low and min watermark then allocations
      from the allocators slow path will succeed.  However, zone_reclaim will
      only reclaim SWAP_CLUSTER_MAX or 1<<order pages.  There is no guarantee
      that this will meet the low watermark causing the zone to be marked full
      prematurely.
      
      This patch will only mark the zone full after zone_reclaim if it the min
      watermarks are checked or if page reclaim failed to make sufficient
      progress.
      
      [mhocko@suse.cz: fix alloc_flags test]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NHedi Berriche <hedi@sgi.com>
      Tested-by: NHedi Berriche <hedi@sgi.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fed2719e
    • D
      mm, hugetlb: include hugepages in meminfo · 949f7ec5
      David Rientjes 提交于
      Particularly in oom conditions, it's troublesome that hugetlb memory is
      not displayed.  All other meminfo that is emitted will not add up to
      what is expected, and there is no artifact left in the kernel log to
      show that a potentially significant amount of memory is actually
      allocated as hugepages which are not available to be reclaimed.
      
      Booting with hugepages=8192 on the command line, this memory is now
      shown in oom conditions.  For example, with echo m >
      /proc/sysrq-trigger:
      
        Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
        Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      949f7ec5
    • J
      mm: introduce free_highmem_page() helper to free highmem pages into buddy system · cfa11e08
      Jiang Liu 提交于
      The original goal of this patchset is to fix the bug reported by
      
        https://bugzilla.kernel.org/show_bug.cgi?id=53501
      
      Now it has also been expanded to reduce common code used by memory
      initializion.
      
      This is the second part, which applies to the previous part at:
        http://marc.info/?l=linux-mm&m=136289696323825&w=2
      
      It introduces a helper function free_highmem_page() to free highmem
      pages into the buddy system when initializing mm subsystem.
      Introduction of free_highmem_page() is one step forward to clean up
      accesses and modificaitons of totalhigh_pages, totalram_pages and
      zone->managed_pages etc. I hope we could remove all references to
      totalhigh_pages from the arch/ subdirectory.
      
      We have only tested these patchset on x86 platforms, and have done basic
      compliation tests using cross-compilers from ftp.kernel.org. That means
      some code may not pass compilation on some architectures. So any help
      to test this patchset are welcomed!
      
      There are several other parts still under development:
      Part3: refine code to manage totalram_pages, totalhigh_pages and
      	zone->managed_pages
      Part4: introduce helper functions to simplify mem_init() and remove the
      	global variable num_physpages.
      
      This patch:
      
      Introduce helper function free_highmem_page(), which will be used by
      architectures with HIGHMEM enabled to free highmem pages into the buddy
      system.
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Attilio Rao <attilio.rao@citrix.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Cong Wang <amwang@redhat.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cfa11e08
    • J
      mm: introduce common help functions to deal with reserved/managed pages · 69afade7
      Jiang Liu 提交于
      The original goal of this patchset is to fix the bug reported by
      
        https://bugzilla.kernel.org/show_bug.cgi?id=53501
      
      Now it has also been expanded to reduce common code used by memory
      initializion.
      
      This is the first part, which applies to v3.9-rc1.
      
      It introduces following common helper functions to simplify
      free_initmem() and free_initrd_mem() on different architectures:
      
      adjust_managed_page_count():
      	will be used to adjust totalram_pages, totalhigh_pages,
      	zone->managed_pages when reserving/unresering a page.
      
      __free_reserved_page():
      	free a reserved page into the buddy system without adjusting
      	page statistics info
      
      free_reserved_page():
      	free a reserved page into the buddy system and adjust page
      	statistics info
      
      mark_page_reserved():
      	mark a page as reserved and adjust page statistics info
      
      free_reserved_area():
      	free a continous ranges of pages by calling free_reserved_page()
      
      free_initmem_default():
      	default method to free __init pages.
      
      We have only tested these patchset on x86 platforms, and have done basic
      compliation tests using cross-compilers from ftp.kernel.org.  That means
      some code may not pass compilation on some architectures.  So any help to
      test this patchset are welcomed!
      
      There are several other parts still under development:
      Part2: introduce free_highmem_page() to simplify freeing highmem pages
      Part3: refine code to manage totalram_pages, totalhigh_pages and
      	zone->managed_pages
      Part4: introduce helper functions to simplify mem_init() and remove the
      	global variable num_physpages.
      
      This patch:
      
      Code to deal with reserved/managed pages are duplicated by many
      architectures, so introduce common help functions to reduce duplicated
      code.  These common help functions will also be used to concentrate code
      to modify totalram_pages and zone->managed_pages, which makes the code
      much more clear.
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Anatolij Gustschin <agust@denx.de>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69afade7
    • D
      mm, show_mem: suppress page counts in non-blockable contexts · 4b59e6c4
      David Rientjes 提交于
      On large systems with a lot of memory, walking all RAM to determine page
      types may take a half second or even more.
      
      In non-blockable contexts, the page allocator will emit a page allocation
      failure warning unless __GFP_NOWARN is specified.  In such contexts, irqs
      are typically disabled and such a lengthy delay may even result in NMI
      watchdog timeouts.
      
      To fix this, suppress the page walk in such contexts when printing the
      page allocation failure warning.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b59e6c4
  5. 29 3月, 2013 1 次提交
  6. 03 3月, 2013 1 次提交
    • Y
      x86, ACPI, mm: Revert movablemem_map support · 20e6926d
      Yinghai Lu 提交于
      Tim found:
      
        WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
        Hardware name: S2600CP
        sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
        smpboot: Booting Node   1, Processors  #1
        Modules linked in:
        Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
        Call Trace:
          set_cpu_sibling_map+0x279/0x449
          start_secondary+0x11d/0x1e5
      
      Don Morris reproduced on a HP z620 workstation, and bisected it to
      commit e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock
      is ready")
      
      It turns out movable_map has some problems, and it breaks several things
      
      1. numa_init is called several times, NOT just for srat. so those
      	nodes_clear(numa_nodes_parsed)
      	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
         can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
         and make fall back path working.
      
      2. simply split acpi_numa_init to early_parse_srat.
         a. that early_parse_srat is NOT called for ia64, so you break ia64.
         b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
      	     set_apicid_to_node(i, NUMA_NO_NODE)
           still left in numa_init. So it will just clear result from early_parse_srat.
           it should be moved before that....
         c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
             early before override from INITRD is settled.
      
      3. that patch TITLE is total misleading, there is NO x86 in the title,
         but it changes critical x86 code. It caused x86 guys did not
         pay attention to find the problem early. Those patches really should
         be routed via tip/x86/mm.
      
      4. after that commit, following range can not use movable ram:
        a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
        b. initrd... it will be freed after booting, so it could be on movable...
        c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
      	anymore.
        d. init_mem_mapping: can not put page table high anymore.
        e. initmem_init: vmemmap can not be high local node anymore. That is
           not good.
      
      If node is hotplugable, the mem related range like page table and
      vmemmap could be on the that node without problem and should be on that
      node.
      
      We have workaround patch that could fix some problems, but some can not
      be fixed.
      
      So just remove that offending commit and related ones including:
      
       f7210e6c ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
          protect movablecore_map in memblock_overlaps_region().")
      
       01a178a9 ("acpi, memory-hotplug: support getting hotplug info from
          SRAT")
      
       27168d38 ("acpi, memory-hotplug: extend movablemem_map ranges to
          the end of node")
      
       e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is
          ready")
      
       fb06bc8e ("page_alloc: bootmem limit with movablecore_map")
      
       42f47e27 ("page_alloc: make movablemem_map have higher priority")
      
       6981ec31 ("page_alloc: introduce zone_movable_limit[] to keep
          movable limit for nodes")
      
       34b71f1e ("page_alloc: add movable_memmap kernel parameter")
      
       4d59a751 ("x86: get pg_data_t's memory from other node")
      
      Later we should have patches that will make sure kernel put page table
      and vmemmap on local node ram instead of push them down to node0.  Also
      need to find way to put other kernel used ram to local node ram.
      Reported-by: NTim Gardner <tim.gardner@canonical.com>
      Reported-by: NDon Morris <don.morris@hp.com>
      Bisected-by: NDon Morris <don.morris@hp.com>
      Tested-by: NDon Morris <don.morris@hp.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20e6926d
  7. 24 2月, 2013 19 次提交
    • Z
      mm: accurately document nr_free_*_pages functions with code comments · e0fb5815
      Zhang Yanfei 提交于
      nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
      are horribly badly named, so accurately document them with code comments
      in case of the misuse of them.
      
      [akpm@linux-foundation.org: tweak comments]
      Reviewed-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0fb5815
    • Z
      mm: fix return type for functions nr_free_*_pages · ebec3862
      Zhang Yanfei 提交于
      Currently, the amount of RAM that functions nr_free_*_pages return is
      held in unsigned int.  But in machines with big memory (exceeding 16TB),
      the amount may be incorrect because of overflow, so fix it.
      Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebec3862
    • D
      mm: use NUMA_NO_NODE · 00ef2d2f
      David Rientjes 提交于
      Make a sweep through mm/ and convert code that uses -1 directly to using
      the more appropriate NUMA_NO_NODE.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00ef2d2f
    • C
      mm/page_alloc: add informative debugging message in page_outside_zone_boundaries() · b5e6a5a2
      Cody P Schafer 提交于
      Add a debug message which prints when a page is found outside of the
      boundaries of the zone it should belong to. Format is:
      	"page $pfn outside zone [ $start_pfn - $end_pfn ]"
      
      [akpm@linux-foundation.org: s/pr_debug/pr_err/]
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5e6a5a2
    • C
      mm/page_alloc: add a VM_BUG in __free_one_page() if the zone is uninitialized. · d29bb978
      Cody P Schafer 提交于
      Freeing pages to uninitialized zones is not handled by
      __free_one_page(), and should never happen when the code is correct.
      
      Ran into this while writing some code that dynamically onlines extra
      zones.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d29bb978
    • C
      mm: add & use zone_end_pfn() and zone_spans_pfn() · 108bcc96
      Cody P Schafer 提交于
      Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
      duplication.
      
      This also switches to using them in compaction (where an additional
      variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
      kmemleak.
      
      Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
      because I expect at some point the sycronization issues with start_pfn &
      spanned_pages will need fixing, either by actually using the seqlock or
      clever memory barrier usage.
      Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      108bcc96
    • H
      mm: remove offlining arg to migrate_pages · 9c620e2b
      Hugh Dickins 提交于
      No functional change, but the only purpose of the offlining argument to
      migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
      KSM page for memory hotremove (which took ksm_thread_mutex) but not for
      other callers.  Now all cases are safe, remove the arg.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c620e2b
    • M
      mm: rename page struct field helpers · 22b751c3
      Mel Gorman 提交于
      The function names page_xchg_last_nid(), page_last_nid() and
      reset_page_last_nid() were judged to be inconsistent so rename them to a
      struct_field_op style pattern.  As it looked jarring to have
      reset_page_mapcount() and page_nid_reset_last() beside each other in
      memmap_init_zone(), this patch also renames reset_page_mapcount() to
      page_mapcount_reset().  There are others like init_page_count() but as
      it is used throughout the arch code a rename would likely cause more
      conflicts than it is worth.
      
      [akpm@linux-foundation.org: fix zcache]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22b751c3
    • M
      mm: teach mm by current context info to not do I/O during memory allocation · 21caf2fc
      Ming Lei 提交于
      This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
      'struct task_struct'), so that the flag can be set by one task to avoid
      doing I/O inside memory allocation in the task's context.
      
      The patch trys to solve one deadlock problem caused by block device, and
      the problem may happen at least in the below situations:
      
      - during block device runtime resume, if memory allocation with
        GFP_KERNEL is called inside runtime resume callback of any one of its
        ancestors(or the block device itself), the deadlock may be triggered
        inside the memory allocation since it might not complete until the block
        device becomes active and the involed page I/O finishes.  The situation
        is pointed out first by Alan Stern.  It is not a good approach to
        convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
        subsystems may be involved(for example, PCI, USB and SCSI may be
        involved for usb mass stoarage device, network devices involved too in
        the iSCSI case)
      
      - during block device runtime suspend, because runtime resume need to
        wait for completion of concurrent runtime suspend.
      
      - during error handling of usb mass storage deivce, USB bus reset will
        be put on the device, so there shouldn't have any memory allocation with
        GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
        above may be triggered.  Unfortunately, any usb device may include one
        mass storage interface in theory, so it requires all usb interface
        drivers to handle the situation.  In fact, most usb drivers don't know
        how to handle bus reset on the device and don't provide .pre_set() and
        .post_reset() callback at all, so USB core has to unbind and bind driver
        for these devices.  So it is still not practical to resort to GFP_NOIO
        for solving the problem.
      
      Also the introduced solution can be used by block subsystem or block
      drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
      actual I/O transfer.
      
      It is not a good idea to convert all these GFP_KERNEL in the affected
      path into GFP_NOIO because these functions doing that may be implemented
      as library and will be called in many other contexts.
      
      In fact, memalloc_noio_flags() can convert some of current static
      GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
      at least almost all GFP_NOIO in USB subsystem can be converted into
      GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
      only happen in runtime resume/bus reset/block I/O transfer contexts
      generally.
      
      [1], several GFP_KERNEL allocation examples in runtime resume path
      
      - pci subsystem
      acpi_os_allocate
      	<-acpi_ut_allocate
      		<-ACPI_ALLOCATE_ZEROED
      			<-acpi_evaluate_object
      				<-__acpi_bus_set_power
      					<-acpi_bus_set_power
      						<-acpi_pci_set_power_state
      							<-platform_pci_set_power_state
      								<-pci_platform_power_transition
      									<-__pci_complete_power_transition
      										<-pci_set_power_state
      											<-pci_restore_standard_config
      												<-pci_pm_runtime_resume
      - usb subsystem
      usb_get_status
      	<-finish_port_resume
      		<-usb_port_resume
      			<-generic_resume
      				<-usb_resume_device
      					<-usb_resume_both
      						<-usb_runtime_resume
      
      - some individual usb drivers
      usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
      
      That is just what I have found.  Unfortunately, this allocation can only
      be found by human being now, and there should be many not found since
      any function in the resume path(call tree) may allocate memory with
      GFP_KERNEL.
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21caf2fc
    • M
      mm: remove MIGRATE_ISOLATE check in hotpath · 194159fb
      Minchan Kim 提交于
      Several functions test MIGRATE_ISOLATE and some of those are hotpath but
      MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
      CMA, memory-hotplug and memory-failure) which are not common config
      option.  So let's not add unnecessary overhead and code when we don't
      enable CONFIG_MEMORY_ISOLATION.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      194159fb
    • J
      mm: set zone->present_pages to number of existing pages in the zone · 306f2e9e
      Jiang Liu 提交于
      Now all users of "number of pages managed by the buddy system" have been
      converted to use zone->managed_pages, so set zone->present_pages to what
      it should be:
      
      	present_pages = spanned_pages - absent_pages;
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      306f2e9e
    • J
      mm: use zone->present_pages instead of zone->managed_pages where appropriate · b40da049
      Jiang Liu 提交于
      Now we have zone->managed_pages for "pages managed by the buddy system
      in the zone", so replace zone->present_pages with zone->managed_pages if
      what the user really wants is number of allocatable pages.
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b40da049
    • T
      acpi, memory-hotplug: support getting hotplug info from SRAT · 01a178a9
      Tang Chen 提交于
      We now provide an option for users who don't want to specify physical
      memory address in kernel commandline.
      
               /*
                * For movablemem_map=acpi:
                *
                * SRAT:                |_____| |_____| |_________| |_________| ......
                * node id:                0       1         1           2
                * hotpluggable:           n       y         y           n
                * movablemem_map:              |_____| |_________|
                *
                * Using movablemem_map, we can prevent memblock from allocating memory
                * on ZONE_MOVABLE at boot time.
                */
      
      So user just specify movablemem_map=acpi, and the kernel will use
      hotpluggable info in SRAT to determine which memory ranges should be set
      as ZONE_MOVABLE.
      
      If all the memory ranges in SRAT is hotpluggable, then no memory can be
      used by kernel.  But before parsing SRAT, memblock has already reserve
      some memory ranges for other purposes, such as for kernel image, and so
      on.  We cannot prevent kernel from using these memory.  So we need to
      exclude these ranges even if these memory is hotpluggable.
      
      Furthermore, there could be several memory ranges in the single node
      which the kernel resides in.  We may skip one range that have memory
      reserved by memblock, but if the rest of memory is too small, then the
      kernel will fail to boot.  So, make the whole node which the kernel
      resides in un-hotpluggable.  Then the kernel has enough memory to use.
      
      NOTE: Using this way will cause NUMA performance down because the
            whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
            on it.  If users don't want to lose NUMA performance, just don't use
            it.
      
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: use strcmp()]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01a178a9
    • T
      acpi, memory-hotplug: extend movablemem_map ranges to the end of node · 27168d38
      Tang Chen 提交于
      When implementing movablemem_map boot option, we introduced an array
      movablemem_map.map[] to store the memory ranges to be set as
      ZONE_MOVABLE.
      
      Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify
      the whole node memory range, we need to extend it to the node end so
      that we can use it to prevent memblock from allocating memory in the
      ranges user didn't specify.
      
      We now implement movablemem_map boot option like this:
      
              /*
               * For movablemem_map=nn[KMG]@ss[KMG]:
               *
               * SRAT:                |_____| |_____| |_________| |_________| ......
               * node id:                0       1         1           2
               * user specified:                |__|                 |___|
               * movablemem_map:                |___| |_________|    |______| ......
               *
               * Using movablemem_map, we can prevent memblock from allocating memory
               * on ZONE_MOVABLE at boot time.
               *
               * NOTE: In this case, SRAT info will be ingored.
               */
      
      [akpm@linux-foundation.org: clean up code, fix build warning]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27168d38
    • T
      page_alloc: make movablemem_map have higher priority · 42f47e27
      Tang Chen 提交于
      If kernelcore or movablecore is specified at the same time with
      movablemem_map, movablemem_map will have higher priority to be
      satisfied.  This patch will make find_zone_movable_pfns_for_nodes()
      calculate zone_movable_pfn[] with the limit from zone_movable_limit[].
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: NLin Feng <linfeng@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42f47e27
    • T
      page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes · 6981ec31
      Tang Chen 提交于
      Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
      limit from movablemem_map boot option for all nodes.  The function
      sanitize_zone_movable_limit() will find out to which node the ranges in
      movable_map.map[] belongs, and calculates the low boundary of
      ZONE_MOVABLE for each node.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NLiu Jiang <jiang.liu@huawei.com>
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: NLin Feng <linfeng@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6981ec31
    • T
      page_alloc: add movable_memmap kernel parameter · 34b71f1e
      Tang Chen 提交于
      Add functions to parse movablemem_map boot option.  Since the option
      could be specified more then once, all the maps will be stored in the
      global variable movablemem_map.map array.
      
      And also, we keep the array in monotonic increasing order by start_pfn.
      And merge all overlapped ranges.
      
      [akpm@linux-foundation.org: improve comment]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: remove unneeded parens]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Tested-by: NLin Feng <linfeng@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34b71f1e
    • A
      mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long · 90ae8d67
      Andrew Morton 提交于
      `int' is an inappropriate type for a number-of-pages counter.
      
      While we're there, use the clamp() macro.
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90ae8d67
    • S
      CMA: make putback_lru_pages() call conditional · 2a6f5124
      Srinivas Pandruvada 提交于
      As per documentation and other places calling putback_lru_pages(),
      putback_lru_pages() is called on error only.  Make the CMA code behave
      consistently.
      
      [akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a6f5124
  8. 19 2月, 2013 1 次提交
    • L
      mm: fix pageblock bitmap allocation · 7c45512d
      Linus Torvalds 提交于
      Commit c060f943 ("mm: use aligned zone start for pfn_to_bitidx
      calculation") fixed out calculation of the index into the pageblock
      bitmap when a !SPARSEMEM zome was not aligned to pageblock_nr_pages.
      
      However, the _allocation_ of that bitmap had never taken this alignment
      requirement into accout, so depending on the exact size and alignment of
      the zone, the use of that index could then access past the allocation,
      resulting in some very subtle memory corruption.
      
      This was reported (and bisected) by Ingo Molnar: one of his random
      config builds would hang with certain very specific kernel command line
      options.
      
      In the meantime, commit c060f943 has been marked for stable, so this
      fix needs to be back-ported to the stable kernels that backported the
      commit to use the right alignment.
      Bisected-and-tested-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c45512d
  9. 13 2月, 2013 1 次提交
  10. 08 2月, 2013 1 次提交
  11. 21 1月, 2013 1 次提交
  12. 12 1月, 2013 3 次提交
    • M
      mm: compaction: partially revert capture of suitable high-order page · 8fb74b9f
      Mel Gorman 提交于
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      showing to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca ("mm: compaction: capture a
      suitable high-order page immediately when it is made available").
      Reported-and-tested-by: NEric Wong <normalperson@yhbt.net>
      Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fb74b9f
    • L
      mm: use aligned zone start for pfn_to_bitidx calculation · c060f943
      Laura Abbott 提交于
      The current calculation in pfn_to_bitidx assumes that (pfn -
      zone->zone_start_pfn) >> pageblock_order will return the same bit for
      all pfn in a pageblock.  If zone_start_pfn is not aligned to
      pageblock_nr_pages, this may not always be correct.
      
      Consider the following with pageblock order = 10, zone start 2MB:
      
        pfn     | pfn - zone start | (pfn - zone start) >> page block order
        ----------------------------------------------------------------
        0x26000 | 0x25e00	   |  0x97
        0x26100 | 0x25f00	   |  0x97
        0x26200 | 0x26000	   |  0x98
        0x26300 | 0x26100	   |  0x98
      
      This means that calling {get,set}_pageblock_migratetype on a single page
      will not set the migratetype for the full block.  Fix this by rounding
      down zone_start_pfn when doing the bitidx calculation.
      
      For our use case, the effects of this bug were mostly tied to the fact
      that CMA allocations would either take a long time or fail to happen.
      Depending on the driver using CMA, this could result in anything from
      visual glitches to application failures.
      Signed-off-by: NLaura Abbott <lauraa@codeaurora.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c060f943
    • M
      mm: compaction: Partially revert capture of suitable high-order page · 47ecfcb7
      Mel Gorman 提交于
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      showing to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca "mm: compaction: capture a suitable
      high-order page immediately when it is made available".
      Reported-and-tested-by: NEric Wong <normalperson@yhbt.net>
      Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47ecfcb7
  13. 05 1月, 2013 1 次提交