1. 24 Feb, 2013 40 commits
    • P
      mm: move page flags layout to separate header · bbeae5b0
      Peter Zijlstra committed
      This is a preparation patch for moving page->_last_nid into page->flags
      that moves page flag layout information to a separate header.  This
      patch is necessary because otherwise there would be a circular
      dependency between mm_types.h and mm.h.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbeae5b0
    • M
      mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING · 3c0ff468
      Mel Gorman committed
      The current definition of count_vm_numa_events() is wrong for
      !CONFIG_NUMA_BALANCING as the following would miss the side-effect.
      
      	count_vm_numa_events(NUMA_FOO, bar++);
      
      There are no such users of count_vm_numa_events() but this patch fixes
      it as it is a potential pitfall.  Ideally both would be converted to
      static inline but NUMA_PTE_UPDATES is not defined if
      !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
      static inline would be similarly clumsy.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c0ff468
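The pitfall described in this commit can be reproduced in user space. The following is a minimal sketch, not the kernel's code: both macro names and both helper functions are hypothetical stand-ins. One stub drops its arguments entirely, the other still evaluates the delta expression once before discarding it, which is one way to keep side effects alive in a compiled-out counter.

```c
/* Hypothetical stand-ins; these are not the kernel's definitions. */

/* Stub that discards its arguments: an argument like "bar++" never runs. */
#define count_events_dropping(item, delta) do { } while (0)

/* Stub that evaluates the delta expression once, then discards the value. */
#define count_events_evaluating(item, delta) do { (void)(delta); } while (0)

/* Returns bar after passing "bar++" to the argument-dropping stub. */
int bar_after_dropping_stub(void)
{
	int bar = 0;

	count_events_dropping(0, bar++);
	return bar;	/* the increment was lost */
}

/* Returns bar after passing "bar++" to the argument-evaluating stub. */
int bar_after_evaluating_stub(void)
{
	int bar = 0;

	count_events_evaluating(0, bar++);
	return bar;	/* the increment happened */
}
```

Calling the two helpers side by side shows the difference: the first leaves bar untouched, the second increments it exactly once.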
    • M
      mm: numa: take THP into account when migrating pages for NUMA balancing · 3abef4e6
      Mel Gorman committed
      Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
      one base page is being migrated when in fact it can also be checking
      THP.
      
      The consequences are that a migration will be attempted when a target
      node is nearly full and fail later.  It's unlikely to be user-visible
      but it should be fixed.  While we are there, migrate_balanced_pgdat()
      should treat nr_migrate_pages as an unsigned long as it is treated as a
      watermark.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Suggested-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3abef4e6
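Why the page count matters for the watermark check can be illustrated with a small model. This is a sketch under stated assumptions, not the kernel's migrate_balanced_pgdat(): both function names are hypothetical, and the order-9 figure in the usage note assumes 512 base pages per huge page.

```c
#include <stdbool.h>

/*
 * Hypothetical model of a "does the target node have room" check:
 * the node must stay above its high watermark after receiving
 * nr_migrate_pages base pages.
 */
bool toy_target_has_room(unsigned long free_pages,
			 unsigned long high_watermark,
			 unsigned long nr_migrate_pages)
{
	return free_pages >= high_watermark + nr_migrate_pages;
}

/* Number of base pages covered by a page of the given order. */
unsigned long toy_pages_for_order(unsigned int order)
{
	return 1UL << order;
}
```

With 1100 free pages and a watermark of 1000, a single base page (order 0) fits, while an order-9 page does not. Checking with a count of one page for a huge page would wrongly report that there is room.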
    • M
      mm: numa: fix minor typo in numa_next_scan · 34f0315a
      Mel Gorman committed
      s/me/be/ and clarify the comment a bit when we're changing it anyway.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Suggested-by: Simon Jeons <simon.jeons@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34f0315a
    • M
      usb: forbid memory allocation with I/O during bus reset · 4d769def
      Ming Lei committed
      If one storage interface or usb network interface(iSCSI case) exists in
      current configuration, memory allocation with GFP_KERNEL during
      usb_device_reset() might trigger I/O transfer on the storage interface
      itself and cause deadlock because the 'us->dev_mutex' is held in
      .pre_reset() and the storage interface can't do I/O transfer when the
      reset is triggered by other interface, or the error handling can't be
      completed if the reset is triggered by the storage itself (error
      handling path).
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Reviewed-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d769def
    • M
      pm / runtime: force memory allocation with no I/O during Runtime PM callback · db88175f
      Ming Lei committed
      Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
      force memory allocation with no I/O during runtime_resume/runtime_suspend
      callback on device with the flag of 'memalloc_noio' set.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db88175f
    • M
      net/core: apply pm_runtime_set_memalloc_noio on network devices · 9802c8e2
      Ming Lei committed
      Deadlock might be caused by allocating memory with GFP_KERNEL in
      runtime_resume and runtime_suspend callback of network devices in iSCSI
      situation, so mark network devices and their ancestors as 'memalloc_noio'
      with the introduced pm_runtime_set_memalloc_noio().
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9802c8e2
    • M
      block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices · 25e823c8
      Ming Lei committed
      Apply the introduced pm_runtime_set_memalloc_noio on block device so
      that PM core will teach mm to not allocate memory with GFP_IOFS when
      calling the runtime_resume and runtime_suspend callback for block
      devices and their ancestors.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25e823c8
    • M
      pm / runtime: introduce pm_runtime_set_memalloc_noio() · e823407f
      Ming Lei committed
      Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM
      core teach mm not to allocate memory with the GFP_KERNEL flag, avoiding
      a probable deadlock.
      
      As explained in the comment, any GFP_KERNEL allocation inside
      runtime_resume() or runtime_suspend() on any one of device in the path
      from one block or network device to the root device in the device tree
      may cause deadlock, the introduced pm_runtime_set_memalloc_noio() sets
      or clears the flag on device in the path recursively.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e823407f
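The idea of marking every device on the path to the root can be sketched in user space. This is a simplified model, not the kernel's pm_runtime_set_memalloc_noio(): struct toy_device and toy_set_memalloc_noio() are hypothetical, and only the "set" direction is shown. Clearing is deliberately omitted, since a real implementation would also have to consider other descendants that still need the flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device node with a link to its parent. */
struct toy_device {
	struct toy_device *parent;
	bool memalloc_noio;
};

/* Mark dev and every ancestor up to the root of the device tree. */
void toy_set_memalloc_noio(struct toy_device *dev)
{
	for (; dev != NULL; dev = dev->parent)
		dev->memalloc_noio = true;
}
```

For a chain root, hub, disk, calling toy_set_memalloc_noio() on the disk marks all three, while a sibling device hanging directly off the root stays unmarked.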
    • M
      mm: teach mm by current context info to not do I/O during memory allocation · 21caf2fc
      Ming Lei committed
      This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
      'struct task_struct'), so that the flag can be set by one task to avoid
      doing I/O inside memory allocation in the task's context.
      
      The patch tries to solve one deadlock problem caused by block device, and
      the problem may happen at least in the below situations:
      
      - during block device runtime resume, if memory allocation with
        GFP_KERNEL is called inside runtime resume callback of any one of its
        ancestors(or the block device itself), the deadlock may be triggered
        inside the memory allocation since it might not complete until the block
        device becomes active and the involved page I/O finishes.  The situation
        is pointed out first by Alan Stern.  It is not a good approach to
        convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
        subsystems may be involved(for example, PCI, USB and SCSI may be
        involved for usb mass storage device, network devices involved too in
        the iSCSI case)
      
      - during block device runtime suspend, because runtime resume need to
        wait for completion of concurrent runtime suspend.
      
      - during error handling of usb mass storage device, USB bus reset will
        be put on the device, so there shouldn't have any memory allocation with
        GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
        above may be triggered.  Unfortunately, any usb device may include one
        mass storage interface in theory, so it requires all usb interface
        drivers to handle the situation.  In fact, most usb drivers don't know
        how to handle bus reset on the device and don't provide .pre_reset() and
        .post_reset() callback at all, so USB core has to unbind and bind driver
        for these devices.  So it is still not practical to resort to GFP_NOIO
        for solving the problem.
      
      Also the introduced solution can be used by block subsystem or block
      drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
      actual I/O transfer.
      
      It is not a good idea to convert all these GFP_KERNEL in the affected
      path into GFP_NOIO because these functions doing that may be implemented
      as library and will be called in many other contexts.
      
      In fact, memalloc_noio_flags() can convert some of current static
      GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
      at least almost all GFP_NOIO in USB subsystem can be converted into
      GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
      only happen in runtime resume/bus reset/block I/O transfer contexts
      generally.
      
      [1], several GFP_KERNEL allocation examples in runtime resume path
      
      - pci subsystem
      acpi_os_allocate
      	<-acpi_ut_allocate
      		<-ACPI_ALLOCATE_ZEROED
      			<-acpi_evaluate_object
      				<-__acpi_bus_set_power
      					<-acpi_bus_set_power
      						<-acpi_pci_set_power_state
      							<-platform_pci_set_power_state
      								<-pci_platform_power_transition
      									<-__pci_complete_power_transition
      										<-pci_set_power_state
      											<-pci_restore_standard_config
      												<-pci_pm_runtime_resume
      - usb subsystem
      usb_get_status
      	<-finish_port_resume
      		<-usb_port_resume
      			<-generic_resume
      				<-usb_resume_device
      					<-usb_resume_both
      						<-usb_runtime_resume
      
      - some individual usb drivers
      usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
      
      That is just what I have found.  Unfortunately, this allocation can only
      be found by human being now, and there should be many not found since
      any function in the resume path(call tree) may allocate memory with
      GFP_KERNEL.
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21caf2fc
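The save/restore pattern this commit describes can be modelled in user space. The sketch below is a toy, not the kernel's implementation: every toy_ name and TOY_ constant is hypothetical, the bit values are invented for illustration, and a single global variable stands in for the 'flags' field of the current task. The model only strips an I/O bit; which bits a real kernel strips is not assumed here.

```c
/* Illustrative bit values; the kernel's real constants differ. */
#define TOY_GFP_IO		0x1u
#define TOY_GFP_FS		0x2u
#define TOY_GFP_KERNEL		(TOY_GFP_IO | TOY_GFP_FS | 0x4u)
#define TOY_PF_MEMALLOC_NOIO	0x1u

/* Stands in for the 'flags' field of the current task. */
static unsigned int toy_task_flags;

/* Set the no-I/O flag and return its previous state for a later restore. */
unsigned int toy_memalloc_noio_save(void)
{
	unsigned int old = toy_task_flags & TOY_PF_MEMALLOC_NOIO;

	toy_task_flags |= TOY_PF_MEMALLOC_NOIO;
	return old;
}

/* Put the no-I/O flag back to the state returned by the matching save. */
void toy_memalloc_noio_restore(unsigned int old)
{
	toy_task_flags = (toy_task_flags & ~TOY_PF_MEMALLOC_NOIO) | old;
}

/* Strip the I/O bit from an allocation mask while the flag is set. */
unsigned int toy_memalloc_noio_flags(unsigned int gfp)
{
	if (toy_task_flags & TOY_PF_MEMALLOC_NOIO)
		gfp &= ~TOY_GFP_IO;
	return gfp;
}
```

Returning the old state from the save function is what makes nesting safe: an inner save/restore pair leaves the flag set until the outermost restore runs, so library code called inside the scope does not need to know about it.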
    • Z
      mm: don't wait on congested zones in balance_pgdat() · 258401a6
      Zlatko Calusic committed
      From: Zlatko Calusic <zlatko.calusic@iskon.hr>
      
      Commit 92df3a72 ("mm: vmscan: throttle reclaim if encountering too
      many dirty pages under writeback") introduced waiting on congested zones
      based on a sane algorithm in shrink_inactive_list().
      
      What this means is that there's no more need for throttling and
      additional heuristics in balance_pgdat().  So, let's remove it and tidy
      up the code.
      Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      258401a6
    • N
      mm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp · 4db0e950
      Naoya Horiguchi committed
      num_poisoned_pages counts up the number of pages isolated by memory
      errors.  But for thp, only one subpage is isolated because memory error
      handler splits it, so it's wrong to add (1 << compound_trans_order).
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4db0e950
    • N
      mm/memory-failure.c: clean up soft_offline_page() · af8fae7c
      Naoya Horiguchi committed
      Currently soft_offline_page() is hard to maintain because it has many
      return points and goto statements.  All of this mess comes from
      get_any_page().
      
      This function should only get page refcount as the name implies, but it
      does some page isolating actions like SetPageHWPoison() and dequeuing
      hugepage.  This patch corrects it and introduces some internal
      subroutines to make soft offlining code more readable and maintainable.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af8fae7c
    • X
      memory-failure: use num_poisoned_pages instead of mce_bad_pages · 293c07e3
      Xishi Qiu committed
      Since MCE is an x86 concept, and this code is in mm/, it would be better
      to use the name num_poisoned_pages instead of mce_bad_pages.
      
      [akpm@linux-foundation.org: fix mm/sparse.c]
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Suggested-by: Borislav Petkov <bp@alien8.de>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      293c07e3
    • X
      memory-failure: do code refactor of soft_offline_page() · fa8dd8a9
      Xishi Qiu committed
      There are too many return points randomly intermingled with some "goto
      done" return points.  So adjust the function structure, one for the
      success path, the other for the failure path.  Use atomic_long_inc
      instead of atomic_long_add.
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa8dd8a9
    • X
      memory-failure: fix an error of mce_bad_pages statistics · 0ebff32c
      Xishi Qiu committed
      When doing
      
          $ echo paddr > /sys/devices/system/memory/soft_offline_page
      
      to offline a *free* page, the value of mce_bad_pages is incremented and
      the page gets the HWPoison flag, but it is still managed by the page
      buddy allocator.
      
         $ cat /proc/meminfo | grep HardwareCorrupted
      
      shows the value.
      
      If we offline the same page, the value of mce_bad_pages is incremented
      *again*, so the value is now incorrect.  Assume the page is
      still free during this short time.
      
        soft_offline_page()
          get_any_page()
            "else if (is_free_buddy_page(p))" branch return 0
              "goto done";
                 "atomic_long_add(1, &mce_bad_pages);"
      
      This patch:
      
      Move poisoned page check at the beginning of the function in order to
      fix the error.
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ebff32c
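The effect of checking the poison flag first can be shown with a toy counter. This is a sketch, not the kernel's soft_offline_page(): struct toy_page, toy_num_poisoned and toy_soft_offline() are hypothetical, and the -1 return value is only a placeholder for whatever error the real function reports.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page with a hardware-poison flag. */
struct toy_page {
	bool hwpoison;
};

/* Stands in for the global count of poisoned pages. */
static long toy_num_poisoned;

/*
 * Offline a page. The "already poisoned" check runs before anything
 * else, so a second request for the same page cannot bump the counter.
 * Returns 0 on success and -1 if the page was already poisoned.
 */
int toy_soft_offline(struct toy_page *p)
{
	if (p->hwpoison)
		return -1;

	p->hwpoison = true;
	toy_num_poisoned++;
	return 0;
}
```

Offlining the same page twice leaves the counter at one, whereas a version that increments before checking the flag would count the page twice.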
    • M
      mm: remove MIGRATE_ISOLATE check in hotpath · 194159fb
      Minchan Kim committed
      Several functions test MIGRATE_ISOLATE and some of those are hotpath but
      MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
      CMA, memory-hotplug and memory-failure) which are not common config
      option.  So let's not add unnecessary overhead and code when we don't
      enable CONFIG_MEMORY_ISOLATION.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      194159fb
    • J
      mm: increase totalram_pages when free pages allocated by bootmem allocator · c60514b6
      Jiang Liu committed
      Function put_page_bootmem() is used to free pages allocated by bootmem
      allocator, so it should increase totalram_pages when freeing pages into
      the buddy system.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c60514b6
    • J
      mm: set zone->present_pages to number of existing pages in the zone · 306f2e9e
      Jiang Liu committed
      Now all users of "number of pages managed by the buddy system" have been
      converted to use zone->managed_pages, so set zone->present_pages to what
      it should be:
      
      	present_pages = spanned_pages - absent_pages;
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      306f2e9e
    • J
      mm: use zone->present_pages instead of zone->managed_pages where appropriate · b40da049
      Jiang Liu committed
      Now we have zone->managed_pages for "pages managed by the buddy system
      in the zone", so replace zone->present_pages with zone->managed_pages if
      what the user really wants is number of allocatable pages.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
      Cc: Chris Clayton <chris2553@googlemail.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b40da049
    • T
      mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in... · f7210e6c
      Tang Chen committed
      mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
      
      The definition of struct movablecore_map is protected by
      CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
      is not.  So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
      movablecore_map in memblock_overlaps_region().
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7210e6c
    • T
      acpi, memory-hotplug: support getting hotplug info from SRAT · 01a178a9
      Tang Chen committed
      We now provide an option for users who don't want to specify physical
      memory address in kernel commandline.
      
               /*
                * For movablemem_map=acpi:
                *
                * SRAT:                |_____| |_____| |_________| |_________| ......
                * node id:                0       1         1           2
                * hotpluggable:           n       y         y           n
                * movablemem_map:              |_____| |_________|
                *
                * Using movablemem_map, we can prevent memblock from allocating memory
                * on ZONE_MOVABLE at boot time.
                */
      
      So the user just specifies movablemem_map=acpi, and the kernel will use
      hotpluggable info in SRAT to determine which memory ranges should be set
      as ZONE_MOVABLE.
      
      If all the memory ranges in SRAT are hotpluggable, then no memory can be
      used by the kernel.  But before parsing SRAT, memblock has already
      reserved some memory ranges for other purposes, such as for the kernel
      image, and so on.  We cannot prevent the kernel from using this memory.
      So we need to exclude these ranges even if this memory is hotpluggable.
      
      Furthermore, there could be several memory ranges in the single node
      which the kernel resides in.  We may skip one range that has memory
      reserved by memblock, but if the rest of memory is too small, then the
      kernel will fail to boot.  So, make the whole node which the kernel
      resides in un-hotpluggable.  Then the kernel has enough memory to use.
      
      NOTE: Using this option will degrade NUMA performance because the
            whole node will be set as ZONE_MOVABLE, and kernel cannot use memory
            on it.  If users don't want to lose NUMA performance, just don't use
            it.
      
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: use strcmp()]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01a178a9
    • T
      acpi, memory-hotplug: extend movablemem_map ranges to the end of node · 27168d38
      Tang Chen committed
      When implementing movablemem_map boot option, we introduced an array
      movablemem_map.map[] to store the memory ranges to be set as
      ZONE_MOVABLE.
      
      Since ZONE_MOVABLE is the last zone of a node, if user didn't specify
      the whole node memory range, we need to extend it to the node end so
      that we can use it to prevent memblock from allocating memory in the
      ranges user didn't specify.
      
      We now implement movablemem_map boot option like this:
      
              /*
               * For movablemem_map=nn[KMG]@ss[KMG]:
               *
               * SRAT:                |_____| |_____| |_________| |_________| ......
               * node id:                0       1         1           2
               * user specified:                |__|                 |___|
               * movablemem_map:                |___| |_________|    |______| ......
               *
               * Using movablemem_map, we can prevent memblock from allocating memory
               * on ZONE_MOVABLE at boot time.
               *
               * NOTE: In this case, SRAT info will be ignored.
               */
      
      [akpm@linux-foundation.org: clean up code, fix build warning]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27168d38
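The range extension described in this commit can be sketched for a single range and a single node. This is a simplified model under stated assumptions, not the kernel's code: struct toy_range and toy_extend_to_node_end() are hypothetical, ranges are half-open [start, end), and the case of one node made of several SRAT ranges shown in the diagram is not handled.

```c
/* Hypothetical half-open address range: [start, end). */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/*
 * If the user-specified range starts inside the node, extend its end
 * to the end of the node. Ranges that start outside the node are
 * returned unchanged.
 */
struct toy_range toy_extend_to_node_end(struct toy_range user,
					struct toy_range node)
{
	struct toy_range out = user;

	if (user.start >= node.start && user.start < node.end)
		out.end = node.end;
	return out;
}
```

For a node covering [100, 200), a user range of [120, 150) becomes [120, 200), so nothing between the user's range and the node end is left available for early allocations.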
    • T
      acpi, memory-hotplug: parse SRAT before memblock is ready · e8d19552
      Tang Chen committed
      On Linux, pages used by the kernel cannot be migrated.  As a result,
      if a memory range is used by kernel, it cannot be hot-removed.  So if we
      want to hot-remove memory, we should prevent kernel from using it.
      
      The way now used to prevent this is to specify a memory range by
      movablemem_map boot option and set it as ZONE_MOVABLE.
      
      But when the system is booting, memblock will allocate memory, and
      reserve the memory for kernel.  And before we parse SRAT, and know the
      node memory ranges, memblock is working.  And it may allocate memory in
      ranges to be set as ZONE_MOVABLE.  This memory can be used by kernel,
      and never be freed.
      
      So, let's parse SRAT before memblock is called first.  And it is early
      enough.
      
      The first call of memblock_find_in_range_node() is in:
      
        setup_arch()
          |-->setup_real_mode()
      
      So this patch adds a function early_parse_srat() to parse SRAT, and
      calls it before setup_real_mode() is called.
      
      NOTE:
      
      1) early_parse_srat() is called before numa_init(), and has initialized
         numa_meminfo.  So DO NOT clear numa_nodes_parsed in numa_init() and DO
         NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
         numa info.
      
      2) I don't know why the original acpi_numa_init() uses the count of
         memory affinities parsed from SRAT as its return value.  So I added
         a static variable srat_mem_cnt to remember this count and use it as
         the return value of the new acpi_numa_init().
      
      [mhocko@suse.cz: parse SRAT before memblock is ready fix]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8d19552
    • T
      page_alloc: bootmem limit with movablecore_map · fb06bc8e
      Tang Chen committed
      Ensure that bootmem does not allocate memory from areas that may become
      ZONE_MOVABLE.  The map info comes from the movablecore_map boot option.
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb06bc8e
    • T
      page_alloc: make movablemem_map have higher priority · 42f47e27
      Tang Chen committed
      If kernelcore or movablecore is specified together with
      movablemem_map, movablemem_map takes priority.  This patch makes
      find_zone_movable_pfns_for_nodes() calculate zone_movable_pfn[] with
      the limit from zone_movable_limit[].
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42f47e27
    • T
      page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes · 6981ec31
      Tang Chen committed
      Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
      limit from the movablemem_map boot option for all nodes.  The function
      sanitize_zone_movable_limit() finds out which node each range in
      movable_map.map[] belongs to, and calculates the low boundary of
      ZONE_MOVABLE for each node.
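
      The per-node boundary calculation described above can be sketched in
      plain user-space C.  This is only an illustration, not the kernel
      code: struct pfn_range, compute_movable_limit() and the convention
      that a limit of 0 means "no range touches this node" are assumptions
      made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical PFN range, end is exclusive. */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * For every node, store the lowest PFN at which a user-supplied movable
 * range overlaps the node's span, or 0 when no range touches the node.
 * 'ranges' must be sorted by start_pfn and must not overlap each other.
 */
static void compute_movable_limit(const struct pfn_range *nodes, size_t nr_nodes,
				  const struct pfn_range *ranges, size_t nr_ranges,
				  unsigned long *limit)
{
	size_t nid, i;

	assert(nodes != NULL && ranges != NULL && limit != NULL);

	for (nid = 0; nid < nr_nodes; nid++) {
		limit[nid] = 0;
		for (i = 0; i < nr_ranges; i++) {
			if (ranges[i].end_pfn <= nodes[nid].start_pfn)
				continue;	/* range lies entirely below the node */
			if (ranges[i].start_pfn >= nodes[nid].end_pfn)
				break;		/* sorted: nothing later can overlap */
			/* The first overlapping range gives the lowest boundary. */
			limit[nid] = ranges[i].start_pfn > nodes[nid].start_pfn ?
				     ranges[i].start_pfn : nodes[nid].start_pfn;
			break;
		}
	}
}
```

      With nodes spanning [0,100), [100,200) and [200,300) and movable
      ranges [150,180) and [190,250), the sketch leaves node 0 without a
      limit and sets the boundaries of nodes 1 and 2 to 150 and 200.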
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Liu Jiang <jiang.liu@huawei.com>
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6981ec31
    • T
      page_alloc: add movable_memmap kernel parameter · 34b71f1e
      Tang Chen committed
      Add functions to parse the movablemem_map boot option.  Since the
      option can be specified more than once, all the maps are stored in the
      global movablemem_map.map array.

      We also keep the array in monotonically increasing order by start_pfn
      and merge all overlapping ranges.
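
      A minimal sketch of such a sorted insert with merging, written as
      stand-alone C rather than kernel code.  struct range_map, MAP_MAX and
      range_map_insert() are hypothetical names for this example, and the
      choice to also merge ranges that merely touch is an assumption, not
      something the patch states.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_MAX 32	/* hypothetical capacity, not taken from the patch */

struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

struct range_map {
	size_t nr;
	struct pfn_range map[MAP_MAX];
};

/*
 * Insert [start, end) so that map[] stays sorted by start_pfn and holds
 * no overlapping entries.  Returns 0 on success, -1 if the map is full.
 */
static int range_map_insert(struct range_map *m,
			    unsigned long start, unsigned long end)
{
	size_t first, last, i;

	assert(m != NULL && start < end);

	/* First entry that ends at or after the new start. */
	for (first = 0; first < m->nr; first++)
		if (m->map[first].end_pfn >= start)
			break;

	/* One past the last entry that begins at or before the new end. */
	for (last = first; last < m->nr; last++)
		if (m->map[last].start_pfn > end)
			break;

	if (first == last) {
		/* Nothing to merge: open a slot at 'first'. */
		if (m->nr == MAP_MAX)
			return -1;
		for (i = m->nr; i > first; i--)
			m->map[i] = m->map[i - 1];
		m->nr++;
	} else {
		/* Entries [first, last) collapse into a single entry. */
		if (m->map[first].start_pfn < start)
			start = m->map[first].start_pfn;
		if (m->map[last - 1].end_pfn > end)
			end = m->map[last - 1].end_pfn;
		for (i = last; i < m->nr; i++)
			m->map[first + 1 + (i - last)] = m->map[i];
		m->nr -= last - first - 1;
	}

	m->map[first].start_pfn = start;
	m->map[first].end_pfn = end;
	return 0;
}
```

      Inserting [10,20), [40,50), [15,45) and [0,5) in that order leaves two
      entries, [0,5) and [10,50), which is the invariant later code can rely
      on when it walks the array once from low to high addresses.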
      
      [akpm@linux-foundation.org: improve comment]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: remove unneeded parens]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34b71f1e
    • Y
      x86: get pg_data_t's memory from other node · 4d59a751
      Yasuaki Ishimatsu committed
      During the implementation of SRAT support, we met a problem.  In
      setup_arch(), we have the following call series:
      
       1) memblock is ready;
       2) some functions use memblock to allocate memory;
       3) parse ACPI tables, such as SRAT.
      
      Before 3), we don't know which memory is hotpluggable, and as a result,
      we cannot prevent memblock from allocating hotpluggable memory.  So, in
      2), there could be some hotpluggable memory allocated by memblock.
      
      Now, we are trying to parse SRAT earlier, before memblock is ready.  But
      I think we need more investigation on this topic.  So in this v5, I
      dropped all the SRAT support, and v5 is just the same as v3, and it is
      based on 3.8-rc3.
      
      As planned, we will eventually support getting info from SRAT without
      user intervention, and we will post another patch set to do so.

      Also, I think that for now we can add this boot option as the first
      step of supporting movable nodes.  Since Linux cannot migrate direct
      mapped pages, the only way for now is to restrict a whole node to
      contain only movable memory.

      Using SRAT is one way.  But even if we can use SRAT, users still need an
      interface to enable/disable this functionality if they don't want to
      lose their NUMA performance.  So I think a user interface is always
      needed.

      For now, users can disable this functionality by not specifying the boot
      option.  Later, we will post SRAT support, and add another option value
      "movablecore_map=acpi" to use SRAT.
      
      This patch:
      
      If the system creates a movable node, in which all memory of the node
      is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory
      for the node's pg_data_t from that node.  So use
      memblock_alloc_try_nid() instead of memblock_alloc_nid() to retry when
      the first allocation fails.
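
      The retry behaviour can be illustrated with a small user-space model.
      It is a sketch under assumptions: struct node_pool, alloc_on_node()
      and alloc_try_node() are invented for the example and only mimic the
      idea of "try the preferred node, then fall back to any other node";
      they are not the memblock API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 4
#define NO_NODE  (-1)

struct node_pool {
	size_t free_bytes;	/* memory the early allocator may hand out */
	bool   movable_only;	/* whole node is reserved for ZONE_MOVABLE */
};

/* Try one node only; returns the node used, or NO_NODE on failure. */
static int alloc_on_node(struct node_pool *pools, int nid, size_t size)
{
	assert(nid >= 0 && nid < NR_NODES);
	if (pools[nid].movable_only || pools[nid].free_bytes < size)
		return NO_NODE;
	pools[nid].free_bytes -= size;
	return nid;
}

/* Prefer 'nid', but fall back to any other node that can satisfy the request. */
static int alloc_try_node(struct node_pool *pools, int nid, size_t size)
{
	int got = alloc_on_node(pools, nid, size);
	int other;

	for (other = 0; got == NO_NODE && other < NR_NODES; other++)
		if (other != nid)
			got = alloc_on_node(pools, other, size);
	return got;
}
```

      In this model a request for a movable-only node 1 is served from node
      0 instead of failing, which mirrors why the node's pg_data_t may end
      up in another node's memory.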
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d59a751
    • T
      sched: do not use cpu_to_node() to find an offlined cpu's node. · aa00d89c
      Tang Chen committed
      If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
      will return -1.  As a result, cpumask_of_node(nid) will return NULL.  In
      this case, find_next_bit() in for_each_cpu will get a NULL pointer and
      cause a panic.
      
      Here is a call trace:
        Call Trace:
         <IRQ>
          select_fallback_rq+0x71/0x190
          try_to_wake_up+0x2cb/0x2f0
          wake_up_process+0x15/0x20
          hrtimer_wakeup+0x22/0x30
          __run_hrtimer+0x83/0x320
          hrtimer_interrupt+0x106/0x280
          smp_apic_timer_interrupt+0x69/0x99
          apic_timer_interrupt+0x6f/0x80
      
      There is a sleeping hrtimer process whose cpu has already been
      offlined.  When it is woken up, it tries to find another cpu to run on,
      and gets a nid of -1.  As a result, cpumask_of_node(-1) returns NULL,
      which causes a kernel panic.

      This patch fixes this problem by checking whether the nid is -1.  If
      nid is not -1, a cpu on the same node will be picked.  Otherwise, an
      online cpu on another node will be picked.
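
      The check can be sketched as a stand-alone C function.  This is an
      illustration of the selection order only: struct cpu_info and
      pick_fallback_cpu() are hypothetical and do not reproduce
      select_fallback_rq().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8
#define NO_NODE (-1)

struct cpu_info {
	int  nid;	/* NO_NODE once the cpu-to-node mapping was cleared */
	bool online;
};

/*
 * Pick a cpu for a task whose previous cpu 'prev' went offline.
 * Returns a cpu number, or -1 if no cpu is online at all.
 */
static int pick_fallback_cpu(const struct cpu_info *cpus, int prev)
{
	int nid, cpu;

	assert(prev >= 0 && prev < NR_CPUS);
	nid = cpus[prev].nid;

	/* Only look at the node when the mapping is still valid. */
	if (nid != NO_NODE) {
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (cpus[cpu].online && cpus[cpu].nid == nid)
				return cpu;
	}

	/* No valid node, or no online cpu on it: take any online cpu. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpus[cpu].online)
			return cpu;
	return -1;
}
```

      The important point is that the node id is never used as an index or
      lookup key while it is -1; the node-local search is simply skipped.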
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa00d89c
    • W
      cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node · e13fe869
      Wen Congyang committed
      When the node is offlined, there is no memory/cpu on the node.  If a
      sleeping task last ran on a cpu of this node, it will be migrated to a
      cpu on another node.  So we can clear the cpu-to-node mapping.
      
      [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e13fe869
    • W
      cpu-hotplug, memory-hotplug: try offlining the node when hotremoving a cpu · 76bba142
      Wen Congyang committed
      The node will be offlined when all memory/cpu on the node is hotremoved.
      So we should try to offline the node when hotremoving a cpu on the node.
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76bba142
    • W
      memory-hotplug: export the function try_offline_node() · 90b30cdc
      Wen Congyang committed
      try_offline_node() will be needed in the tristate
      drivers/acpi/processor_driver.c.
      
      The node will be offlined when all memory/cpu on the node have been
      hotremoved.  So we need the function try_offline_node() in cpu-hotplug
      path.
      
      If memory-hotplug is disabled and cpu-hotplug is enabled:

      1. no memory on the node
         we don't online the node, and the cpu's node is the nearest node.

      2. the node contains some memory
         the node has been onlined, and the cpu's node is still needed
         to migrate the sleeping task on the cpu to the same node.
      
      So we do nothing in try_offline_node() in this case.
      
      [rientjes@google.com: export the function try_offline_node() fix]
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90b30cdc
    • W
      cpu_hotplug: clear apicid to node when the cpu is hotremoved · c4c60524
      Wen Congyang committed
      When a cpu is hotplugged, we call acpi_map_cpu2node() in
      _acpi_map_lsapic() to store the cpu's node and apicid's node.  But we
      don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
      hotremoved.  If the node is also hotremoved, we will get the following
      messages:
      
        kernel BUG at include/linux/gfp.h:329!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
        RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
        RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
        RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
        R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
        FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
        Call Trace:
          new_slab+0x30/0x1b0
          __slab_alloc+0x358/0x4c0
          kmem_cache_alloc_node_trace+0xb4/0x1e0
          alloc_fair_sched_group+0xd0/0x1b0
          sched_create_group+0x3e/0x110
          sched_autogroup_create_attach+0x4d/0x180
          sys_setsid+0xd4/0xf0
          system_call_fastpath+0x16/0x1b
        Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
        RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
         RSP <ffff88078a049cf8>
        ---[ end trace adf84c90f3fea3e5 ]---
      
      The reason is that the cpu's node is not NUMA_NO_NODE, so we call
      alloc_pages_exact_node() to allocate memory on the node, but the node
      is offlined.

      If the node is onlined, we still need the cpu's node.  For example: a
      task on the cpu is sleeping when the cpu is hotremoved.  We will choose
      another cpu to run this task when it is woken up.  If we know the cpu's
      node, we will choose a cpu on the same node first.  So we should clear
      the cpu-to-node mapping when the node is offlined.
      
      This patch only clears apicid-to-node mapping when the cpu is
      hotremoved.
      
      [akpm@linux-foundation.org: fix section error]
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4c60524
    • L
      mempolicy: fix is_valid_nodemask() · d3eb1570
      Lai Jiangshan committed
      is_valid_nodemask() was introduced by commit 19770b32 ("mm: filter
      based on a nodemask as well as a gfp_mask"), but it does not match its
      comments, because it does not check zones above policy_zone.

      Also, commit b377fd39 ("Apply memory policies to top two highest
      zones when highest zone is ZONE_MOVABLE") tells us that if the
      highest zone is ZONE_MOVABLE, we should also apply memory policies to
      it.  So ZONE_MOVABLE should be a valid zone for policies, and
      is_valid_nodemask() needs to be changed to match.

      Fix: check all zones, even those whose zoneid > policy_zone.  Use
      nodes_intersects() instead of open code to check it.
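
      To see why stopping at policy_zone gives the wrong answer, consider
      this small user-space model.  It is a sketch under assumptions: the
      zone enum, struct node_zones and nodemask_has_memory() are invented
      for the example, and a plain bitmask stands in for nodemask_t.

```c
#include <assert.h>
#include <stdbool.h>

enum zone_sketch { Z_DMA, Z_NORMAL, Z_MOVABLE, NR_ZONES_SKETCH };

#define NR_NODES_SKETCH 4

struct node_zones {
	unsigned long present_pages[NR_ZONES_SKETCH];
};

/*
 * Return true if any node selected in 'nodemask' has memory in a zone
 * with index <= highest_zone.
 */
static bool nodemask_has_memory(unsigned int nodemask,
				const struct node_zones *nodes,
				int highest_zone)
{
	int nid, z;

	assert(highest_zone >= 0 && highest_zone < NR_ZONES_SKETCH);

	for (nid = 0; nid < NR_NODES_SKETCH; nid++) {
		if (!(nodemask & (1u << nid)))
			continue;
		for (z = 0; z <= highest_zone; z++)
			if (nodes[nid].present_pages[z] > 0)
				return true;
	}
	return false;
}
```

      For a node whose memory is entirely in the movable zone, scanning only
      up to Z_NORMAL rejects the mask, while scanning all zones accepts it.
      Checking all zones is equivalent to intersecting the mask with the set
      of nodes that have any memory, which is what a single
      nodes_intersects() call expresses.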
      Reported-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3eb1570
    • W
      memory-hotplug: consider compound pages when free memmap · 8a356ce3
      Wen Congyang committed
      usemap could also be allocated as compound pages, so we should also
      consider compound pages when freeing memmap.

      If we don't fix it, there could be problems when we free vmemmap
      pagetables which are stored in compound pages.  The old pagetables will
      not be freed properly, and when we add the memory again, no new
      pagetable will be created.  The old pagetable entry is then used, and
      the kernel will panic.
      
      The call trace is like the following:
      
        BUG: unable to handle kernel paging request at ffffea0040000000
        IP: [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
        PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
        Oops: 0002 [#1] SMP
        Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
        CPU 0
        Pid: 4, comm: kworker/0:0 Tainted: G        W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
        RIP: 0010:[<ffffffff816a483f>]  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
        RSP: 0018:ffff8807bdcb35d8  EFLAGS: 00010006
        RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
        RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
        RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
        R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
        R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
        FS:  0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
        Call Trace:
          __add_pages+0x85/0x120
          arch_add_memory+0x71/0xf0
          add_memory+0xd6/0x1f0
          acpi_memory_device_add+0x170/0x20c
          acpi_device_probe+0x50/0x18a
          really_probe+0x6c/0x320
          driver_probe_device+0x47/0xa0
          __device_attach+0x53/0x60
          bus_for_each_drv+0x6c/0xa0
          device_attach+0xa8/0xc0
          bus_probe_device+0xb0/0xe0
          device_add+0x301/0x570
          device_register+0x1e/0x30
          acpi_device_register+0x1d8/0x27c
          acpi_add_single_object+0x1df/0x2b9
          acpi_bus_check_add+0x112/0x18f
          acpi_ns_walk_namespace+0x105/0x255
          acpi_walk_namespace+0xcf/0x118
          acpi_bus_scan+0x5b/0x7c
          acpi_bus_add+0x2a/0x2c
          container_notify_cb+0x112/0x1a9
          acpi_ev_notify_dispatch+0x46/0x61
          acpi_os_execute_deferred+0x27/0x34
          process_one_work+0x20e/0x5c0
          worker_thread+0x12e/0x370
          kthread+0xee/0x100
          ret_from_fork+0x7c/0xb0
        Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 <f3> aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
        RIP  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
         RSP <ffff8807bdcb35d8>
        CR2: ffffea0040000000
        ---[ end trace e7f94e3a34c442d4 ]---
        Kernel panic - not syncing: Fatal exception
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a356ce3
    • T
      memory-hotplug: do not allocate pgdat if it was not freed when offline. · a1e565aa
      Tang Chen committed
      Since there is no way to guarantee that the address of pgdat/zone is
      not on the stack of any kernel thread or used by other kernel objects
      without reference counting or another synchronizing method, we cannot
      reset node_data and free pgdat when offlining a node.  Just reset pgdat
      to 0 and reuse the memory when the node is online again.
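
      The "zero it, keep it, reuse it" pattern looks roughly like this in
      stand-alone C.  It is only a sketch: struct pgdat_sketch, node_online()
      and node_offline() are hypothetical names, and calloc() stands in for
      whatever allocator the kernel would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NR_NODES 4

struct pgdat_sketch {
	int node_id;
	unsigned long spanned_pages;
};

/* One slot per node; a slot is never freed once it has been allocated. */
static struct pgdat_sketch *node_data[NR_NODES];

/* Bring a node online, allocating its descriptor only on first use. */
static struct pgdat_sketch *node_online(int nid, unsigned long pages)
{
	assert(nid >= 0 && nid < NR_NODES);

	if (!node_data[nid]) {
		node_data[nid] = calloc(1, sizeof(*node_data[nid]));
		if (!node_data[nid])
			return NULL;
	}
	node_data[nid]->node_id = nid;
	node_data[nid]->spanned_pages = pages;
	return node_data[nid];
}

/* Take a node offline: clear the descriptor but keep the memory. */
static void node_offline(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);

	if (node_data[nid])
		memset(node_data[nid], 0, sizeof(*node_data[nid]));
}
```

      Because the pointer stays valid across offline and online, anyone who
      still holds a stale reference reads zeroed fields instead of freed
      memory, at the cost of never returning the descriptor's memory.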
      
      The problem is suggested by Kamezawa Hiroyuki.  The idea is from Wen
      Congyang.
      
      NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
            will be triggered.
      
      [akpm@linux-foundation.org: fix warning when CONFIG_NEED_MULTIPLE_NODES=n]
      [akpm@linux-foundation.org: fix the warning again again]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1e565aa
    • W
      memory-hotplug: free node_data when a node is offlined · d822b86a
      Wen Congyang committed
      We call hotadd_new_pgdat() to allocate memory to store node_data.  So we
      should free it when removing a node.
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d822b86a
    • T
      memory-hotplug: remove sysfs file of node · 60a5a19e
      Tang Chen committed
      Introduce a new function try_offline_node() to remove the sysfs file of
      a node when all memory sections of this node are removed.  If some
      memory sections of this node are not removed, this function does
      nothing.
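
      The "do nothing unless every section is gone" rule can be sketched as
      follows.  struct node_sketch, SECTIONS_PER_NODE and
      try_offline_node_sketch() are assumptions made for illustration; the
      real function works on kernel data structures that are not modelled
      here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SECTIONS_PER_NODE 4	/* hypothetical, for illustration only */

struct node_sketch {
	bool section_present[SECTIONS_PER_NODE];
	bool sysfs_registered;
};

/*
 * Remove the node's sysfs entry only when no memory section is left.
 * Returns true if the node was taken offline, false if nothing was done.
 */
static bool try_offline_node_sketch(struct node_sketch *node)
{
	size_t i;

	assert(node != NULL);

	for (i = 0; i < SECTIONS_PER_NODE; i++)
		if (node->section_present[i])
			return false;	/* memory left: leave the node alone */

	node->sysfs_registered = false;
	return true;
}
```

      Callers can therefore invoke the function after every section removal
      without tracking whether it was the last one.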
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60a5a19e