1. 16 12月, 2009 25 次提交
    • H
      ksm: rmap_walk to remove_migation_ptes · e9995ef9
      Hugh Dickins 提交于
      A side-effect of making ksm pages swappable is that they have to be placed
      on the LRUs: which then exposes them to isolate_lru_page() and hence to
      page migration.
      
      Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
      rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c.  Perhaps some
      consolidation with existing code is possible, but don't attempt that yet
      (try_to_unmap needs to handle nonlinears, but migration pte removal does
      not).
      
      rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
      remove_anon_migration_ptes() which it replaces, avoids calling
      page_lock_anon_vma(), because that includes a page_mapped() test which
      fails when all migration ptes are in place.  That was valid when NUMA page
      migration was introduced (holding mmap_sem provided the missing guarantee
      that anon_vma's slab had not already been destroyed), but I believe not
      valid in the memory hotremove case added since.
      
      For now do the same as before, and consider the best way to fix that
      unlikely race later on.  When fixed, we can probably use rmap_walk() on
      hwpoisoned ksm pages too: for now, they remain among hwpoison's various
      exceptions (its PageKsm test comes before the page is locked, but its
      page_lock_anon_vma fails safely if an anon gets upgraded).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9995ef9
    • H
      ksm: hold anon_vma in rmap_item · db114b83
      Hugh Dickins 提交于
      For full functionality, page_referenced_one() and try_to_unmap_one() need
      to know the vma: to pass vma down to arch-dependent flushes, or to observe
      VM_LOCKED or VM_EXEC.  But KSM keeps no record of vma: nor can it, since
      vmas get split and merged without its knowledge.
      
      Instead, note page's anon_vma in its rmap_item when adding to stable tree:
      all the vmas which might map that page are listed by its anon_vma.
      
      page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
      first to find the probable vma, that which matches rmap_item's mm; but if
      that is not enough to locate all instances, traverse again to try the
      others.  This catches those occasions when fork has duplicated a pte of a
      ksm page, but ksmd has not yet come around to assign it an rmap_item.
      
      But each rmap_item in the stable tree which refers to an anon_vma needs to
      take a reference to it.  Andrea's anon_vma design cleverly avoided a
      reference count (an anon_vma was free when its list of vmas was empty),
      but KSM now needs to add that.  Is a 32-bit count sufficient?  I believe
      so - the anon_vma is only free when both count is 0 and list is empty.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db114b83
    • H
      ksm: let shared pages be swappable · 5ad64688
      Hugh Dickins 提交于
      Initial implementation for swapping out KSM's shared pages: add
      page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
      faced with a PageKsm page.
      
      Most of what's needed can be got from the rmap_items listed from the
      stable_node of the ksm page, without discovering the actual vma: so in
      this patch just fake up a struct vma for page_referenced_one() or
      try_to_unmap_one(), then refine that in the next patch.
      
      Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
      implicit there (being only set with VM_SHARED, already excluded), but
      let's make it explicit, to help justify the lack of nonlinear unmap.
      
      Rely on the page lock to protect against concurrent modifications to that
      page's node of the stable tree.
      
      The awkward part is not swapout but swapin: do_swap_page() and
      page_add_anon_rmap() now have to allow for new possibilities - perhaps a
      ksm page still in swapcache, perhaps a swapcache page associated with one
      location in one anon_vma now needed for another location or anon_vma.
      (And the vma might even be no longer VM_MERGEABLE when that happens.)
      
      ksm_might_need_to_copy() checks for that case, and supplies a duplicate
      page when necessary, simply leaving it to a subsequent pass of ksmd to
      rediscover the identity and merge them back into one ksm page.
      Disappointingly primitive: but the alternative would have to accumulate
      unswappable info about the swapped out ksm pages, limiting swappability.
      
      Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
      particular case it was handling, so just use it instead.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ad64688
    • H
      ksm: stable_node point to page and back · 08beca44
      Hugh Dickins 提交于
      Add a pointer to the ksm page into struct stable_node, holding a reference
      to the page while the node exists.  Put a pointer to the stable_node into
      the ksm page's ->mapping.
      
      Then we don't need get_ksm_page() while traversing the stable tree: the
      page to compare against is sure to be present and correct, even if it's no
      longer visible through any of its existing rmap_items.
      
      And we can handle the forked ksm page case more efficiently: no need to
      memcmp our way through the tree to find its match.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08beca44
    • H
      mm: CONFIG_MMU for PG_mlocked · af8e3354
      Hugh Dickins 提交于
      Remove three degrees of obfuscation, left over from when we had
      CONFIG_UNEVICTABLE_LRU.  MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
      CONFIG_HAVE_MLOCK is CONFIG_MMU.  rmap.o (and memory-failure.o) are only
      built when CONFIG_MMU, so don't need such conditions at all.
      
      Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
      169 defconfigs: leave those to evolve in due course.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af8e3354
    • H
      mm: define PAGE_MAPPING_FLAGS · 3ca7b3c5
      Hugh Dickins 提交于
      At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit set
      in page->mapping, with the higher bits a pointer to the anon_vma; and have
      defined PageKsm(page) as that with NULL anon_vma.
      
      But KSM swapping will need to store a pointer there: so in preparation for
      that, now define PAGE_MAPPING_FLAGS as the low two bits, including
      PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
      other use for the bit emerges).
      
      Declare page_rmapping(page) to return the pointer part of page->mapping,
      and page_anon_vma(page) to return the anon_vma pointer when that's what it
      is.  Use these in a few appropriate places: notably, unuse_vma() has been
      testing page->mapping, but is better to be testing page_anon_vma() (cases
      may be added in which flag bits are set without any pointer).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ca7b3c5
    • K
      vmscan: stop kswapd waiting on congestion when the min watermark is not being met · bb3ab596
      KOSAKI Motohiro 提交于
      If reclaim fails to make sufficient progress, the priority is raised.
      Once the priority is higher, kswapd starts waiting on congestion.
      However, if the zone is below the min watermark then kswapd needs to
      continue working without delay as there is a danger of an increased rate
      of GFP_ATOMIC allocation failure.
      
      This patch changes the conditions under which kswapd waits on congestion
      by only going to sleep if the min watermarks are being met.
      
      [mel@csn.ul.ie: add stats to track how relevant the logic is]
      [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb3ab596
    • M
      vmscan: have kswapd sleep for a short interval and double check it should be asleep · f50de2d3
      Mel Gorman 提交于
      After kswapd balances all zones in a pgdat, it goes to sleep.  In the
      event of no IO congestion, kswapd can go to sleep very shortly after the
      high watermark was reached.  If there are a constant stream of allocations
      from parallel processes, it can mean that kswapd went to sleep too quickly
      and the high watermark is not being maintained for sufficient length time.
      
      This patch makes kswapd go to sleep as a two-stage process.  It first
      tries to sleep for HZ/10.  If it is woken up by another process or the
      high watermark is no longer met, it's considered a premature sleep and
      kswapd continues work.  Otherwise it goes fully to sleep.
      
      This adds more counters to distinguish between fast and slow breaches of
      watermarks.  A "fast" premature sleep is one where the low watermark was
      hit in a very short time after kswapd going to sleep.  A "slow" premature
      sleep indicates that the high watermark was breached after a very short
      interval.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f50de2d3
    • L
      swap: rework map_swap_page() again · d4906e1a
      Lee Schermerhorn 提交于
      Seems that page_io.c doesn't really need to know that page_private(page)
      is the swp_entry 'val'.  Rework map_swap_page() to do what its name says
      and map a page to a page offset in the swap space.
      
      The only other caller of map_swap_page() is internal to mm/swapfile.c and
      it does want to map a swap entry to the 'sector'.  So rename
      map_swap_page() to map_swap_entry(), make it 'static' and and implement
      map_swap_page() as a wrapper around that.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4906e1a
    • H
      swap_info: reorder its fields · 7509765a
      Hugh Dickins 提交于
      Reorder (and comment) the fields of swap_info_struct, to make better
      use of its cachelines: it's good for swap_duplicate() in particular
      if unsigned int max and swap_map are near the start.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7509765a
    • H
      swap_info: note SWAP_MAP_SHMEM · aaa46865
      Hugh Dickins 提交于
      While we're fiddling with the swap_map values, let's assign a particular
      value to shmem/tmpfs swap pages: their swap counts are never incremented,
      and it helps swapoff's try_to_unuse() a little if it can immediately
      distinguish those pages from process pages.
      
      Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
      we might as well use that 0xbf value for SWAP_MAP_SHMEM.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aaa46865
    • H
      swap_info: swap count continuations · 570a335b
      Hugh Dickins 提交于
      Swap is duplicated (reference count incremented by one) whenever the same
      swap page is inserted into another mm (when forking finds a swap entry in
      place of a pte, or when reclaim unmaps a pte to insert the swap entry).
      
      swap_info_struct's vmalloc'ed swap_map is the array of these reference
      counts: but what happens when the unsigned short (or unsigned char since
      the preceding patch) is full? (and its high bit is kept for a cache flag)
      
      We then lose track of it, never freeing, leaving it in use until swapoff:
      at which point we _hope_ that a single pass will have found all instances,
      assume there are no more, and will lose user data if we're wrong.
      
      Swapping of KSM pages has not yet been enabled; but it is implemented,
      and makes it very easy for a user to overflow the maximum swap count:
      possible with ordinary process pages, but unlikely, even when pid_max
      has been raised from PID_MAX_DEFAULT.
      
      This patch implements swap count continuations: when the count overflows,
      a continuation page is allocated and linked to the original vmalloc'ed
      map page, and this used to hold the continuation counts for that entry
      and its neighbours.  These continuation pages are seldom referenced:
      the common paths all work on the original swap_map, only referring to
      a continuation page when the low "digit" of a count is incremented or
      decremented through SWAP_MAP_MAX.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      570a335b
    • H
      swap_info: swap_map of chars not shorts · 8d69aaee
      Hugh Dickins 提交于
      Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
      chars: it's still very unusual to reach a swap count of 126, and the
      next patch allows it to be extended indefinitely.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d69aaee
    • H
      swap_info: SWAP_HAS_CACHE cleanups · 253d553b
      Hugh Dickins 提交于
      Though swap_count() is useful, I'm finding that swap_has_cache() and
      encode_swapmap() obscure what happens in the swap_map entry, just at
      those points where I need to understand it.  Remove them, and pass
      more usable "usage" values to scan_swap_map(), swap_entry_free() and
      __swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      253d553b
    • H
      swap_info: include first_swap_extent · 9625a5f2
      Hugh Dickins 提交于
      Make better use of the space by folding first swap_extent into its
      swap_info_struct, instead of just the list_head: swap partitions need
      only that one, and for others it's used as a circular list anyway.
      
      [jirislaby@gmail.com: fix crash on double swapon]
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9625a5f2
    • H
      swap_info: change to array of pointers · efa90a98
      Hugh Dickins 提交于
      The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
      to reserve an array of about 30 of them in bss, when most people will
      want only one.  Change swap_info[] to an array of pointers.
      
      That does need a "type" field in the structure: pack it as a char with
      next type and short prio (aha, char is unsigned by default on PowerPC).
      Use the (admittedly peculiar) name "type" throughout for this index.
      
      /proc/swaps does not take swap_lock: I wouldn't want it to, but do take
      care with barriers when adding a new item to the array (never removed).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      efa90a98
    • H
      swap_info: private to swapfile.c · f29ad6a9
      Hugh Dickins 提交于
      The swap_info_struct is mostly private to mm/swapfile.c, with only
      one other in-tree user: get_swap_bio().  Adjust its interface to
      map_swap_page(), so that we can then remove get_swap_info_struct().
      
      But there is a popular user out-of-tree, TuxOnIce: so leave the
      declaration of swap_info_struct in linux/swap.h.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Nigel Cunningham <ncunningham@crca.org.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f29ad6a9
    • D
      mm: add gfp flags for NODEMASK_ALLOC slab allocations · bad44b5b
      David Rientjes 提交于
      Objects passed to NODEMASK_ALLOC() are relatively small in size and are
      backed by slab caches that are not of large order, traditionally never
      greater than PAGE_ALLOC_COSTLY_ORDER.
      
      Thus, using GFP_KERNEL for these allocations on large machines when
      CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
      the allocation attempt, each time invoking both direct reclaim or the oom
      killer.
      
      This is of particular interest when using NODEMASK_ALLOC() from a
      mempolicy context (either directly in mm/mempolicy.c or the mempolicy
      constrained hugetlb allocations) since the oom killer always kills current
      when allocations are constrained by mempolicies.  So for all present use
      cases in the kernel, current would end up being oom killed when direct
      reclaim fails.  That would allow the NODEMASK_ALLOC() to succeed but
      current would have sacrificed itself upon returning.
      
      This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
      CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
      All current use cases either directly from hugetlb code or indirectly via
      NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
      killer when the slab allocator needs to allocate additional pages.
      
      The side-effect of this change is that all current use cases of either
      NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
      when the allocation fails (never for CONFIG_NODES_SHIFT <= 8).  All
      current use cases were audited and do have appropriate error handling at
      this time.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bad44b5b
    • L
      hugetlb: offload per node attribute registrations · 39da08cb
      Lee Schermerhorn 提交于
      Offload the registration and unregistration of per node hstate sysfs
      attributes to a worker thread rather than attempt the
      allocation/attachment or detachment/freeing of the attributes in the
      context of the memory hotplug handler.
      
      I don't know that this is absolutely required, but the registration can
      sleep in allocations and other mem hot plug handlers do it this way.  If
      it turns out this is NOT required, we can drop this patch.
      
      N.B.,  Only tested build, boot, libhugetlbfs regression.
             i.e., no memory hotplug testing.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: NAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39da08cb
    • D
      mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlined · 8fe23e05
      David Rientjes 提交于
      When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
      there are no present pages left.
      
      In such a situation, kswapd must also be stopped since it has nothing left
      to do.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fe23e05
    • L
      hugetlb: add per node hstate attributes · 9a305230
      Lee Schermerhorn 提交于
      Add the per huge page size control/query attributes to the per node
      sysdevs:
      
      /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
      	nr_hugepages       - r/w
      	free_huge_pages    - r/o
      	surplus_huge_pages - r/o
      
      The patch attempts to re-use/share as much of the existing global hstate
      attribute initialization and handling, and the "nodes_allowed" constraint
      processing as possible.
      
      Calling set_max_huge_pages() with no node indicates a change to global
      hstate parameters.  In this case, any non-default task mempolicy will be
      used to generate the nodes_allowed mask.  A valid node id indicates an
      update to that node's hstate parameters, and the count argument specifies
      the target count for the specified node.  From this info, we compute the
      target global count for the hstate and construct a nodes_allowed node mask
      contain only the specified node.
      
      Setting the node specific nr_hugepages via the per node attribute
      effectively ignores any task mempolicy or cpuset constraints.
      
      With this patch:
      
      (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
      ./  ../  free_hugepages  nr_hugepages  surplus_hugepages
      
      Starting from:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     0
      Node 2 HugePages_Free:      0
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 0
      
      Allocate 16 persistent huge pages on node 2:
      (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
      
      [Note that this is equivalent to:
      	numactl -m 2 hugeadmin --pool-pages-min 2M:+16
      ]
      
      Yields:
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:    16
      Node 2 HugePages_Free:     16
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      vm.nr_hugepages = 16
      
      Global controls work as expected--reduce pool to 8 persistent huge pages:
      (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
      Node 0 HugePages_Total:     0
      Node 0 HugePages_Free:      0
      Node 0 HugePages_Surp:      0
      Node 1 HugePages_Total:     0
      Node 1 HugePages_Free:      0
      Node 1 HugePages_Surp:      0
      Node 2 HugePages_Total:     8
      Node 2 HugePages_Free:      8
      Node 2 HugePages_Surp:      0
      Node 3 HugePages_Total:     0
      Node 3 HugePages_Free:      0
      Node 3 HugePages_Surp:      0
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a305230
    • L
      hugetlb: add generic definition of NUMA_NO_NODE · 4e25b257
      Lee Schermerhorn 提交于
      Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
      to generic header 'linux/numa.h' for use in generic code.  NUMA_NO_NODE
      replaces bare '-1' where it's used in this series to indicate "no node id
      specified".  Ultimately, it can be used to replace the -1 elsewhere where
      it is used similarly.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e25b257
    • L
      hugetlb: derive huge pages nodes allowed from task mempolicy · 06808b08
      Lee Schermerhorn 提交于
      This patch derives a "nodes_allowed" node mask from the numa mempolicy of
      the task modifying the number of persistent huge pages to control the
      allocation, freeing and adjusting of surplus huge pages when the pool page
      count is modified via the new sysctl or sysfs attribute
      "nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:
      
      * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
        is produced.  This will cause the hugetlb subsystem to use
        node_online_map as the "nodes_allowed".  This preserves the
        behavior before this patch.
      * For "preferred" mempolicy, including explicit local allocation,
        a nodemask with the single preferred node will be produced.
        "local" policy will NOT track any internode migrations of the
        task adjusting nr_hugepages.
      * For "bind" and "interleave" policy, the mempolicy's nodemask
        will be used.
      * Other than to inform the construction of the nodes_allowed node
        mask, the actual mempolicy mode is ignored.  That is, all modes
        behave like interleave over the resulting nodes_allowed mask
        with no "fallback".
      
      See the updated documentation [next patch] for more information
      about the implications of this patch.
      
      Examples:
      
      Starting with:
      
      	Node 0 HugePages_Total:     0
      	Node 1 HugePages_Total:     0
      	Node 2 HugePages_Total:     0
      	Node 3 HugePages_Total:     0
      
      Default behavior [with or without this patch] balances persistent
      hugepage allocation across nodes [with sufficient contiguous memory]:
      
      	sysctl vm.nr_hugepages[_mempolicy]=32
      
      yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:     8
      	Node 3 HugePages_Total:     8
      
      Of course, we only have nr_hugepages_mempolicy with the patch,
      but with default mempolicy, nr_hugepages_mempolicy behaves the
      same as nr_hugepages.
      
      Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
      '--membind' because it allows multiple nodes to be specified
      and it's easy to type]--we can allocate huge pages on
      individual nodes or sets of nodes.  So, starting from the
      condition above, with 8 huge pages per node, add 8 more to
      node 2 using:
      
      	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
      
      This yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The incremental 8 huge pages were restricted to node 2 by the
      specified mempolicy.
      
      Similarly, we can use mempolicy to free persistent huge pages
      from specified nodes:
      
      	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
      
      yields:
      
      	Node 0 HugePages_Total:     4
      	Node 1 HugePages_Total:     4
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The 8 huge pages freed were balanced over nodes 0 and 1.
      
      [rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06808b08
    • L
      hugetlb: factor init_nodemask_of_node() · c1e6c8d0
      Lee Schermerhorn 提交于
      Factor init_nodemask_of_node() out of the nodemask_of_node() macro.
      
      This will be used to populate the huge pages "nodes_allowed" nodemask for
      a single node when basing nodes_allowed on a preferred/local mempolicy or
      when a persistent huge page pool page count is modified via a per node
      sysfs attribute.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c1e6c8d0
    • D
      nodemask: make NODEMASK_ALLOC more general · 4e7b8a6c
      David Rientjes 提交于
      This is a series of patches to provide control over the location of the
      allocation and freeing of persistent huge pages on a NUMA platform.
      Please consider for merging into mmotm.
      
      This series uses two mechanisms to constrain the nodes from which
      persistent huge pages are allocated: 1) the task NUMA mempolicy of the
      task modifying a new sysctl "nr_hugepages_mempolicy", based on a
      suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
      attributes have been added [in V4] to each node system device under:
      
      	/sys/devices/node/node[0-9]*/hugepages
      
      The per node attibutes allow direct assignment of a huge page count on a
      specific node, regardless of the task's mempolicy or cpuset constraints.
      
      This patch:
      
      NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary.
      It's perfectly reasonable to use this macro to allocate a nodemask_t,
      which is anonymous, either dynamically or on the stack depending on
      NODES_SHIFT.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e7b8a6c
  2. 15 12月, 2009 7 次提交
  3. 14 12月, 2009 8 次提交